High Flexibility Hybrid Architecture Real-Time Simulation Platform Based on Field-Programmable Gate Array (FPGA)

With the expansion of system scale and the reduction in simulation step size, the design of a power system real-time simulation platform faces many difficulties. The interactive operations of real-time simulation are phased and centralized in nature. This paper proposes selecting an appropriate simulation method for each sub-network according to the system operation requirements, and allowing the sub-network simulation method to change as those requirements change during the simulation. To support changing the sub-network simulation method during the simulation, a high flexibility hybrid architecture real-time simulation platform based on FPGA was designed. The main body of the architecture runs in the highly controllable mode of instruction flow and uses the flexibility of instructions to realize method changing. An algorithm modularity architecture is used as an auxiliary architecture to reduce the instruction cost and increase the computing power. Finally, the hybrid architecture real-time simulation platform was implemented on a Xilinx VC709 board (Xilinx Corporation, San Jose, CA, USA), and the verification results show that, for the same system scale, the hybrid architecture simulation platform combined with simulation method changing achieves a shorter simulation step and complex interactive operation.


Introduction
Climate change has become an important topic among global environmental issues, and climate risk indices are increasing [1]. China has proposed a dual carbon target: to reach the peak of carbon dioxide emissions by 2030 and achieve carbon neutrality by 2060. The power system is the hub of the energy chain and plays an important role in the emission chain [2]. In order to realize the optimal allocation of energy resources and achieve the dual carbon target, the large-scale AC/DC hybrid power system is the development trend of the future [3,4]. Real-time simulation plays an important role in verifying the control methods of hybrid power systems, ensuring the safe operation of devices and developing new power electronic equipment. With the expansion of the power grid interconnection scale and the commissioning of new power electronic equipment, the simulation step size becomes smaller and the simulation calculation burden becomes larger [5,6], which puts forward higher requirements for the performance of real-time simulation platforms.
With the continuous emergence of high-performance computing devices, there are many design schemes for real-time simulation platforms. The parallel computer architecture is the mainstream of commercial real-time simulation platforms, implemented by multi-core processors [7] or PC clusters [8]. The underlying hardware of these platforms still executes serially; in order to achieve large-scale system simulation, a large amount of underlying hardware must be deployed, which brings cost and communication problems [9]. Therefore, some underlying hardware with parallel computing capabilities, such as GPUs and FPGAs, is used as auxiliary hardware to form a heterogeneous architecture and undertake part of the computing tasks [10][11][12]. Heterogeneous architecture platforms are implemented mainly by network decoupling, and each processor is responsible for computing a certain number of subsystems [13]. Coarse-grained parallel techniques, time synchronization and data communication between processors become the main problems of these simulation platforms; it is difficult to make full use of processor performance, and simulation accuracy suffers. FPGA is a fully configurable device with a distributed memory structure and a pipeline structure, for which a specific architecture can be designed per application. These advantages make FPGA the main hardware participating in the real-time simulation of power systems.
The design architectures of FPGA-based electromagnetic transient real-time simulation platforms can be divided into two kinds: the algorithm modularity architecture (AMA) [14][15][16] and the instruction flow-driven architecture [17][18][19]. AMA establishes dedicated modules according to the algorithm form, and the modules are fixedly connected in the order of the algorithm process; under the guidance of the global control module, the dedicated modules are started. AMA has the advantages of simple control and high computing efficiency. However, it is difficult for it to deal with processes involving branches and complex calculations; in addition, dedicated modules are often idle during the calculation process, so the resources on the FPGA chip cannot be fully used. The feature of the instruction flow-driven architecture is to design a high reusability calculation unit based on the simulation method; the data addresses and the operation type of the calculation unit are given by instructions to realize the calculation. The instruction flow-driven architecture has strong flexibility and high utilization of computing units. However, the bandwidth requirement of the instructions is too high to read them from external memory, and storing the instructions on the FPGA consumes a lot of memory resources, which limits the scale of the simulation.
In order to simulate a larger system scale and improve the performance of a real-time simulation platform, this paper designs a highly flexible hybrid architecture real-time simulation platform. Section 2 introduces the SSN simulation algorithm, analyzes the characteristics of real-time simulation operation and proposes to change the simulation method and data in the simulation process to expand the simulation scale. Section 3 illustrates the limitations of changing methods and data in the simulation process of the existing architecture and gives the design method of hybrid architecture. Section 4 verifies the effectiveness of the hybrid architecture simulation platform.

SSN Method
The simulation method with high parallelism and low computation is beneficial to expand the scale of real-time simulation. The SSN method selects some nodes to divide the system into multiple sub-networks [20]. The state-space method is used to solve the sub-networks, and the node equation is used to solve the system nodes. The SSN method improves the computing parallelism of the system and balances the amount of computation between the state-space and node equations, which shows its applicability for real-time simulation. The basic form of a sub-network is shown in Figure 1.
The port voltage u(t) is taken as the input variable vector of the sub-network, and the port current i(t) is taken as the output variable vector of the sub-network; the state-space equation is written for the sub-network and discretized by the backward Euler method as follows [20]:

x(t) = A_k·x(t − ∆t) + B_k·w(t) + E_k·u(t)    (1)

i(t) = C_k·x(t − ∆t) + D_k·w(t) + F_k·u(t)    (2)

where x(t) and x(t − ∆t) are the state variable vectors at the current moment and the previous moment, w(t) is the internal independent current source vector of the sub-network at the current moment, and A_k, B_k, C_k, D_k, E_k, F_k are the coefficient matrices; the value of k is related to the running state of the sub-network. The automatic formulation methods for (1) and (2) can be found in [21]. Equation (1) is the state variable update formula, and Equation (2) is the Norton equivalent expression of the sub-network; the first two terms on the right side of the equation are combined to become the internal injection source of the sub-network port, i_s(t):

i_s(t) = C_k·x(t − ∆t) + D_k·w(t)    (3)

After the Norton equivalent circuit of each sub-network is calculated by Equation (3), the node voltage equation of the system can be constructed [20]:

G_ex·u_ex(t) = i_ex(t)    (4)

where G_ex is the system equivalent conductance matrix, u_ex(t) is the system node voltage vector, and i_ex(t) is the injection current vector of the system nodes. Equation (4) is still sparse and can be solved by the node elimination method; the state variables of the sub-networks are then updated by Equation (1) to complete one simulation step calculation.
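The step sequence described above (compute each sub-network's Norton injection, solve the global node equation, then update the states) can be sketched in Python. Every name below, and the placement of the coefficient matrices, is an illustrative assumption rather than the paper's implementation; a dense solve stands in for the node elimination method.

```python
import numpy as np

def ssn_step(subs, G_ex):
    """One SSN step: Norton equivalents, global node solve, state update.

    Each sub-network in `subs` is a dict holding the coefficient matrices
    selected for its current running state k, its state vector x, internal
    source w, and an index map into the global nodes. All names are
    illustrative assumptions.
    """
    i_ex = np.zeros(G_ex.shape[0])
    for sn in subs:
        # Equation (3): internal injection source built from history terms,
        # computable before the node voltages are known.
        sn["i_s"] = sn["C"] @ sn["x"] + sn["D"] @ sn["w"]
        i_ex[sn["nodes"]] += sn["i_s"]
    # Equation (4): system node voltage equation (dense solve for brevity;
    # the paper uses node elimination on the sparse system).
    u_ex = np.linalg.solve(G_ex, i_ex)
    for sn in subs:
        u = u_ex[sn["nodes"]]
        # Equation (1): state variable update using the solved port voltage.
        sn["x"] = sn["A"] @ sn["x"] + sn["B"] @ sn["w"] + sn["E"] @ u
    return u_ex
```

The pre-update `sn["x"]` plays the role of x(t − ∆t), so the injection source and the state update both read the previous-step state, matching the discretized equations.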
In the case of pre-stored A_k, B_k, C_k, D_k, E_k, F_k coefficient matrices, SSN has the advantages of less computation, a simple process and high parallelism. However, in real-time simulation, due to the requirements of hardware-in-the-loop (HIL) testing or operator training, there are many interactive operations in the simulation system, such as fault settings and parameter settings. The variable network structure and parameters make the memory requirement of the coefficient matrices increase rapidly, which limits the simulation scale of SSN on a real-time simulation platform.

Variable Detailed Sub-Network
Components that can interact exist in almost every sub-network of the system, but these interactions do not occur at the same time. In the process of real-time simulation, the interactive operation of the system is phased and centralized. During a continuous period of time, when fault tests and parameter settings are performed on several sub-networks, the other sub-networks hardly carry out any interactive operations. Therefore, at a certain simulation moment, it can be considered that the system has only a few sub-networks that can interact, called detailed sub-networks, while the other sub-networks cannot interact and are called simple sub-networks. The system that was originally represented entirely by detailed sub-networks turns into a system with a mixed representation of detailed and simple sub-networks. By changing which sub-networks are detailed during the simulation, each sub-network of the system can still realize interactive operation. This process is called the variable detailed sub-network.
After introducing the concepts of detailed and simple sub-networks, SSN is improved accordingly. Detailed sub-networks must support setting faults, loads and the operation status of equipment; to avoid consuming a lot of memory, the node analysis method, which is suitable for variable network structures and parameters, is used to solve these sub-networks. A simple sub-network only has state changes in switching elements and nonlinear elements, so the memory requirement of the state-space method is acceptable, and the state-space method is still used to solve it. It should be noted that the number of detailed sub-networks at a certain simulation moment is limited and accounts for a small proportion of the total number of sub-networks, so the improved SSN method retains the characteristic of a low calculation burden, balances the operation requirements against the memory requirements and can simulate a larger-scale system.
For the real-time simulation platform, the application difficulty lies in the variable detailed sub-network. When the detailed sub-network is changed, the involved sub-networks need to change their simulation method, as shown in Figure 2. Changing the simulation method also brings corresponding data preparation requirements: the state variables of the two methods can be inherited, but the coefficients need to be changed. Therefore, the real-time simulation platform should be able to change the sub-network simulation method and coefficients during the simulation process.
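The method change described above can be sketched as follows. The `coef_store` object and both of its helper methods are hypothetical names introduced for illustration; the point is only that the state vector survives the switch while the solver tag and coefficient data are replaced.

```python
# Illustrative sketch of changing a sub-network's simulation method during
# the run: the state vector is inherited, only the coefficient data and the
# solver tag change. All names are assumptions for illustration.
def change_method(sn, make_detailed, coef_store):
    if make_detailed:
        sn["method"] = "node-analysis"
        # Node analysis tolerates variable structure/parameters, so only a
        # local conductance description is needed (hypothetical helper).
        sn["G_local"] = coef_store.build_conductance(sn)
    else:
        sn["method"] = "state-space"
        # Reload the pre-stored coefficient matrices for the current state k.
        sn.update(coef_store.state_space_coefs(sn["k"]))
    # sn["x"] is deliberately left untouched: the state variables of the
    # two methods can be inherited across the change.
    return sn
```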

Hybrid Architecture Design Analysis
Changing the simulation method can be considered as a branch selection of the simulation program. The dedicated calculation units in the AMA are fixedly connected; the flexibility of the architecture is poor, and it is difficult to implement branch selection during the simulation. The instruction flow-driven architecture works under instruction control, so changing the simulation method can be realized by replacing instructions, which is easy to implement. Since the instruction RAM is read throughout the simulation, a ping-pong operation is used to complete the instruction replacement, avoiding any impact of the replacement process on the operation of the simulation platform. The simulation method replacement process of the instruction flow-driven architecture is shown in Figure 3. However, the ping-pong operation consumes twice as much memory, which is unacceptable for an instruction flow-driven architecture with high instruction memory consumption.
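The ping-pong replacement can be sketched as a two-bank memory: the simulator reads the active bank while the host rewrites the idle bank, and the banks swap at a simulation-step boundary. Class and method names are illustrative assumptions.

```python
# Minimal sketch of ping-pong instruction replacement. The doubled memory
# footprint (two full banks) is exactly the cost the text objects to.
class PingPongRAM:
    def __init__(self, depth):
        self.banks = [[None] * depth, [None] * depth]
        self.active = 0                      # bank the simulation reads

    def read(self, addr):
        return self.banks[self.active][addr]

    def load_idle(self, program):
        """Host downloads the new method's instructions into the idle bank,
        without disturbing the running simulation step."""
        idle = self.banks[1 - self.active]
        idle[:len(program)] = program

    def swap(self):
        """Called at the end of a simulation step: the new instruction
        stream takes effect with no lost read cycles."""
        self.active = 1 - self.active
```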


After the compiling software compiles the simulation method into instructions and downloads them to the FPGA, the instruction-driven architecture reads the instructions and decodes them after the start of the simulation step, and sends them to the high reusability calculation unit to complete the expression operations until the end of the simulation step; it then continues with the same operations for the next simulation step. The fine-grained instructions are at the expression level, which makes it convenient to discover the parallelism of the algorithm and improves flexibility. However, converting all arithmetic expressions into corresponding implementation instructions brings high instruction memory consumption. If there is a correlation between some arithmetic expressions, this correlation can be used to design a dedicated calculation unit that carries out the expressions autonomously; the instruction then only controls the startup of the dedicated computing unit, so the instruction cost can be reduced. This process is the introduction of AMA. The introduction of AMA needs to meet the following conditions:

•	The correlation arithmetic expressions shall have the same operation type to simplify the design of special computing units;
•	The correlation arithmetic expressions shall occupy a certain proportion of the simulation method, to avoid reducing resource utilization due to idle dedicated calculation units after the introduction of AMA.
Correlation arithmetic expressions that meet the above two conditions can be designed with the algorithm modularity architecture. Then, the AMA is used as an auxiliary computing unit and introduced into the instruction flow-driven architecture to form a hybrid architecture. The hybrid architecture retains the flexibility of the instruction-driven architecture and uses the AMA to take charge of part of the calculation, reducing the instruction cost and increasing the computing power, so that the variable detailed sub-network SSN method can be applied.

Method Task Division and Analysis
The first problem to be solved in a hybrid architecture is to find the correlation arithmetic expressions that meet the introduction conditions of AMA. By observing Equations (1) and (3), it can be found that, due to the simplification of the solution process by the state-space method, the Norton equivalent formula and the state variable update formula of the simple sub-network are in the form of matrix-vector multiplication. Since most of the sub-networks in the system are simple sub-networks, Equations (1) and (3) account for a certain proportion of the simulation algorithm. Therefore, AMA can be introduced.
To analyze the dependence of Equations (1) and (3) on the other calculation formulas in the method, Equation (1) is divided into two parts:

x_int(t) = A_k·x(t − ∆t) + B_k·w(t)    (5)

x(t) = x_int(t) + E_k·u(t)    (6)

Equation (5) does not depend on the solution value of the system node voltage. Combined with the framework in Figure 2, the variable detailed sub-network SSN method can be divided into six tasks, and its dependencies are shown in Figure 4.
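A minimal sketch of the two-part update follows. The split computes the node-voltage-independent part first, so it can run before (or in parallel with) the node solve; the exact placement of the coefficient matrices is an assumption for illustration.

```python
import numpy as np

# Two-part state update: x_int(t) collects every term that does not need the
# node-voltage solution; the update is completed once u(t) is known.
def x_internal(A, B, x_prev, w):
    return A @ x_prev + B @ w          # computable before the node solve

def x_update(x_int, E, u):
    return x_int + E @ u               # completed after u(t) is solved
```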

The node analysis method used in tasks 1 and 3 is strongly related to the model and process, and involves many types of expressions and complex data addressing; task 2 is solved by the node elimination method, its data are dependent on the calculation process, and its control is complicated. These tasks are suitable for instruction control to achieve high resource utilization and complex calculations. From the observation of Figure 4, it can be found that tasks 1, 2, 3 and tasks 4, 5, 6 can be calculated in parallel, respectively, which means that the idle rate of computing units in the hybrid architecture is low. Therefore, the introduced AMA can be designed around tasks 4, 5 and 6.
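One way a scheduler might encode this task division is a small table mapping each task to its computing unit; the unit mapping and descriptions follow the text, while the table format itself is an assumption for illustration.

```python
# Six-task division of the variable detailed sub-network SSN method:
# tasks 1-3 go to the instruction-controlled PEs, tasks 4-6 to the AMA,
# and the two groups can proceed concurrently.
TASKS = {
    1: {"unit": "PE",  "desc": "detailed sub-network node analysis"},
    2: {"unit": "PE",  "desc": "system node equation (node elimination)"},
    3: {"unit": "PE",  "desc": "detailed sub-network solution/back-substitution"},
    4: {"unit": "AMA", "desc": "i_s(t) injection source (matrix-vector)"},
    5: {"unit": "AMA", "desc": "x_int(t) update (matrix-vector)"},
    6: {"unit": "AMA", "desc": "x(t) completion (matrix-vector)"},
}
PE_TASKS = [t for t, v in TASKS.items() if v["unit"] == "PE"]
AMA_TASKS = [t for t, v in TASKS.items() if v["unit"] == "AMA"]
```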

Algorithm Modularization Architecture Design
The algorithm modularization architecture aims to complete the autonomous operation of tasks 4, 5 and 6 after receiving the start instruction. Before designing the startup instructions and hardware modules of the architecture, it is necessary to determine the autonomous computing process. The task form is unified as follows: the three types of calculation tasks can all be expressed as the matrix-vector multiplication

R = H_k × s    (7)

The purpose of a unified task form is to establish the same autonomous computing process. The autonomous operation process defined for matrix-vector multiplication is shown in Figure 5. The matrix-vector multiplication is decomposed into several row operation subtasks, which are highly parallel. In order to speed up the matrix-vector multiplication solution, multiple row operations can be solved in parallel. However, when the number of remaining row operations is insufficient at the end stage, invalid operations need to be supplemented. The matrix dimension is generally a multiple of three due to the three-phase system; in order to reduce the number of invalid operations, three row operations are selected for parallel calculation.
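The row decomposition above can be sketched as a blocked matrix-vector product that processes three rows at a time and pads the last group with invalid (zero) rows, mirroring the "supplemented invalid operations" in the text. This is a pure-Python illustration, not HDL.

```python
import numpy as np

def matvec_rows_of_3(H, s):
    """R = H @ s, computed three rows at a time as in the AMA sketch."""
    m, n = H.shape
    pad = (-m) % 3                       # invalid rows in the final group
    Hp = np.vstack([H, np.zeros((pad, n))])
    R = np.empty(m + pad)
    for r0 in range(0, m + pad, 3):      # three parallel row operations
        R[r0:r0 + 3] = Hp[r0:r0 + 3] @ s
    return R[:m]                         # discard the invalid results
```

For three-phase systems m is typically already a multiple of three, so `pad` is usually zero and no work is wasted.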
Figure 5. Autonomous operation process of matrix-vector multiplication.
According to the autonomous operation process, the data of the m × n coefficient matrix H_k are read sequentially. The multiplied vector s is read sequentially within a row operation and cyclically between row operations. If the data in Equation (7) can be stored in sequence according to the operation flow of Figure 5, the data search in the autonomous operation of the matrix-vector multiplication task can be realized by providing several first addresses at startup.
The data storage and addressing process of Equation (7) is shown in Figure 6a. The coefficient matrix is only used inside the algorithm modularization architecture, so it is stored in the local RAM; because of the parallel calculation of three row operations, the local RAM is set as vector memory, read and written in SIMD mode. Most multiplied vectors need to communicate with the outside, so they are stored in the shared RAM; since the multiplied vector of each row operation is the same, the shared RAM is set as scalar memory. The autonomous operation of the i_s(t) and x_int(t) tasks can be completed by providing the coefficient matrix first address MA, the multiplied vector first address MB, the output first address MY, the row operation length m and the row operation number n, as shown in Figure 6b,c. The coefficients of the x(t) task cannot be continuously stored in the local RAM, because the address of x_int is fixed in the local RAM while the address of the E_k coefficient matrix changes with the sub-network operating state k; an additional x_int address MX is required to read the coefficient matrix, as shown in Figure 6d.
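The startup-word addressing can be sketched with flat lists standing in for the local and shared RAMs: one start instruction supplies the first addresses MA, MB and MY plus the row length m and row count n, and the unit then walks both memories sequentially. The address names follow the text; everything else is an assumption.

```python
def ama_matvec(local_ram, shared_ram, MA, MB, MY, m, n):
    """Autonomous matrix-vector task driven only by first addresses.

    local_ram  : coefficient memory, rows stored back to back from MA
    shared_ram : multiplied vector from MB; results written from MY
    """
    for row in range(n):
        acc = 0.0
        for col in range(m):
            coef = local_ram[MA + row * m + col]   # sequential coefficient read
            vec = shared_ram[MB + col]             # re-read cyclically per row
            acc += coef * vec
        shared_ram[MY + row] = acc
```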
From the above analysis process, the AMA is designed as shown in Figure 7. The OP flag in the startup instruction is used to distinguish whether MX is used for coefficient matrix addressing. The address generation unit generates data addresses in cooperation with the row and column counters, and the read-write controller completes the data read-write operations of the memory and the calculation unit ports according to the addresses. The numbers in the figure are pipeline stages; the data in the floating-point arithmetic unit are double-precision floating-point numbers, and the accumulation channel converts the double-precision floating-point numbers into fixed-point numbers to reduce the accumulation pipeline delay. In order to reduce the loss of accuracy, the fixed-point number is as wide as possible; a 140 (40.100) bit fixed-point number is used in this paper.
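The idea behind the accumulation channel can be illustrated numerically: each double-precision product is converted to a wide fixed-point value (here with 100 fractional bits, as in the 40.100 format), so the running sum becomes a plain integer addition with no long floating-point-adder pipeline. Python's arbitrary-precision integers stand in for the 140-bit datapath; this is a behavioral sketch, not the hardware.

```python
FRAC_BITS = 100  # fractional bits of the 40.100 fixed-point format

def to_fixed(x: float) -> int:
    # Convert a double to wide fixed point (sketch; hardware would remap
    # the mantissa and exponent fields directly).
    return int(round(x * (1 << FRAC_BITS)))

def fixed_accumulate(products):
    acc = 0
    for p in products:                 # integer add: single-cycle feedback
        acc += to_fixed(p)
    return acc / (1 << FRAC_BITS)      # convert the sum back to floating point
```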

Hybrid Architecture Design
After introducing the AMA into the instruction flow-driven architecture, the hybrid architecture is as shown in Figure 8. Multiple PEs and AMAs can be designed on the FPGA, and their number is limited by the FPGA resources.

PE is a high reusability calculation unit, which needs to implement all formula types in tasks 1, 2 and 3. A statistical analysis is carried out for these task formula types; the formulas with high frequency and their descriptions are shown in Table 1.

Table 1. Task formula type and description.

Formula Type | Description
Y = ∑A | Injection current source calculation; formation of the system node voltage equation.
| Historical current source calculation; back-substitution to calculate the node voltage; branch current update.
| Branch voltage calculation.
According to the formula types in Table 1, the PE structure is designed as shown in Figure 9. The numbers in the figure are pipeline stages; A, B, C and D are calculation input channels, and the Y channel outputs the result of the main formula. Delays are inserted on each input channel so that the number of pipeline stages to the Y channel output is the same under different data flow directions, ensuring that no pipeline conflict occurs and the flow is never interrupted. The Z channel completes special operations and data transmission. Special operations include logarithmic operations, exponential operations and other operations that occur rarely and have long pipelines.
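The effect of balancing the channel delays can be sketched as follows. The stage counts in the table are illustrative placeholders, not the PE's actual latencies; what matters is that every path is padded to the same depth, so results retire in issue order.

```python
# Each path through the PE has a fixed latency to the Y output; padding
# registers equalize them, so operations issued in successive cycles
# retire in successive cycles regardless of which channels feed them.
RAW_LATENCY = {"A*B": 4, "A*B+C": 7, "A+B": 3}   # illustrative stage counts
Y_LATENCY = max(RAW_LATENCY.values())            # pad every path to the max

def retire_cycle(issue_cycle: int) -> int:
    """With balanced paths, retirement cycle depends only on issue cycle."""
    return issue_cycle + Y_LATENCY
```

Without the padding, an `A+B` issued one cycle after an `A*B+C` would reach the Y port in the same cycle and collide.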

PE instructions can be divided into two parts: the port data address instructions and the control instructions. The port address instructions are sent to the read-write controller; the control instructions select the operation type of the arithmetic unit.

Ping-Pong Operation
The instruction storage is divided into two parts: the instruction storage of the PE and the startup instruction storage of the AMA. The depth of the PE instruction RAM is large and is related to the simulation step size ∆t and the FPGA operating frequency f; the storage depth is 10,000 when the simulation step is 50 µs and the operating frequency is 200 MHz.
The AMA instruction RAM stores only the startup instructions. Since the transmission frequency of the startup instructions is low, its storage depth is set to 512 according to the minimum depth of the on-chip RAM; this also reflects the reduction in instruction cost. The coefficient memory is likewise divided into two parts: the PE stores the coefficients of the detailed sub-network, and the AMA stores the coefficients of the simpler sub-networks. When the detailed sub-network is changed, the instruction RAM and the PE Coe RAM perform a ping-pong operation at the end of the simulation step. The AMA local RAM has a low proportion of changing coefficients, so the coefficients of all sub-networks can be stored there, avoiding the ping-pong operation. When a parameter setting operation occurs on the detailed sub-network, the coefficient modification is also completed through the ping-pong operation of the PE Coe RAM. At the same time, the corresponding coefficients in the AMA are updated through the download channel so that the correct coefficients are still obtained when that sub-network later becomes a simple sub-network.
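The ping-pong operation above is standard double buffering; a minimal sketch follows. The class name and single-port interface are illustrative, assuming one bank is visible to the PE while the host fills the other.

```python
class PingPongRAM:
    """Double-buffered coefficient store: the simulation reads one bank
    while the host writes the shadow bank; banks swap at the boundary
    of a simulation step, so coefficients change atomically."""
    def __init__(self, depth: int):
        self.banks = [[0.0] * depth, [0.0] * depth]
        self.active = 0                      # bank visible to the PE

    def read(self, addr: int) -> float:      # PE-side access
        return self.banks[self.active][addr]

    def load(self, addr: int, value: float): # host writes the shadow bank
        self.banks[1 - self.active][addr] = value

    def swap(self):                          # at the end of a simulation step
        self.active = 1 - self.active
```

Because the PE never reads the bank being written, a sub-network or parameter change takes effect in a single step boundary with no mid-step inconsistency.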

Indexing Unit Design
Most of the data addresses in the instruction memory correspond directly to the position of the data in memory and can be addressed directly. However, some data change during operation, such as the calculation coefficients of nonlinear components. A piecewise linearization strategy is adopted in this paper, so these coefficients change with the network state during operation. In the detailed sub-network, the change appears in the coefficients of the node analysis method; in a simple sub-network, the values of the entire coefficient matrix change. These coefficients are stored in their respective coefficient RAMs. The instructions in the instruction memory must therefore be indexed according to the network operating state to find the correct coefficient address; this is an indirect addressing process.
The instructions requiring indirect addressing are the port address instructions of the PE and the MA coefficient matrix address instructions of the AMA. The indirect addressing circuit is shown in Figure 10. The indirectly addressed coefficients are stored at a fixed offset, and the influence word is the current state of the component. Indirect addressing can be completed by providing the first address, the decoding method and the influence word. These parameters are stored in the index guidance RAM; when indirect addressing is required according to the address range of a port instruction, the port address is taken as the index value of the index guidance RAM, and the parameters are taken out to complete the indexing. The index guidance RAM of the PE performs the ping-pong operation together with the Coe RAM of the PE.
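The fixed-offset decode can be sketched in a few lines. The function name, the stride parameter and the example addresses are hypothetical; the paper's index guidance RAM also stores a decoding method, which is omitted here for brevity.

```python
def index_coeff_addr(first_addr: int, stride: int, influence_word: int) -> int:
    """Hypothetical decode: coefficient blocks for each component state are
    stored at a fixed offset from the first address, so the current state
    (the influence word) selects the block directly."""
    return first_addr + influence_word * stride

# A nonlinear component with 4 linearized segments, 8 coefficients each:
# state 0 reads from 0x100, state 3 from 0x118.
```

Because the decode is a single multiply-add, it fits in the address generation path without stalling the instruction stream.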

Architecture Design Scale
The Xilinx VC709 development board is selected to build the hybrid architecture real-time simulation platform. The FPGA chip used is the XC7VX690T-2FFG1761, which contains 433,200 slice LUTs, 3600 DSP slices and 1470 36 Kb dual-port BRAMs. Taking 200 MHz as the timing constraint, and in order to compare the differences between structures, three designs are implemented: Architecture 1 uses the PE in Figure 9 to form an instruction flow-driven architecture without instruction ping-pong, Architecture 2 adds instruction ping-pong to the instruction flow-driven architecture composed of PEs, and the Hybrid Architecture of Figure 8 combines PEs and AMAs. The maximum design scale and resource consumption are shown in Table 2. The parallel computing power statistics are based on the number of input ports of the PEs and AMAs. As shown in Table 2, the BRAM utilization at the maximum scale of the three architectures is similar. BRAM consumes a large amount of wiring resources; under the 200 MHz timing constraint the wiring requirement is high, so the highly utilized BRAM limits the design scale of the architecture. Architecture 1 has stronger parallel computing power, but the method cannot be switched during the simulation process. It can also be seen that a large share of the architectures' BRAM consumption is instruction RAM, so after the instruction RAM ping-pong operation Architecture 2 is limited in design scale and loses parallel computing power. By introducing the AMA with its low instruction RAM cost, the hybrid architecture can both switch methods and improve the parallel computing power.

Method Validation and Architecture Validation
The four-machine AC/DC hybrid system shown in Figure 11 is selected as the simulation example. The rated voltage of the AC bus on the rectifier side is 345 kV, and that of the AC bus on the inverter side is 230 kV. Damping filters and capacitive reactive power compensation equipment are connected to the converter buses on both sides. The structure and parameters of the DC part are designed according to the CIGRE HVDC benchmark system [22,23], and 12-pulse thyristor converters are adopted for the rectifier and the inverter. The rectifier side adopts constant current control, and the inverter side adopts constant current control, low-voltage current-limiting control and extinction angle control. Faults can be set anywhere in the system, such as line short-circuit faults and thyristor faults.

Figure 11. The four-machine AC/DC hybrid system.

The parameters of the equipment in the AC system are shown in Table 3. This paper does not discuss the division of sub-networks and follows the basic division principle: each sub-network should be as large as possible while keeping the storage required by the state-space solution within a reasonable range. The synchronous generators involve coordinate transformation and do not participate in sub-network division. The sub-networks are divided as shown in Figure 11. Table 4 shows the number of expressions and the coefficient storage of each sub-network as a detailed sub-network and as a simple sub-network. It can be seen from Table 4 that the state-space method adopted by the sub-networks requires less computation, and a simple sub-network greatly reduces the number of state-space equation coefficient sets that must be stored for fault states. Therefore, the variable detail sub-network SSN method can speed up the network solution process and reduce the storage demand.
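The per-step computation of a simple sub-network reduces to the discrete state-space update, which can be written as a short sketch. The function name and the identity-matrix example are illustrative; A_k and B_k stand for the precomputed coefficient matrices selected by the sub-network operating state k.

```python
import numpy as np

def ssn_step(A_k, B_k, x, u):
    """One discrete state-space update for a sub-network in operating
    state k: x[t+dt] = A_k @ x[t] + B_k @ u[t]. A_k and B_k are the
    precomputed coefficients for the current switch/fault state, so the
    per-step work is two matrix-vector products."""
    return A_k @ x + B_k @ u
```

Storing one (A_k, B_k) pair per operating state is exactly the coefficient storage counted in Table 4, which is why reducing the number of stored fault states shrinks the memory demand.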
Architecture 1 adopts the node analysis method for the whole system, while Architecture 2 and the hybrid architecture adopt the variable detail sub-network SSN method. When the 8# sub-network is selected as the detailed sub-network and the others as simple sub-networks, the completion times of the simulation example on the three architectures are as shown in Table 5. Although Architecture 2 reduces the amount of computation through the variable detail sub-network SSN method, its loss of parallel computing power increases the simulation time. By introducing the AMA, the hybrid architecture increases the computing power and shares some of the computing tasks, which reduces the simulation completion time.

Simulation Results
In order to verify the accuracy of the hybrid architecture real-time simulation platform based on FPGA, the simulation results are compared with PSCAD simulation results.
The 7# sub-network is set as the detailed sub-network at the beginning of the simulation. At t = 0.2 s, a three-phase metallic grounding fault is set on the converter bus at the inverter side and cleared after 0.2 s. Figure 12 shows the waveforms of the AC current at the inverter side in FPGA and PSCAD. At t = 5 s, the 5# sub-network is set as the detailed sub-network and 7# is changed to a simple sub-network. At t = 5.2 s, the thyristor trigger pulses of the inverter-side converter are set to be lost, and the fault is cleared after 0.2 s. Figure 13 shows the waveforms of the AC current at the inverter side in FPGA and PSCAD. Figures 12 and 13 verify the correctness of the simulation platform designed in this paper. The main source of error of the simulation platform is the accuracy loss in the floating-point to fixed-point conversion in the AMA.
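The conversion step named as the error source can be sketched with Python integers modeling the wide fixed-point word. The helper names are illustrative, assuming the 140-bit (40.100) format described for the AMA accumulation channel.

```python
def to_fixed(x: float, frac_bits: int = 100) -> int:
    """Convert a double to a (40.100) fixed-point value; a Python int
    models the 140-bit word. With 100 fractional bits, every double in
    the working range converts exactly."""
    return round(x * (1 << frac_bits))

def from_fixed(v: int, frac_bits: int = 100) -> float:
    """Convert back to a double; this rounding is where accuracy is lost."""
    return v / (1 << frac_bits)

# Accumulating in the wide fixed-point word replaces the long
# floating-point adder pipeline with integer addition; error enters
# only when the sum is converted back to a double.
acc = sum(to_fixed(x) for x in [0.1, 0.2, 0.3])
result = from_fixed(acc)
```

The 40 integer bits bound the accumulator range; overflow beyond 2^39 in magnitude would wrap, so the format must be sized to the worst-case sum.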

Conclusions
This paper proposed the variable detail sub-network SSN method. Changing the method during the simulation not only balances the amounts of calculation and storage but also allows complex test processes to be completed during real-time simulation. In order to provide the flexibility the method requires, a hybrid architecture real-time simulation platform based on FPGA was designed. Compared with the instruction flow-driven architecture, it improves the computing power while maintaining flexibility. The simulation platform is well suited to simulation systems with large scale and complex test requirements; however, for a simulation system with low test requirements, it is difficult to fully utilize the performance of the platform.
An examination of the instruction cost of the hybrid architecture shows that PE instruction storage still occupies many BRAM resources, which limits the architecture scale that can be designed in the FPGA. The reduction in PE instruction storage can be studied from instruction similarity: for example, the node analysis solution process is the same for identical equipment, and the solution of the system node voltage equation can be decomposed at a fine granularity, so instructions can be compressed through their similarity. This will be carried out in future research work.
