A Review of Parallel Heterogeneous Computing Algorithms in Power Systems

Abstract: The power system expansion and the integration of technologies, such as renewable generation, distributed generation, high voltage direct current, and energy storage, have made power system simulation challenging in multiple applications. The current computing platforms employed for planning, operation, studies, visualization, and the analysis of power systems are reaching their operational limit, since the complexity and size of modern power systems result in long simulation times and high computational demand. Time reductions in simulation and analysis lead to the better and further optimized performance of power systems. Heterogeneous computing—where different processing units interact—has shown that power system applications can take advantage of the unique strengths of each type of processing unit, such as central processing units, graphics processing units, and field-programmable gate arrays, interacting in on-premise or cloud environments. Parallel Heterogeneous Computing appears as an alternative to reduce simulation times by optimizing multitask execution in parallel computing architectures with different processing units working together. This paper presents a review of Parallel Heterogeneous Computing techniques, how these techniques have been applied in a wide variety of power system applications, how they help reduce the computational time of modern power system simulation and analysis, and the current tendency regarding each application. We present a wide variety of approaches classified by technique and application.


Introduction
The growth in energy dependence has led to increased complexity in the planning and operation of power systems [1]. In some scenarios, systems are observed operating under stressed conditions that, in the event of failures, can result in blackouts or cascading events with large economic losses. For this reason, multiple studies have been carried out to evaluate the possible conditions that may appear on the network and to avoid undesired power losses. As a result of yearly studies, operational criteria are generated to maintain system security under various conditions.
Operation considerations and study area limitations are not commonly updated during operation and short-term planning, because of the significant processing effort and decision-making requirements involved. In order to reduce execution times and to guarantee the optimal and secure operation of the network, different strategies have been implemented; some of the approaches reduce the size of a problem's formulation, the number of impacted areas during the analysis, the number of decision variables, etc. Other options use the potential of current technologies, such as advanced computational structures.
The application of advanced computational structures during the operation and short-term planning of power systems has been integrated more frequently during the last fifteen years. GPU implementations lead the works of PHC techniques in power systems, since 80% of the articles published during the last five years correspond to GPU implementations.
This document is organized as follows. Section 2 presents an overview of the PHC techniques applied to power system studies and analysis. Section 3 organizes all works by technology and power system applications. Section 4 presents the conclusions, and Section 5 outlines the future work of PHC in power system applications.

Parallel Heterogeneous Computing
As defined in the previous section, PHC refers to systems that include processors of different types, such as CPUs, GPUs, and FPGAs, interacting in on-premise or cloud environments, where the system architecture allows the processors to execute multiple tasks in parallel. The authors in ref. [8] categorized parallel computing based on how instructions and data are handled, into Single-Instruction Single-Data (SISD), Single-Instruction Multiple-Data (SIMD), and Multiple-Instruction Multiple-Data (MIMD). SISD refers to serial computing, where a single instruction is executed on a single data stream at a time.
SIMD refers to the processing of a single instruction concurrently on multiple data elements. Finally, MIMD defines a computing strategy with multiple processing units, each with its own instructions and data. This parallel computing technique typically employs the Message Passing Interface (MPI) to communicate between the processing units. MIMD systems are also known as multicomputers. This section presents the different processing units used in PHC implementations in on-premise and cloud environments for power system applications.
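As a minimal illustration of the SISD/SIMD distinction, the sketch below contrasts an element-by-element loop with a vectorized NumPy operation that applies one instruction across all data elements; this example is ours, not drawn from the reviewed works.

```python
import numpy as np

# SISD-style: one instruction handles one data element per step.
def scale_sisd(values, factor):
    out = []
    for v in values:                      # each element in its own iteration
        out.append(v * factor)
    return out

# SIMD-style: one vectorized instruction applied to all elements at once.
def scale_simd(values, factor):
    return np.asarray(values) * factor    # vectorized multiply

v = [1.0, 2.0, 3.0, 4.0]
assert scale_sisd(v, 2.0) == list(scale_simd(v, 2.0))
```

On a vector processor, the second form maps directly onto hardware lanes, which is the property GPU implementations exploit throughout the works reviewed below.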

Central Processing Unit
Multiple CPUs configured in workstations stand as the conventional PHC technique, since they combine different processing units that perform tasks simultaneously. This technique requires a communication network to connect all nodes of the system and is a clear example of the MIMD parallel technique; alternatively, a MIMD system can be seen as a set of SISD implementations in independent workstations connected by a communication network. As defined in ref. [8], the CPU is the fastest component of a computer and consists of a group of registers that contain the instructions to be executed. Those instructions are sent to the hardware to perform tasks, such as fetching, storing, or operating on data. Registers are usually different for instructions, addresses, and operands. CPUs can also contain components to accelerate the processing of floating-point numbers.

Fog and Cloud Computing
PHC techniques can run on-premise or in cloud environments. Most of the CPU workstations run in on-premise environments, where researchers manage not only the group of computers but also the communication network that exchanges instructions and data between the system nodes. Regarding cloud environments, fog and cloud computing have been used as distributed and centralized infrastructures that host and connect the processing units of PHC implementations. Running in cloud environments, nodes with CPUs and GPUs are configured to execute a wide variety of tasks. Multiple nodes represent the MIMD parallel computing technique.
Cloud computing refers to the on-demand provisioning of system services, such as software, servers, data storage with the corresponding databases, and communication networks, accessed remotely through the internet. Fog computing complements cloud computing by allowing cloud applications to use on-premise resources to perform part of the application's computation, storage, and communication. In other words, fog computing is an additional layer of cloud computing where on-premise resources are used. Within the services provided by this environment, cloud computing can integrate a wide variety of HC technologies and define schedulers to execute tasks in parallel.

Field-Programmable Gate Array
PHC systems can combine not only CPUs and GPUs but also integrated circuits, such as FPGAs, whose high-performance programmable logic approaches that of Application-Specific Integrated Circuit (ASIC) designs. FPGAs are devices based on a matrix of logic blocks connected through programmable interconnections. Unlike ASICs, FPGAs can be reprogrammed after manufacturing to fit any application. This flexibility helps to parallelize the tasks of power system applications in HC systems with the MIMD approach. These devices can be integrated into on-premise or cloud environments as one of the provided services.

Graphics Processing Unit
The first decades in microprocessor advancements were focused mainly on serial workloads [11,12]; whereas more recently, CPUs have evolved to provide hardware that seizes parallelism through pipelining and even multi-core architectures. However, as pointed out by ref. [11], most of the circuitry on a CPU is devoted not to arithmetic, logic, and direct parallelism execution but to control complexity, such as caches, instruction decoders, and branch predictors, among others.
Instead, better results can be achieved with another paradigm, namely that of the GPU, which was developed using the general concept of vector processors. Vector processors allow the hardware to operate on registers that hold multiple separate values [12]. Due in part to the demand from gaming and other consumer applications over the last two decades, manufacturers such as NVIDIA and AMD now provide affordable GPU computing products on the market [11], and tools such as CUDA make scientific computing with GPUs an increasingly easier task.
GPUs are powerful co-processors designed to exploit data parallelism based on the SIMD approach. Modern GPU architectures embed hundreds to thousands of computing cores along with dedicated units and their own memory hierarchy of memory banks and caches. These elements are organized into so-called Streaming Multiprocessors (SMs), each containing a warp/dispatch scheduler (more generally, a thread scheduler) that distributes threads to be executed on different Scalar Processors (SPs), which can be either Integer SPs or Floating-Point SPs to manage integer and float operations, respectively.
In a HC GPU structure, the host CPU system launches tasks on a device GPU system. This commonly involves data exchange (copy operations) to and from the device's global memory (normally the RAM on the graphics card). There is also a thread-block scheduler that distributes blocks of threads to different SMs for their execution.

Table 1 presents the review summary, specifying the classification and distribution of all the articles based on the power system application and the PHC technology. The summary also organizes the applications by number of references, showing that power flow analysis is the application where researchers have most often parallelized algorithms using heterogeneous computing. Figure 1 shows how PHC technologies have been used during the last three decades. In the beginning, CPU clusters, FPGAs, and fog and cloud computing were the most popular PHC alternatives. During the last ten years, vector processors, such as the GPU, have seen significant growth in power system applications: GPU implementations represent 80% of all the articles included in the review and published during the last five years. Figure 2 shows how the GPU has been used in power system studies and applications. This technology uses the SIMD technique, and applications take advantage of this quality to vectorize parts of algorithms and studies.

Power Flow Analysis
As shown in Figure 2, power flow analysis is the application where GPU has been the most used [6,…]. Researchers have attempted to reduce the power flow convergence time by testing multiple numerical algorithms for the solution of linear equation systems, parallelizing the algorithm steps in the GPU. Depending on the iterative algorithm used to find the power flow solution, tasks such as building the admittance and Jacobian matrices of large-scale power systems have been parallelized, taking advantage of the GPU architecture. In ref. [18], a multigrid-preconditioned conjugate gradient method was implemented in a GPU to accelerate the Direct Current (DC) analysis of power systems, improving the convergence rate of the conjugate gradient algorithm with the multigrid preconditioning method. In ref. [28], the convergence rate of the conjugate gradient method was improved with a polynomial Chebyshev preconditioner integrated into a GPU-based conjugate gradient solver. In ref. [35], a fast decoupled power flow algorithm integrated with the Inexact Newton method was solved with a GPU-based preconditioned conjugate gradient solver with a two-step preconditioner based on a diagonal Jacobi preconditioner and a polynomial Chebyshev preconditioner. In ref. [37], the Newton-Raphson and Gauss-Seidel power flow solvers were accelerated using a GPU where efficient data-oriented parallel primitives, such as map, reduction, and scan, were used in different steps of both algorithms.
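As a sketch of the preconditioned conjugate gradient solvers discussed above, the following NumPy version uses a simple diagonal (Jacobi) preconditioner rather than the multigrid or Chebyshev preconditioners of refs. [18,28,35]; the test matrix is an illustrative symmetric positive-definite system, not a real network admittance matrix.

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-10, max_iter=200):
    """Conjugate gradient with a diagonal (Jacobi) preconditioner M = diag(A)."""
    x = np.zeros_like(b)
    r = b - A @ x
    M_inv = 1.0 / np.diag(A)          # applying M^-1 is an element-wise product
    z = M_inv * r
    p = z.copy()
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            return x
        z_new = M_inv * r_new
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x

# Illustrative symmetric positive-definite system (not a real network matrix).
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = jacobi_pcg(A, b)
assert np.allclose(A @ x, b)
```

The matrix-vector products and element-wise preconditioner applications are the operations these works map onto GPU threads.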

Transient Stability
Transient stability appears as the application where GPU is the second-most used, refs. [18,61–74]. In refs. [62,64], the nonlinear Differential-Algebraic Equations (DAEs) that represent the dynamic behavior of power systems were discretized and vectorized. Then, the slow coherency method was used to divide the system and solve the transient stability simulation in multiple processors at the same time. These modifications allow applying SIMD computation to solve the power system DAEs in parallel on a hybrid multi-CPU-GPU platform with a parallel sparse matrix solver.
In ref. [171], an electromagnetic simulation including large-scale control systems divided the algorithm into two parts: heterogeneous computing, which computes control signals and a Norton equivalent circuit for non-linear electrical components, and homogeneous computing for the electrical components represented by linear and time-discrete Norton equivalent circuits. The currents are computed in a GPU based on a Layered Directed Acyclic Graph (LDAG) that sequentially links parallel primitives and fused multiply-add operations for the linear models. Finally, node voltages are calculated with the impedance matrix of the system.

Smart Grids
Smart grids appear as the next application where GPUs are most used, refs. [124–134]. The authors in ref. [125] presented a survey with the applications and trends of HPC for electric power systems where GPU emerged as one of the technologies for real-time and off-line smart grid simulation and visualization. An adaptive dispatch for smart grids was presented in ref. [131] where a wavelet recurrent neural network was implemented in a cloud-distributed GPU architecture to predict the optimal dispatch.

Contingency Analysis
Following smart grids, contingency analysis is the next application where GPU is used to reduce the simulation time of power flow analysis [99–108]. Depending on the simulation time reduction achieved, hybrid CPU-GPU solutions can evaluate a high number of contingencies and scenarios to find an optimal performance of power systems. The authors in ref. [102] presented a strategy to accelerate DC contingency screening where the following tasks are parallelized in a GPU: calculation of the node voltages and detection of overloaded elements after an outage of branches and generators.

Optimal Power Flow
As seen for contingency analysis, GPU is used for OPF to reduce the simulation time of power flow analysis [1,147–155]. OPF has been implemented using hybrid CPU-GPU algorithms where common instructions are parallelized using the SIMD architecture of GPUs. Metaheuristic methods and the Newton-Raphson algorithm are two of the strategies most often parallelized for the OPF application. The metaheuristic methods are used to find the power system's optimal operating point, considering the state and control variables and the optimization constraints (node voltages and generation limits).
All tasks performed in the GPU consist of common instructions that fit the SIMD technique. Data transfer between the CPU and GPU generates a bottleneck, since the bandwidth of the communication channel between the two processing units is low. One of the challenges of CPU-GPU algorithms is therefore minimizing the number of data transfer tasks to keep the communication channel from becoming a bottleneck.
The authors in refs. [150,155] implemented an OPF with a metaheuristic method and the Newton-Raphson algorithm using a CPU-GPU platform. Particle Swarm Optimization (PSO) is used to optimize the total generation cost, transmission losses, and pollutant emissions of large-scale power systems. The initialization of the particle positions and velocities, the calculation of the fitness of all particles, the update of each particle's best position and of the swarm's best position, and the movement of all particles are parallelized in a GPU for all PSO iterations.
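The particle-level parallelism described above can be sketched with array operations that update all particles at once, mirroring how the SIMD architecture maps particles to GPU threads; the sphere fitness function and PSO parameters below are illustrative assumptions, not those of refs. [150,155].

```python
import numpy as np

def pso(fitness, dim, n_particles=64, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, (n_particles, dim))  # all positions at once
    v = np.zeros_like(x)                            # all velocities at once
    pbest, pbest_f = x.copy(), fitness(x)           # per-particle bests
    g = pbest[np.argmin(pbest_f)].copy()            # swarm best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # One "instruction" updates every particle's velocity and position.
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
        x = x + v
        f = fitness(x)                              # fitness of all particles
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        g = pbest[np.argmin(pbest_f)].copy()
    return g

# Illustrative fitness: sphere function evaluated for every particle at once.
sphere = lambda x: (x ** 2).sum(axis=1)
best = pso(sphere, dim=3)
assert sphere(best[None, :])[0] < 1e-3
```

Every line inside the loop is a whole-swarm array operation, which is exactly the shape of computation the SIMD hardware accelerates.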
Regarding power quality, ref. [208] presented an optimized method for recognizing power quality disturbances based on the modified S transform and a parallel stacked sparse autoencoder. Finally, the parameter identification of the dynamic models of electrical components has also been accelerated using GPUs: the authors in ref. [211] presented an accelerated parameter identification of permanent magnet synchronous machines using a PSO algorithm parallelized in a GPU.
GPU has helped with the following tasks: optimizing the large-scale design of electric vehicles [217]; accelerating probabilistic power flow computation based on Monte-Carlo simulation with simple random sampling [223]; visualizing real-time power system contouring based on a power grid digital elevation model [233]; forecasting power system demand by improving the data training of an artificial neural network with a multi-layer perceptron architecture using the Levenberg-Marquardt learning method [240]; approximating the solution of large differential-algebraic equations for small-signal stability with four methods (Chebyshev discretization, time integration operator discretization, linear multistep, and Padé approximants) [241]; and speeding up the short-circuit current calculation of large-scale power systems with a batch solution where the admittance matrix inverse, short-circuit current of the specified node, node voltages, and branch currents are calculated in parallel with the SIMD technique [234].
Regarding TSCOPF, ref. [229] presented the parallelization of an OPF evaluating steady-state and transient-state constraints where fuel costs were optimized. The transient-state constraints for each contingency were computed in the GPU, while the rest of the algorithm ran in the CPU. For the SCOPF application, no reference was found where a hybrid CPU-GPU algorithm had been implemented to reduce the execution time of either the power flow simulations or the contingency evaluation.

CPU Clusters, Fog and Cloud Computing, and FPGA
On the other hand, Figure 2 shows how PHC technologies such as CPU clusters, fog and cloud computing, and FPGAs have been used in power system studies and applications. These technologies do not use the SIMD technique, since their architectures differ widely from the GPU architecture.

Transient Stability
In contrast to GPU applications, transient stability analysis is the application where the mentioned PHC technologies are most used. In ref. [75], the very dishonest Newton method and the successive over-relaxed Newton method were implemented in parallel to perform a transient stability analysis using the local memory of the Intel iPSC/2 supercomputer and the shared memory of an Alliant FX/8 computer system. The transient stability analysis was executed once the differential machine equations were discretized using the trapezoidal rule.
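The trapezoidal-rule discretization mentioned above turns each differential equation into an algebraic update per time step. The sketch below applies it to the linear test equation dx/dt = -a·x, for which the implicit update has a closed form; in the machine equations of ref. [75] a Newton solve would replace this closed-form step, and the coefficient and step size here are illustrative.

```python
import math

# Trapezoidal rule for dx/dt = f(x): x_{k+1} = x_k + (h/2)*(f(x_k) + f(x_{k+1})).
# For the linear test system f(x) = -a*x the implicit update solves in closed
# form: x_{k+1}*(1 + h*a/2) = x_k*(1 - h*a/2).
def trapezoidal_step(x, a, h):
    return x * (1.0 - 0.5 * h * a) / (1.0 + 0.5 * h * a)

a, h, x = 2.0, 0.01, 1.0
for _ in range(100):                 # integrate from t = 0 to t = 1.0
    x = trapezoidal_step(x, a, h)
# The exact solution is exp(-a*t); the trapezoidal rule is 2nd order in h.
assert abs(x - math.exp(-2.0)) < 1e-4
```

Once discretized this way, each time step is a system of algebraic equations, which is what the cited works distribute across processors.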
In ref. [82], a parallel transient stability simulation was performed using the Interlaced Alternating Implicit (IAI) algorithm, a multilevel partition scheme to divide the power system into subsystems, and a hierarchical block bordered diagonal form algorithm to independently solve the DAEs of each subsystem. Each subsystem was solved on a separate processor. The authors in ref. [96] performed a dynamic simulation of large-scale power systems using the Schur complement domain decomposition method based on the shared memory parallel programming model. The decomposition method helps to divide power systems into reduced subsystems and solve each one as an individual problem using the shared memory of multi-core computers and OpenMP.

Contingency Analysis
Following transient stability analysis, contingency analysis stands as the second application where PHC technologies are most used, refs. [94,109–123]. The authors in refs. [117,121] showed how the master-slave asymmetric communication model was used to analyze thousands of contingencies in large-scale power systems. The master schedules the algorithm tasks among slaves to evaluate all contingencies using proactive task scheduling and stealing methodologies to optimize the load balancing of the workers (slaves). This methodology allows a worker to start a new contingency evaluation as soon as it finishes the previous one, without waiting for any instruction from the master. The communication between master and workers is achieved using MPI, where workers run in multiple threads on different processors.
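The master-worker pattern of refs. [117,121] can be sketched with a thread pool standing in for MPI ranks: the pool hands each worker a new contingency as soon as it finishes the previous one, approximating the dynamic load balancing described above. The toy overload rule and line data are illustrative assumptions, not a real contingency power flow.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative contingency screen: flag an outage if the tripped line's flow,
# shifted onto a remaining line, would exceed that line's limit (a toy rule,
# not a power flow solution).
LINES = {f"L{i}": {"flow": 40.0 + 5.0 * i, "limit": 100.0} for i in range(8)}

def screen(outage):
    """Worker task: evaluate one contingency and report overloaded lines."""
    shifted = LINES[outage]["flow"]
    overloads = [name for name, line in LINES.items()
                 if name != outage and line["flow"] + shifted > line["limit"]]
    return outage, overloads

# The executor plays the master: it hands a worker the next contingency as
# soon as the worker finishes its current one (dynamic load balancing).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(screen, LINES))

critical = sorted(o for o, over in results.items() if over)
```

In the cited works the workers are MPI processes on separate machines and each task is a full post-contingency power flow, but the scheduling structure is the same.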

Power Flow Analysis
After contingency analysis, power flow analysis appears with more PHC references [47–60]. The authors in ref. [47] showed a parallel LU factorization and substitution algorithm to solve large sparse matrix equations using the shared memory of 20 parallel multi-processor computers. The parallel factorization and substitution are used to optimize the execution time of a power flow analysis based on Newton's and Fast Decoupled methods. In ref. [52], a parallel power flow solution based on the Newton-Raphson algorithm was presented. The Jacobian matrix was divided to parallelize the LU factorization. The proposed algorithm was implemented in a multiprocessor system-on-a-programmable-chip computer board containing an FPGA.
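The factorization-and-substitution split of ref. [47] can be sketched serially: the LU factors are computed once, and each subsequent right-hand side needs only the cheap forward and backward substitutions. This dense NumPy version is an illustrative assumption; the cited work operates on large sparse matrices, and the independent row updates below the pivot are the step it parallelizes.

```python
import numpy as np

def lu_factor(A):
    """Doolittle LU factorization without pivoting (illustrative only)."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float).copy()
    for k in range(n - 1):
        factors = U[k + 1:, k] / U[k, k]
        L[k + 1:, k] = factors
        # Row updates below the pivot are independent: the parallelizable step.
        U[k + 1:, k:] -= np.outer(factors, U[k, k:])
    return L, U

def lu_solve(L, U, b):
    n = len(b)
    y = np.zeros(n)
    for i in range(n):                     # forward substitution: L y = b
        y[i] = b[i] - L[i, :i] @ y[:i]
    x = np.zeros(n)
    for i in reversed(range(n)):           # backward substitution: U x = y
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
L, U = lu_factor(A)
x = lu_solve(L, U, np.array([1.0, 2.0, 3.0]))
assert np.allclose(A @ x, [1.0, 2.0, 3.0])
```

Reusing the factors across Newton or Fast Decoupled iterations is what makes the substitution phase the dominant, and therefore parallelized, cost.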

Smart Grids
Smart grids are the next application with PHC references [135–146]. The authors in ref. [136] presented how cloud computing can manage the smart grid data measured from front-end intelligent devices. Cloud computing fits big data applications since it provides scalability, agility, and flexibility. A data security solution was proposed based on identity-based encryption, signature, and proxy re-encryption. In ref. [138], the day-ahead energy resource scheduling of a smart grid with high penetration of distributed generation and electrical vehicles was optimized to satisfy the demand for sensitive loads.
The optimization was developed with a multi-objective model constituted by a PSO and a deterministic technique based on Mixed-Integer Linear Programming. The objective functions are the distributed generator energy production costs, the external suppliers' energy costs, demand response program costs, the energy storage system and electrical vehicle discharging costs, the non-supplied demand costs, and the generation curtailment power costs. Each optimization problem defined in the multi-objective model was solved in an independent computer core.

Optimal Power Flow and Security Constrained Optimal Power Flow
The next applications with usage of PHC technologies are OPF, refs. [156–163], and SCOPF, refs. [225–228]. The authors in ref. [156] presented a parallel OPF solution developed in a network of workstations (CPU cluster). The power system and the OPF problem were divided into geographical regions. Transmission lines that interconnect regions were divided into two lines connected to a dummy bus. Active and reactive power flows and voltage magnitudes and angles were defined as variables in the OPF problem for each area. The objective function for each area neglects the rest of the system and includes the cost of each generator that belongs to the corresponding area. Variables corresponding to dummy buses are modeled with dummy generators. Each OPF is defined with an interior point method. The solution was implemented in a network of seven Sun UltraSparc workstations.
In ref. [227], the decomposition of a SCOPF problem using the Benders decomposition was implemented. The Benders decomposition divides the SCOPF into one master problem (OPF) and N subproblems where a total of N contingencies are evaluated. The subproblems check the feasibility of the master problem's solution. The master solution corresponds to the overall problem solution if the subproblems are feasible and the controls provided by the master problem do not violate any constraint in the post-contingency states.
Since the subproblems are independent, they are solved in parallel, minimizing the required adjustments of preventive controls. When a subproblem is infeasible, instead of preventive control adjustments, it passes cut constraints to the master problem. The feasibility cut is added to the master problem. If there is any violation in the master problem under any contingency, the control variables for the violation are sent to the subproblems to retrieve the corresponding Benders cut.
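The master/subproblem loop described above can be sketched on a toy one-variable problem: the master chooses the cheapest generation level satisfying the accumulated cuts, each subproblem checks one contingency in parallel, and an infeasible subproblem returns a feasibility cut. The demands and cost structure are illustrative assumptions, far simpler than the SCOPF Benders decomposition of ref. [227].

```python
from concurrent.futures import ThreadPoolExecutor

# Toy problem: choose generation x minimizing cost, subject to covering the
# demand d_k that appears under each contingency k (illustrative numbers).
DEMANDS = {"c1": 80.0, "c2": 95.0, "c3": 110.0}

def subproblem(args):
    """Check feasibility of the master's trial point for one contingency."""
    name, x = args
    d = DEMANDS[name]
    return None if x >= d else d       # infeasible: return the cut x >= d

def benders(base_demand=60.0):
    cuts = [base_demand]                           # master constraints: x >= cut
    while True:
        x = max(cuts)                              # master: cheapest feasible x
        with ThreadPoolExecutor() as pool:         # subproblems run in parallel
            new_cuts = [c for c in pool.map(subproblem,
                        ((k, x) for k in DEMANDS)) if c is not None]
        if not new_cuts:                           # all contingencies feasible:
            return x                               # master solution is optimal
        cuts.extend(new_cuts)                      # else add feasibility cuts

assert benders() == 110.0
```

The loop terminates exactly when every subproblem accepts the master's trial point, which is the stopping rule of the cited decomposition.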

Other Applications
With less than four references per application, PHC technologies have been used in dynamic state estimation [94,200,201], power system planning [236,237], TSCOPF [230,231], power system reliability [238,239], electromagnetic transient simulation [91,92], dynamic models [2,216], short circuit analysis [235], security-constrained economic dispatch [242], probabilistic power flow [224], and hydrothermal scheduling [243]. In ref. [201], a power system is divided into multiple areas, and the state estimation of each area is performed individually in each area processor. Areas exchange border information through the coordinator processor (central processor). The proposed method attempts to define areas with similar sizes to balance the workload of the area processors as much as possible. The state estimation is implemented on a cluster of computers. The high-performance portable implementation of MPI (MPICH2) is used to communicate between the coordinator and the area processors.
The researchers in ref. [236] presented the implementation of a parallel genetic algorithm to optimize the integration of different types of generation units during different time intervals. The generation planning method was implemented on a cluster of transputers. The coarse-grain version of the parallel genetic algorithm was implemented, where distributed subpopulations were optimized into several processes, and information was exchanged between subpopulations if required.
In ref. [230], the implementation of a parallel Differential Evolution (DE) algorithm to improve TSCOPF computation time was presented. The developed DE algorithm combines the time-domain simulation and the Transient Energy Function (TEF) method. The objective function corresponds to the generating fuel cost of the power system. The algorithm evaluates the power system stability under the specified contingencies where the transient event during a fault condition and an acceptable steady-state operating condition are considered.
First, the time-domain simulation calculates the generator rotor angles. Then, the TEF method computes the transient energies to determine the system stability. The DE algorithm is implemented on a Beowulf CPU-cluster with one control node and 30 working nodes connected with the MPI protocol. The initial population is divided into subpopulations. Each working node performs an individual DE algorithm to the corresponding subpopulation where the load flow calculation, fitness evaluation, transient stability assessment, and the selection are executed. The control node oversees the initialization, reproduction, and update of the global best individual.
In ref. [238], a parallel metaheuristic TSCOPF was implemented to determine the minimal investment costs to satisfy power system reliability constraints. The authors in [91] analyzed and compared the instantaneous relaxation and direct method solvers for real-time transient stability simulation and variations of nodal and state-space solvers for real-time electromagnetic transient simulations. The tests were performed in multi-core and multi-processor computers and FPGA. In ref. [216], a parallel PSO was implemented to estimate the parameters of a wide variety of photovoltaic models. The parallel PSO was implemented in OpenCL. The evaluation of all particles was performed at the same time in multi-processor devices.
In ref. [235], the probability density curves of short-circuit levels were obtained by running a Monte-Carlo simulation in parallel in multiple virtual machines. Regarding Security-Constrained Economic Dispatch, ref. [242] presented a multithread solution of a power system economic dispatch based on the Multi-Thread Interior Point Barrier algorithm, considering the generator limitations, transmission losses, and nonlinear cost functions. The algorithm validates that the optimal dispatch satisfies line flow limits, adding line flow constraints as security constraints.
The authors in ref. [224] presented an implementation of real-time probabilistic power flow based on a parallel Monte-Carlo simulation run in multicore CPUs. The method allows for the evaluation of uncertainties related to renewable energy resources. Finally, in ref. [243], a parallel differential evolution algorithm was implemented to optimize the short-term schedule of a hydrothermal generator unit considering power flow constraints. The method divided a large population into subpopulations where each subpopulation searched for the optimal solution individually in an exclusive processor.
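The parallel Monte-Carlo pattern of refs. [223,224] can be sketched by giving each worker an independent random stream and aggregating the counts; the single-bus shortfall check below is an illustrative stand-in for a full probabilistic power flow, and all figures are assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

RNG_SEEDS = range(4)            # one independent random stream per worker
SAMPLES_PER_WORKER = 50_000
DEMAND = 100.0                  # MW, illustrative

def worker(seed):
    """Draw wind-output samples and count how often demand is not covered."""
    rng = np.random.default_rng(seed)
    wind = rng.normal(loc=60.0, scale=20.0, size=SAMPLES_PER_WORKER)
    conventional = 50.0          # firm generation, illustrative constant
    shortfall = wind + conventional < DEMAND
    return shortfall.sum()

with ThreadPoolExecutor() as pool:
    counts = list(pool.map(worker, RNG_SEEDS))

# Probability that generation fails to cover demand, aggregated over workers.
p_shortfall = sum(counts) / (len(counts) * SAMPLES_PER_WORKER)
```

Because the samples are independent, the workers never communicate until the final aggregation, which is why Monte-Carlo methods parallelize so naturally across cores, machines, or GPU threads.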

Conclusions
In this paper, we presented a review of PHC techniques in power system applications, such as power flow analysis, transient stability, contingency analysis, and smart grids, among others. The review covered more than 200 works where technologies such as CPU clusters, FPGAs, and hybrid CPU-GPU platforms running in on-premise and cloud environments have been used to improve modern power system planning, operation, studies, visualization, and analysis.
Power flow analysis is the application where PHC techniques have been most used, followed by transient stability and contingency analysis. Regarding power flow analysis, different technologies have been adopted to accelerate the tasks of the power flow algorithm, such as building the admittance matrix or, for the Newton-Raphson method, computing the Jacobian matrix and updating the state variables. On the other hand, applications such as electricity market analysis, small-signal analysis, security-constrained economic dispatch, and hydrothermal scheduling are where PHC techniques are used least frequently, since they are the applications with the lowest number of references.
The review is organized by application and two groups of technologies: the first group corresponds to works that use GPU, and the second group refers to works where CPU clusters, FPGA, and fog and cloud computing are used. This classification shows that GPU is the most-used PHC technique in power system studies and analysis, since works that use this technology account for more than half of all the articles presented in the review. GPU stands as the current tendency, since 80% of the articles published during the last five years correspond to GPU implementations.
This tendency shows that researchers are taking advantage of the GPU architecture designed to exploit data parallelism using the SIMD parallel computing approach. The development of toolkits and frameworks to program GPU instructions in non-graphic programming languages has helped to expand the role of the GPU in power system applications. Finally, the wide number of applications that have used PHC techniques present PHC as a proper alternative for studying and analyzing modern power systems, due to the flexibility and adaptability of these techniques.

Future Work
As presented in the review, one of the most important characteristics of PHC techniques is the flexibility and adaptability of the technologies to a wide variety of applications. However, PHC techniques are not exclusive to the technologies presented in this article (CPU, FPGA, and GPU in on-premise and cloud environments). This opens the possibility of adopting into PHC implementations emerging technologies, such as quantum computing, which continue to arise and make their way into applied sciences and engineering applications.
Regarding parallel computing, the fact that vector processors can compute on several data streams with a single instruction represents a significant advantage over scalar processors. Consequently, technologies like the GPU, the current tendency, represent the future trend for parallel computing. As shown in the review, GPUs can interact properly with CPUs, and thus hybrid CPU-GPU platforms bring significant advantages that have begun to help improve power system operation.
Regarding infrastructure cost and maintenance, and considering the advantages of cloud computing, combining new computing tendencies, such as quantum computing and advances in vector processors (improved GPUs), in cloud environments allows researchers to use a wide number of resources not only to improve power system performance but also to find further applications where PHC can grant better results.