High-Performance and Parallel Computing Techniques Review: Applications, Challenges and Potentials to Support Net-Zero Transition of Future Grids

The transition towards net-zero emissions is inevitable for humanity’s future. Of all the sectors, electrical energy systems emit the most emissions. This urgently requires the witnessed accelerating technological landscape to transition towards an emission-free smart grid. It involves massive integration of intermittent wind and solar-powered resources into future power grids. Additionally, new paradigms such as large-scale integration of distributed resources into the grid, proliferation of Internet of Things (IoT) technologies, and electrification of different sectors are envisioned as essential enablers for a net-zero future. However, these changes will lead to unprecedented size, complexity and data of the planning and operation problems of future grids. It is thus important to discuss and consider High Performance Computing (HPC), parallel computing, and cloud computing prospects in any future electrical energy studies. This article recounts the dawn of parallel computation in power system studies, providing a thorough history and paradigm background for the reader, leading to the most impactful recent contributions. The reviews are split into Central Processing Unit (CPU) based, Graphical Processing Unit (GPU) based, and Cloud-based studies and smart grid applications. The state-of-the-art is also discussed, highlighting the issue of standardization and the future of the field. The reviewed papers are predominantly focused on classical imperishable electrical system problems. This indicates the need for further research on parallel and HPC approaches applied to future smarter grid challenges, particularly to the integration of renewable energy into the smart grid.


Introduction
To date, the global mean temperature continues to rise, and emissions continue to grow, creating great risk to humanity. Efforts and pathways are drawn by many jurisdictions to limit warming to 2 • C and reach net zero CO 2 emissions [1]. This is largely due to the outdated electrical energy system operation and infrastructure, which causes the largest share of emissions of all sectors. Electrical energy systems are, however, witnessing a transition, consequently causing a growth in the scale and complexity of their planning and operation problems. The changing grid topology, decarbonization, electricity market decentralization, and grid modernization mean innovations and new elements are continuously introduced to the inventory of factors considered in grid operation and planning. Moreover, with the accelerating technological landscape and policy changes, the number of potential future paths to Net-Zero increases, and finding the optimal transition plan becomes an inconceivable task.
The use of parallel techniques becomes inescapable, and defaulting HPC competence by the electrical energy and power system community is inevitable in the face of the presumed future and its upcoming challenges. With some algorithmic modifications, parallel tion, Losses Reduction, Power System Simulation, and Analysis and Control Room Visualization, as seen in [33,34]. However, for operational purposes where problems need to be solved on a short time horizon, system models are usually simplified, and heuristic methods are used, relying on the experience of operators, such as in [35]. As a consequence, these models tend to be conservative in fashion, reaching a reliable solution at the expense of reduced efficiency [36]. According to The Federal Energy Regulatory Commission, finding more accurate solution techniques for complex problems such as AC Optimal Power Flow (ACOPF) could save billions of dollars [37]. This motivates the search for methods to produce high-quality solutions in a reasonable time, and one of the ways to push the wall of speed up is through parallel techniques and HPC. Finding appropriate techniques, formulations, and proper parallel implementation of HPC for electrical system studies has been a research area of interest. Progress has been made to make solving complex, accurate power system models for real-time decisions favorable.
The first work loosely related to parallelism on a high-level task in a power system might have been that of Harvey H. Happ in 1969 [38], in which a hierarchical decentralized multicomputer configuration was suggested, targeting Unit Commitment (UC) and Economic Dispatch (ED). Other similar work in multi-level multi-computer frameworks followed, soon targeting Security and voltage control in the 1970s [39][40][41]. P. M. Anderson from the Electrical Power Research Institute created a special report in 1977, collecting various studies and simulations performed at the time, which explored the potential applications for power system analysis on parallel processing machines [42]. Additionally, several papers came out suggesting new hardware to accommodate power-system-related calculations [43].
In 1992, C. Tylasky et al. made what might be the first comprehensive review of the state-of-the-art of parallel processing for power systems studies [44]. They discussed challenges still relevant today, such as different machine architectures, transparency, and portability of the codes used. A few parallel power system study reviews have been conducted throughout the development of computational hardware. Some had similar goals to this paper reviewing HPC applications for power systems [45,46] and on distributed computing for online real-time applications [47]. Computational paradigms changed exponentially, reducing those reviews to pieces of history. The latest relevant, comprehensive reviews on the topic were by R. C. Green et al. for general power system applications [48], and again focused on innovative smart grid applications [49]. These two handle a variety of topics in power systems. This work adds to existing reviews, providing a fresh overview of the state-of-the-art. It distinguishes itself by providing the full context and history of parallel computation and HPC in electrical system optimization and its development up to the latest work in the field. It also provides a thorough base and background for newcomers to the field of power system optimization in terms of both computational paradigms and applied algorithms. It highlights the importance of defaulting HPC utilization in a net-zero future grid. Finally, it brings to light the necessity of integrating HPC in future studies amidst the energy transition and suggests a framework that encourages future collaboration to accelerate HPC deployment. Table 1 shows a side-by-side comparison of related reviews.
Most of the existing work in the literature applies to classical problems that are still relevant to the future grid. For example, the electrical system stability problems only become bigger and more complex with the new grid paradigms, especially with the introduction of synthetic inertia coming from massive wind and solar plants at many data points. However, the amount of HPC and cloud computing consolidation to the future smart grid vision is small, so only one section of this work introduces frameworks of cloud-smart grid integration with different network architectures and topologies. Accompanying cloud usage in future studies is important, as it introduces new considerations and complications, such as optimal resource allocation and cloud security issues. This paper is organized as follows: Sections 2 and 3 identify the main HPC components and parallel computation paradigms. The next six sections review parallel algorithms under both Multiple Instruction Multiple Data (MIMD) and Single Instruction Multiple Data (SIMD) paradigms split into their early development and state-of-the-art for each study, starting with Section 4 PF. Section 5 Optimal Power Flow (OPF). Section 6 UC. Section 7 reviews power system stability studies. Section 8 System State Estimation (SSE). Section 9 reviews unique formulations and studies. Section 10 reviews gird and cloud computation applications of classical problems. Section 11 reviews smart grid cloud-based applications. In these studies, novel approaches and algorithms and modifications to existing complex models are made and parallelized to achieve tractability, or their processing time is reduced below that needed for real-time applications. Many studies showed promising outcomes and made a strong case for further opportunities in using complex system models on HPC and Cloud for operational applications. Section 12 highlights and discusses the present challenges in the studies, re-projects the future of HPC in power systems and energy markets, and recommends a framework for future studies. Section 13 concludes the review. This review does not include machine learning or meta-heuristic parallel applications such as particle swarm and genetic algorithms. Furthermore, while it does include some studies related to the co-optimization of Transmission and Distribution systems, the study excludes parallel analysis and simulations of distribution systems. It is also important to note that this review does not include Transmission and generation system planning problems or models related to grid transitioning because of the lack of parallel or HPC application in the literature on such models, which is addressed in the discussion section.

Parallel Hardware
Parallel computation involves several tasks being performed simultaneously by multiple workers on a parallel machine. Here, a worker is a loose term and could refer to different processing hardware (e.g., a core, Central Processing Unit (CPU), or a compute node). Predominantly, parallel machines can be placed under two categories based on The Von Neumann architecture [50], MIMD, and SIMD machines, as shown in Figure 1. SIMD architecture dominated supercomputing with vector processors, but that changed soon after general-purpose processors became scalable in the 1990s [51,52]. Followed by transputers and microprocessors designed specifically for aggregation and scalability [53]. The effect of this shift in architecture on the algorithm design can be observed in the coinciding studies mentioned in this paper. The following subsection details the function of the present default computer hardware.

CPUs
CPUs were initially optimized for scalar programming and executing complex logical tasks effectively until they hit a power wall, leading to multicore architecture [54]. Today, they function as miniature superscalar computers that enable pipelining, task-level parallelism, and multi-threading [55]. They employ a variety of self-optimizing techniques, such as "Speculation", "Hyperthreading or "Simultaneous Multi-threading", "Auto vectorization", and "task dependency detection" [55]. They contain an extra SIMD layer that supports data-level parallelism, vectorization, and fused multiply-add with high register capacity [56,57]. Furthermore, CPUs use a hierarchy of memory and caches, as shown in Figure 2, which allows complex operations without Random Access Memory (RAM) fetching, from high-speed low-capacity (L1) to lower-speed, higher-capacity caches (L2 then L3). They give the CPU a distinct functional advantage over GPUs. Figure 2. Architecture illustration resembles a generic CPU from Intel Xeon series [58] showing the interconnection of Quick Path Interconnect (QPI), Memory Control Unit (MCU), with caches and RAM.

CPU
A Workstation CPU can have up to 16 processing cores, and server-level CPUs can have up to 128 cores in certain products [59]. Multi-threading is carried out on Application Programming Interfaces (API) such as Cilk or OpenMP, allowing parallelism of functions and loops. Using several server-level CPUs in multi-processing to solve massive decomposed problems is facilitated by APIs such as MPI.

GPUs
GPUs function very similarly to Vector Processing Units or Array processors, which used to dominate supercomputer design. They are additional components to a "Host" machine that sends kernels, which is essentially the CPU. GPUs were originally designed to render 3D graphics. They are especially good at vector calculations. The representation of 3D graphics has a "grid" nature and requires the same process for a vast number of input data points. This execution has been extended to many applications in scientific computing and machine learning, solving massive symmetrical problems or performing symmetrical tasks. Unlike CPUs, achieving efficiency in GPUs parallelism is a more tedious task due to the fine-grained SIMD nature and rigid architecture. Figure 3 shows a simple breakdown of the GPU instruction cycle. The GPU (Device) interfaces with the CPU (Host) through PCI express bus from which it receives instruction "Kernels". In each cycle, a Kernel function is sent and processed by vast amounts of GPU threads with limited communication between them. Thus, the symmetry of the parallelized task is a requisite, and the number of parallel threads has to be of specific multiple factors to avoid the sequential execution of tasks. Specifically, they need to be executed in multiples of 32 threads (a warp) and multiples of two streaming processors per block for the highest efficiencies.
GPUs can be programmed in C or C++. However, many APIs exist to program GPUs, such as OpenCL, HIP, C++ AMP, DirectCompute, and OpenACC. These APIs provide high-level functions, instructions, and hardware abstractions, making GPU utilization more accessible. The most relevant interface is the CUDA by NVIDIA since it dominates the GPU market in desktop and HPC/Cloud [60]. CUDAs libraries make NVIDIAs GPU's power much more accessible to the scientific and engineering communities.
GPU's different architecture may cause discrepancy and lower accuracy in results, as floating points are often rounded in a different manner and precision than in CPUs [61]. Nevertheless, these challenges can be worked around with CUDA and sparse techniques that reduce the number of ALUs required to achieve a massive speedup. Finally, GPUs can offer a huge advantage over CPUs in terms of energy efficiency and cost if their resources are used effectively and appropriately.

Other Hardware
There are two more notable parallel devices to mention. One is the Field Programmable Gate Arrays (FPGA). This chip consists of configurable logic blocks, allowing the user to have complete flexibility in programming the internal hardware and architecture of the chip itself. They are attractive as they are parallel, and their logic can be optimized for desired parallel processes. However, they consume a considerable amount of power compared to other devices, such as the Advanced RISC Machine. Those are processors that consume very little energy due to their reduced instruction set, making them suitable for portable devices and applications [62].

Aggregation and Paradigms
In the late 1970s, project ARPANET took place [63] UNIX was developed [64], and advancement in networking and communication hardware was achieved. The first commercial LAN clustering system/adaptor, ARCNET, was released in 1977 [65], and hardware abstraction sprung in the form of virtual memory, such as OpenVM, which was adopted by operating systems and supercomputers [66]. Around that same time, the concept of computer clusters was forming. Many research facilities and customers of commercial supercomputers started developing their in-house clusters of more than one supercomputer. Today's HPC facilities are highly scalable and are comprised of specialized aggregate hardware, as displayed in Figure 4. The communication between processes through aggregate hardware is aided by high-level software such as MPI, which is available in various implementations and packages such as Mpi4py in python or Apache, Slurm, and mrjob, to aid in data management, job scheduling, and other routines. Specific clusters might be designed or equipped with components geared more toward specific computing needs or paradigms. HPC usually includes tasks with rigid time constraints (minutes to days or maybe weeks) that require a large amount of computation. The High-Throughput Computing (HTC) paradigm involves long-term tasks that require a large amount of computation (months to years) [67]. The Many Task Computing (MTC) paradigm involves computing various distinct HPC tasks and revolves around applications that require a high level of communication and data management [68]. Grid or Cloud facilities provide the flexibility to adopt all the mentioned paradigms.

Grid Computing
The information age spurring in the 1990s set off the trend of wide-area distributed computing and "Grid Computing", the predecessor of the Cloud. Ian Foster coined the term with Carl Kesselman and Steve Tuecke, who developed the Globus toolkit that provides grid computing solutions [69]. Many Grid organizations exist today, such as Organizations such as NASA 3-EGEE and Open Science Grid. Grid computing shaped the field of "Metacomputing", which revolves around the decentralized management and coordination of Grid resources, often carried out by virtual organizations with malleable boundaries and responsibilities. The infrastructure of grids tends to be very secure and reliable, with an exclusive network of users (usually scientists and experts), discouraging virtualization and interactive applications. Hardware is not available on demand; thus, it is only suitable for sensitive, close-ended, non-urgent applications. Grid computing features provenance performance monitoring and is mainly adopted by research organizations.

Cloud Computing
Cloud computing is essentially the commercialization and effective scaling of Grid Computing driven by demand, and it is all about the scalability of computational resources for the masses. It mainly started with Amazon's demand for computational resources for its e-commerce activities, which precipitated Amazon to start the first successful infrastructure as a service-providing platform with Elastic Compute Cloud [70] for other businesses that conduct similar activities.
The distinction between Cloud and Grid is an implication of their business model. Cloud computing is way more flexible and versatile than Grid when it comes to accommodating different customers and applications. It relies heavily on virtualization and resource sharing. This makes Cloud inherently less secure, less efficient in performance than Grid, and more challenging to manage, yet way more scalable, on-demand, and overall more resource efficient. It achieves a delicate balance between efficiency and the cost of computation.
Today, AWS, Microsoft Azure, Oracle Cloud, Google Cloud, and many other cloud commercial services provide massive computational resources for various companies such as Netflix, Airbnb, ESPN, HSBC, GE, Shell, and the NSA. It only makes sense that the electrical industry will adopt the Cloud.

Virtualization
The appearance of virtualization caused a considerable leap in massive parallel computing, especially after the software tool Parallel Virtual Machines (PVM) [71] was created in 1989. Since then, tens and hundreds of virtualization platforms have been developed, and are used today on the smallest devices with processing power [72]. Virtualization allows resources to be shared in a pool, where multiple instances of different types of hardware can be emulated on the same metal. This means less hardware can be allocated or invested in Cloud computing for a more extensive user base. Often, the percentage of hardware used is low compared to the requested hardware. Idle hardware is reallocated to other user processes that need it. The instances initiated by users float on the hardware, such as clouds shifting and moving or shrinking and expanding depending on the actual need of the process.

Containers
While virtualization makes hardware processes portable, containers make software portable. Developing applications, software, or programs in containers allows them to be used on any Operating System (OS) as long as it supports container engines. That means one can develop a Linux-based software (e.g., that works on Ubuntu 20.04) in a container and run that same application on a machine with Windows OS or iOS installed. This flexibility applies to service-based applications that utilize HPC facilities. An application can be developed on containers, and clients can use it on their cluster or a cloud service. Figure 5 compares virtual machines with container layers. While this shows that a host OS is required, they can also run on bare metal, removing latency and development complexity. Docker and Apptainer (formerly known as singularity) are commonly used containerization engines in Cloud and Grid, respectively [73,74].

Fog Computing
Cloudlets, edge nodes, and edge computing are all related to an emerging IoT trend, Fog Computing. Fogs are computed nodes associated with a cloud that are geographically closer to the end-user or control devices. Fogs mediate between extensive data or cloud computing centers and users. This topology aims to achieve data locality, offering several advantages, such as low latency, higher efficiency, and decentralized computation.

Volunteer Computing
Volunteer computing is an interesting distributed computing model that originated in 1996 via a Great Internet Mersenne Prime Search [75], allowing individuals connected to the internet to donate their personal computer's idle resources for a scientific research computing task. Volunteer computing remains active today, with many users and various middleware and projects, both scientific and non-scientific, primarily based on BOINC [76], and in commercial services such as macOS Server Resources [77].

Granularity
Fine-grained parallelism appears in algorithms that frequently repeat a simple homogeneous operation on a vast dataset. They are often associated with embarrassingly parallel problems. The problems can be divided into many highly, if not wholly symmetrical simple tasks, providing high throughput. Fine-grained algorithms are also often associated with multi-threading and shared memory resources. Coarse-grained algorithms imply moderate or low task parallelism that sometimes involves heterogeneous operations. Today, coarse-grained algorithms are almost synonymous with Multi-Processing, where the algorithms use distributed memory resources to divide tasks into different processors or logical CPU cores.

Centralized vs. Decentralized
Centralized algorithms refer to problems with a single task or objective function, solved by a single processor, with data stored at a single location. When a centralized problem is decomposed into N subproblems, sent to N number of processors to be solved, and retrieved by the central controller to update variables, re-iterate, and verify convergence, the algorithm becomes a "Distributed" algorithm. The terms distributed and decentralized are often used interchangeably and are often confused in the literature. There is an important distinction to make between them. A Decentralized Algorithm is one in which the decomposed subproblems do not have a central coordinator or a master problem. Instead, the processes responsible for the subproblems communicate with neighboring processes to reach a solution consensus (several local subproblems with coupling variables where subproblems communicate without a central coordinator). The value of each type is not only determined by computational performance but the decision-making policy.
In large-scale complex problems, distributed algorithms sometimes outperform centralized algorithms. The speedup keeps growing with the problem size if the problem has "strong scalability". Distributed algorithms' subproblems share many global variables. This means a higher communication frequency, as all the variables need to be communicated back and forth to the central coordinator. Moreover, in some real-life problems, central coordination of distributed computation might not be possible. Fully decentralized algorithms solve this problem as their processes communicate laterally, and only neighboring processes have shared variables. Figure 6 illustrates the three schemes.

Synchronous vs. Asynchronous
Synchronous algorithms are ones in which the algorithm does not move forward until the parallel tasks at a certain step or iteration are executed. Synchronous algorithms are more accurate and efficient for tasks with symmetrical data and complexity. However, that is usually not the case in power system optimization studies. The efficiency of these algorithms suffers, however, when the tasks are not symmetrical. Asynchronous algorithms allow idling workers to take on new tasks, even if not all the adjacent processes are complete. This is possible only at the cost of accuracy when there are dependencies between parallel tasks. To achieve better accuracy in asynchronous algorithms, "Formation" needs to be ensured, meaning that while subproblems may have a deviation in the direction of convergence, they should keep a global tendency toward the solution.

Problem Splitting and Task Scheduling
Large emphasis must be placed on task scheduling when designing parallel algorithms. In multi-threading, synchronization of tasks is required to avoid "Race Conditions" that cause numerical errors due to multiple threads accessing the same memory location simultaneously. Hence, synchronization does not necessarily imply that processes will execute every instruction simultaneously but rather in a coordinated manner. Coordination mechanisms involve pipe-lining or task overlapping, which can increase efficiency and reduce the latency of parallel performance. For example, sub-tasks that take the longest time in synchronous algorithms can utilize idle workers of completed sub-tasks if no dependencies prevent such allocation. Dependency analysis is occasionally carried out when splitting tasks. In an elaborate parallel framework, such as in multi-domain simulations or smart grid applications, task scheduling becomes its own complex optimization problem, which is often solved heuristically. However, there exist packages such as DASK [78], which can help with optimal task planning and parallel task scheduling, as shown in Figure 7. DASK is a python library for parallel computing which allows easy task planning and parallel task scheduling optimization. The boxes in Figure 7 represent data and circles represent operations.

Parallel Performance Metrics
Solution time and accuracy are the main measures of the success of the parallel algorithm. According to the Amhals law of strong scaling, there is an upper limit to the speedup achieved for a fixed-size problem. Dividing a specific fixed-size problem into more subproblems does not result in a linear speedup. However, if the parallel portion of the algorithm increases, then proportionally increasing the subproblems or the number of processors could continuously increase the speedup, according to Gustafson's Law of strong scaling. The good news is that Gustafson's law applies to large decomposed power system problems.
There are three types of metrics most frequently used in the literature, as shown in Equation (1) A Linear Speedup is considered optimal, while a sublinear speedup is a norm because there is always a serial portion in a parallel code. Efficiency and scalability are vaguely related. Efficiency is mostly used to compare a specific parallel setup to a sequential one, whereas scalability is used to see how the parallel algorithm scales with increased hardware. A scalable algorithm is not necessarily an efficient one. When creating a parallel algorithm, emphasis on the quality of the serial portion of the algorithm must be ensured. A parallel algorithm, after all, is contained and executed by a serial code. Both sequential and parallel programs are vulnerable to major random errors caused by the Cosmic Ray Effect [79], which has been known to cause terrestrial computational problems [80]. Soft errors might be of some concern regarding real-time power system applications. However, in parallel programming, especially in multi-processing, reordering floating-point computations is unavoidable; thus, a tiny deviation in accuracy from the sequential counterpart is expected and should be tolerated, given that the speedup achieved justifies it. All the computing paradigms mentioned above are points to tweak and consider when creating and applying any parallel algorithm.

Power Flow
Power Flow (PF) studies are central to all power system studies involving network constraints. The principal goal of PF is to solve the network's bus voltages and angles to calculate the flows of the network. For some applications, PF is solved using DC power flow equations, approximations based on realistic assumptions. Solving these equations is easy and relatively fast and results in an excellent approximation of the network PF [29]. On the other hand, non-linear full AC Power flow equations need to be solved to obtain an accurate solution, and these require numerical approximation methods. The most popular ones in power system analysis are the Newton Raphson (NR) method, and the Interior Point Method (IPM) [37]. However, these methods are computationally expensive and too slow for real-time applications, making them a target for parallel execution. Table 2 at the end of this section summarizes contributions of MIMD-based PF studies at the end of it. Parallelism in PF in some earlier studies was achieved by restructuring the hardware to suit the algorithm. In what might be the first parallel power flow implementation, Taoka H. et al. designed their multi-processor system around the Gauss Iterative method in 1981, such that each i8085 processor in the system would solve the power flow for each bus [81]. Instead of modifying the algorithm, the hardware was tailored to it, achieving a linear speedup compared to the sequential implementation in [82]. Similarly, one year later, S. Chan and J. Faizo [83] reprogrammed the CDC supercomputer to solve the Fast-Decoupled Power Flow (FDPF) algorithm in parallel. FPGAs were also used to parallelize LU decomposition for PF [84,85]. The hardware modification approach, while effective, is, for the most part, impractical, and it is common sense to modify algorithms to fit existing scalable hardware.

MIMD Based Studies
The other way to parallelize PF (or OPF) is through network partitioning. While network partitioning usually occurs at the problem level in OPF, in PF, the partitioning often happens at the solution/matrix level. Such partitioning methods for PF use sparsity techniques involving LU decomposition, forward/backward substitution, and diakoptics that trace back to the late 1960s, predominantly by H. H. Happ [86,87] for load flow on typical systems [88,89], and dense systems [41]. Parallel implementation of PF using this method started in the 1980s on array processors such as the VAX11 [90] and later in the 1990s on the iPSC hypercube [91]. Techniques such as FDPF were also parallelized on the iPSC using Successive Over-relaxation (SOR) on Gauss-Sidel (GS) [92], and on vector computers such as the Cray, X/MP using Newtons FDPF [93]. PF can also be treated as an unconstrained non-linear minimization problem, which is precisely what E. Housos and O. Wing [94] did to solve it using a parallelizable modified conjugate directions method.
When general processors started dominating parallel computers, their architecture was homogenized, and the enhancements achieved by parallel algorithms became comparable and easier to experiment with. This enabled a new target of optimizing the parallel techniques themselves. Chen and Chen used transputer-based clusters to test the best workload/node distribution on clusters, and [95] and a novel BBDF approach for power flow analysis [96]. The advent of Message Passing Interface (MPI) allowed the exploration of scalability with the Generalized Minimal Residual Method (GMRES) in [97] and the multi-port inversed matrix method [98] as opposed to the direct LU method. Beyond this point, parallel PF shifted heavily towards using SIMD hardware (GPUs particularly), except for a few studies involving elaborate schemes. Some examples include transmission/distribution, smart grid PF calculation [99], or Optimal network partitioning for fast PF calculation [100]. Table 2. Power flow MIMD-based state-of-the-art studies.

Paper
Contribution [81] Gauss Iterative method on a specialized machine [82] Tailored parallel hardware for FDPF [83] Reprogrammed CDC supercomputer to parallelize FDPF [84,85] Reprogrammed FPGA to parallelize LU method [90] LU method on VAX11 [91] LU method on iPSC hypercube [92] FDPF using SOR on Gauss-Siedel on IPSC hypercube [93] Newton FDPF on Cray X/MP [94] Parallel conjugate directions method [95] Novel BBDF on transputer-based cluster [97] MPI scheme parallelizing GMRES [98] MPI scheme parallelizing multi-port inverse matrix method 4.2. SIMD Based Studies 4.2.1. Development GPU dominates recent parallel power system studies. The first power flow study implementation might have been achieved by using a preconditioner Biconjugate Gradient Algorithm and sparsity techniques to implement the NR method on a NVIDIA Tesla C870 GPU [101]. Some elementary approaches parallelized the computation of connection matrices for networks where more than one generator could exist on a bus on a NVIDIA GPU [102]. CPUs were also used in SIMD-based power flow studies since modern CPUs exhibit multiple cores; hence, multi-threading with OpenMP can be used to vectorize NR with LU factorization [103]. Some resorted to GPUs to solve massive batches of PF for Probabilistic Power Flow (PPF) or contingency analysis, thread per scenario, such as in [104]. Others modified the power flow equations to improve the suitability and performance on GPU [105,106].
While many papers limit their applications to NVIDIA GPUs by using CUDA, OpenCL, a general parallel hardware API, has also been used occasionally [107]. Some experimented with and compared the performance of different CUDA routines on different NIVIDIA GPU models [108]. Similar experimentation on routines was conducted to solve ACOPF using FDPF [109]. In [110], NR, Gauss Sidel, and Decoupled PF were tested and compared against each other on GPU. Improvement on the Newtons Method and parallelizing different steps of it were performed previously [111]. Asynchronous PF algorithms were applied on GPU, which sounds difficult, as the efficiency of GPUs depends on synchronicity and hegemony [112]. Even with the existence of CUDA, many still venture into creating their routines with OpenCL [113,114] or direct C coding [115] of GPU hardware to fit their needs for PF. Very recently, a few authors made thorough overviews for parallel power flow algorithms on GPU covering general trends [116,117] and specifically AC power flow GPU algorithms [118]. In the State-of-the-Art subsection, the most impactful work is covered. Table 3 summarizes the latest SIMD-based PF studies contributions. A lot of the recent work in this area focuses on pre-conditioning and fixing ill-conditioning issues in iterative algorithms to solve the created Sparse Linear Systems (SLS). An ill-conditioned problem exhibits a massive change in output for a minimal change in input, making the solution to the problem hard to find iteratively. Most sequential algorithms are LU-based direct solvers, as they do not suffer from ill-conditioning. However, Iterative solvers such as the Conjugate Gradient method, which have been around since the 1990s [119], are regaining traction for their scalability and parallel computing advancement.

State-of-the-Art
The DC Power Flow (DCPF) problem was solved using the Chebyshev preconditioner and conjugate gradient iterative method in a GPU (448 cores Tesla M2070) implementation in [120,121]. The vector processes involved are easily parallelizable in the most efficient way with CUDA libraries such as CUBLAS and CUSPARSE, which are Basic Linear Algebra Subroutine and Sparse Matrix Operation Libraries. This work used comparisons of sparse matrix storage formats, such as the Compressed Sparse Row (CSR) and Block Compressed Sparse Row (BSR). On the largest system size, a speedup of 46× for the pre-conditioning step and 10.79× for the conjugate gradient step was achieved compared to a sequential Matlab implementation (8-core CPU).
Later, the same author went on to Parallelize the FDPF using the same hardware and pre-conditioning steps [122]. Two natural systems were used, the Polish system, which had groups of locally connected systems, and the Pan-European system, which consisted of several large coupled systems. This topology difference results in a difference in the sparsity patterns of the SLS matrix, which offers a unique perspective. Their proposed GPU-based FDPF was implemented with Matlab on top of MatPower v4.1. In their algorithm, the CPU regularly corresponds with the GPU, sending information back and forth over one iteration. Their tests showed that the FDPF performed better on the Pan-European system because its connections were more ordered than the Polish system. CPU-GPU communication frequently occurred in their algorithm steps, most likely bottlenecking the speedup of their algorithm (less than 3× achieved compared to CPU only).
Instead of adding pre-conditioning steps, M. Wang et al. [123] focus on improving the continuous Newtons method such that a stable solution is found even for an ill-conditioned power flow problem. For example, if any load or generator power exceeds 3.2 p.u. in the IEEE-118 test case, the NR method fails to converge; even if the value is realized in any iteration, their algorithm will still converge to the solution with their method. This was achieved using different-order numerical integration methods. The CPU loads data into GPU and extracts the results upon convergence only, making the algorithm very efficient. The approach substantially improved over the previous work by removing the pre-conditioning step and reducing CPU-GPU communication (speedup of 11× compared to CPU-only implementation).
Sometimes, dividing the bulk of computational load between the CPU and GPU (a hybrid approach) can be more effective depending on the distribution of processes. In one hybrid CPU-GPU approach, a heavy emphasis on the sparsity analysis of PF-generated matrices was made in [124]. When using a sparse technique, the matrices operated on are reduced to ignore the zero terms. For example, the matrix is turned into a vector of indices referring to the non-zero values to confine operations to these values. Seven parallelization schemes were compared, varying the techniques used (Dense vs. Sparse treatment), the majoring type (row vs. column), and the threading strategy. Row/columnmajor signifies whether the matrix's same row/column data are stored consecutively. The thread invocation strategies varied in splitting or combining the calculation of P and Q of the mismatch vectors. In this work, two sparsity techniques were experimented with, showing a reduction in operations down to 0.1% of the original number and two or even three orders of magnitude performance enhancement for power mismatch vector operations. In 100 trials, their best scheme converged within six iterations on a four-core host and a GeForce GTX 950M GPU, with a small deviation in solution time between trials. CPU-GPU communication took about 7.79-10.6% of the time, a fairly low frequency. However, the proposed approach did not consistently outperform a CPU-based solution with all of these reductions. The authors suggested that this was due to using higher-grade CPU hardware than the GPU.
Zhou et al. might have conducted the most extensive research in GPU-accelerated batch solvers in a series between 2014 and 2020. They fine-tune the process of solving PPF for GPU architecture in [125,126]. The strategies used include Jacobian matrix packaging, contiguous memory addresses, and thread block assignment to avoid divergence of the solution. Subsequently, they use the LU-factorization solver from previous work to finally create a batch-DPF algorithm [127]. They test their batch-DPF algorithm on three cases: 1354-bus, 3375-bus, and 9241-bus systems. For 10,000 scenarios, they solved the largest case within less than 13 s, showing the potential for online application.
Most of the previous studies solve the PF problem in a bare and limited setup when compared to the work by J. Kardos et al. [128] that involves similar techniques in a massive HPC framework. Namely, preventative Security Constrained Optimal Power Flow (SCOPF) is solved by building on an already existing suite called BELTISTOS [129]. BELTISTOS specifically includes SCOPF solvers and has an established Schurs Complement Algorithm that factorizes the Karush-Kuhn-Tucker (KKT) conditions, allowing for a great degree of parallelism in using IPM to solve general-purpose Non-Linear Programming (NLP) problems. Thus, the main contribution of this work is in removing some bottlenecks and ill-conditioning that exist in Schur's Complement steps introducing a modified framework (BELTISTOS-SC). The parallel Schur algorithm is bottlenecked by a dense matrix associated with the solution's global part. This matrix is solved in a single process. Since GPUs are meant to be used for dense systems, they factorize the system and apply forwardbackward substitution, solving it using cuSolve, a GPU-accelerated library to solve dense linear algebraic systems. Table 3. Power flow SIMD-based state-of-the-art studies.

Paper
Contribution [120,121] Chebyshev pre-conditioner and conjugate gradient iterative method [122] Parallel FDPF [123] Improving the continuous Newtons method [124] 7 parallelization strategies with sparsity analysis [125,126] Probabalistic Power Flow parallelization [127] Batch decoupled power flow [128] Security Constrained OPF (BELTISTOS software) They performed their experiments using a multicore Cray XC40 computer at the Swiss National Supercomputing Centre. They used 18 2.1 GHz cores, NVIDIA Tesla P100 with 16 GB memory, and many other BELTISTOS and hardware-associated libraries. They tested their modification on several system sizes from PEGASE1354 to PEGASE13659. Their approach sped up the solution of the Dense Schur Complement System by 30× for the largest system over CPU solution of that step, achieving notable speed up in all systems sizes tested. They later performed a large-scale performance study, where they increased the number of computing cores used from 16 to 1024 on the cluster. The BELTISTOS-SC augmented approach achieved up to 500× speedup for the PEGASE1354 system and 4200× for the PEGASE9241 when 1024 cores are used, demonstrating strong scalability up to 512 cores.

Optimal Power Flow
Like PF, OPF studies are the basis of many operational assessments such as System Stability Analysis (SSA), UC, ED, and other market decisions [29]. Variations of these assessments include Security Constrained Economic Dispatch (SCED) and SCOPF, both involving contingencies. OPF ensures the satisfaction of network constraints over cost or power-loss minimization objectives. The full ACOPF version has non-linear, non-convex constraints, making it computationally complex and making it difficult to reach a global optimum. DC Optimal Power Flow (DCOPF) and other methods, such as decoupled OPF, linearize and simplify the problem, and when solved, they produce a fast but sub-optimal solution. Because DCOPF makes voltage and reactive power assumptions, it becomes less reliable with increased RE penetration. RE deviates voltages and reactive powers of the network significantly. This is one of the main drivers behind speeding up ACOPF in realtime applications for all algorithms involving it. The first formulation of OPF was achieved by J. Carpenter in 1962 [130], followed by an enormous volume of OPF formulations and studies, as surveyed in [131].

Development
OPF and SCOPF decomposition approaches started appearing in the early 1980s using P-Q decomposition [132,133] and including corrective rescheduling [134]. The first introduction to parallel OPF algorithms might have been by Garng M Huang, and Shih-Chieh Hsieh in 1992 [135], who proposed a "textured" OPF algorithm that involved network partitioning. In a different work, they proved that their algorithm would converge to a stationary point and that with certain conditions, optimality is guaranteed. Later, they implemented the algorithm on the nCUBE2 machine [136], showing that both their sequential and parallel textured algorithm is superior to non-textured algorithms. It was atypical for studies at the time to highlight portability, which makes Huang's work in [92] special. It contributed another OPF algorithm using Successive Overrelaxation by making it "Adaptive", reducing the number of iterations. The code was applied on the nCUBE2 and ported to Intel iPSC/860 hypercube, demonstrating its portability.
In 1990, M.Teixeria et al. [137] demonstrated what might be the first parallel SCOPF on a 16-CPU system developed by the Brazilian Telecom R&D center. The implementation was somewhat "makeshift" and coarse to the level where each CPU was installed with a whole MS/DOS OS for the multi-area reliability simulation. Nevertheless, it outperformed a VAX 11/780 implementation and scaled perfectly, was still 2.5 times faster than running on, and exhibited strong scalability.
Distributed OPF algorithms started appearing in the late 1990s with a coarse-grained multi-region coordination algorithm using the Auxiliary Problem Principle (APP) [138,139]. This approach was broadened much later by [140] using Semi-Definite Programming and Alternating Direction Method of Multiplier (ADMM). Prior to that, ADMM was also compared against the method of partial duality in [141]. The convergence properties of the previously mentioned techniques and more were compared comprehensively in [142].
The asynchronous parallelization of OPF first appeared on preventative [143], and corrective SCOPF [144] targeting online applications [47] motivated by the heterogeneity of solution time of different scenarios. Both SIMD and MIMD machines were used, emphasizing portability as "Getsub and Fifo" routines were carried out. On the same token, MPI protocols were used to distribute and solve SCOPF, decomposing the problem with GRMES and solving it with the non-linear IPM method varying the number of processors [145]. Real-time application potential was later demonstrated by using Benders decomposition instead for distributed SCOPF [146]. Benders decomposition is one of the most commonly used techniques to create parallel structures in power system optimization problems, and it shows up in different variations in the present literature. As Figure 8 shows, Benders Decomposition is applied by fixing the complicating variables of the objective function to a different value in every iteration and constructing a profile with benders cuts to find the minimum objective function value with respect to the complicating variables. If the profile is non-convex, then optimality is not guaranteed since the value is changed in steps descending the slope of the cuts.  Table 4 summarizes the contributions of MIMD-based OPF studies of this section. As mentioned, involving AC equations in large-scale power system studies is crucial and might soon become the standard. The difficulty of achieving this feat varies depending on the application. For example, precise nodal price estimation is attained by solving ACOPF multiple times under different possible system states (Probabilistic ACOPF). This can be effortlessly scaled as each proposed scenario can be solved independently, but a large number of scenarios can exhaust available resources. This is when researchers resort to scenario reduction techniques such as Two-Point Estimation [147]. In this study, it was applied to 10k Monte Carlo (MC) scenarios, and the reduced set was used to solve a conically relaxed ACOPF model following the approach in [148]. Using 40 cores from the HPC cluster of KTH Royal Inst of Tech, the approach resulted in an almost linear speedup with high accuracy on all test cases.

State-of-the-Art
In contrast, parallelizing a monolithic ACOPF problem itself is much more complicated. However, the same authors did this readily since their model was already decomposable due to the conic relaxation [149]. Here, the choice of network partitions is treated as an optimization problem to realize the least number of lines between sub-networks. A graph partitioning algorithm and a modified benders decomposition approach were used, providing analytical and numerical proof that they converge to the same value as the original benders. This approach achieved a lower-upper bound gap of around 0-2%, demonstrating scalability. A maximum number of eight partitions (eight subproblems) were divided on a four-core 2.4 GHz workstation. Beyond four partitions, hyperthreading or sequential execution must have occurred. This is a shortcoming, as only four threads can genuinely run in parallel at each time. Hyper-threading only allows a core to alternate between two tasks. Their algorithm might have even more potential if distributed on an HPC platform.
The ACOPF formulation is further coupled and complicated when considering Optimal Transmission Switching (OTS). The addition of binary variables ensures the nonconvexity of the problem, turning it from an NLP to an Mixed Integer Non-Linear Programming (MINLP). In Lan et al. [150], this formulation is parallelized for battery storage embedded systems, where temporal decomposition was performed, recording the State of Charge (SoC) at the end of each 6 h (four subproblems). They employed a two-stage scheme with an NLP first stage to find the ACOPF of a 24 h time horizon and transmission switching in the second stage. The recorded SOCs of the first stage are added as constraints to the corresponding subproblems, which are entirely separable. They tested the algorithm IEEE-188 test case and solved it with Bonmin with GAMS on a four-core workstation. While the coupled ACOPF-OTS formulation achieved a 4.6% Optimality gap at a 16 h 41 min time limit, their scheme converged to a similar gap within 24 m. The result is impressive, considering the granularity of the decomposition. This is yet another example in which a better test platform could have shown more exciting results, as the authors were limited to parallelizing four subproblems. Algorithm-wise, an asynchronous approach or better partition strategy is needed, as one of the subproblems took double the time of all the others to solve.
The inclusion of voltage and reactive power predicate the benefits gained by ACOPF. However, it is their effect on the optimal solution that matters, and there are ways to preserve that while linearizing the ACOPF. The DCOPF model is turned into a Mixed Integer Linear Programming (MILP) in [151] by adding on/off series reactance controllers (RCs) to the model. The effect of the reactance is implied by approximating its value and adding it to the DC power flow term as a constant without actually modeling reactive power. The binary variables are relaxed using the Big M approach to linearize the problem, derive the first-order KKT conditions, and solve it using the decentralized iterative approach. Each node solved its subproblem, making this a fine-grained algorithm, and each subproblem had coupling variables with adjacent buses only. The approach promised scalability, and its convergence was proven in [152]. However, it was not implemented in parallel, and the simulation-based assumptions are debatable.
Decentralization, in that manner, reduces the number of coupling variables and communication overhead. However, this also depends on the topology of the network, as shown in [111]. In this work, a stochastic DCOPF formulation incorporating demand response was introduced. The model network was decomposed using ADMM and different partitioning strategies where limited information exchange occurs between adjacent subsystems. The strategies were implemented using MATLAB and CPLEX on a six-bus to verify solution accuracy and later on larger systems. The ADMMBased Fully Decentralized DCOPF and Accelerated ADMM for Distributed DCOPF were compared. Recent surveys on Distributed OPF algorithms showed that in OPF decomposition and parallelization, ADMM and APP are preferred in most of the studies as a decomposition technique [149,153]. The distributed version converged faster, while the decentralized version exhibited better communication efficiency. More importantly, a separate test showed that decentralized algorithms work better on subsystems that exhibit less coupling (are less interconnected) and vice versa. This breeds the idea that decentralized algorithms are better suited for ring or radial network topologies while distributed algorithms are better for meshed networks [153,154]. Table 4. Optimal Power Flow MIMD based state-of-the-art studies.

Paper
Contribution [147] Two-Point estimation for scenario reduction (Monte Carlo Solves) [148] Conically Relaxed ACOPF (Monte Carlo Solves) [149] Monolithic ACOPF model parallelization using conic relaxation [150] Battery storage and Transmission Switching with Temporal Decomposition [111] Comparing decentralized and distributed algorithms on different network topologies [155] Transmission-Distribution network co-optimization, [99,156] Two-stage stochastic algorithm accounting for DER Distribution networks tend to be radial. A ring topology is rare, except in microgrids. Aside from their topology, they have many differences compared to transmission networks causing the division of their studies and OPF formulations. OPF for Transmission-Distribution co-optimization makes a great case for HPC use in power system studies, as co-optimizing the two together is considered peak complexity. S. Tu et al. [155] decomposed a very large-scale ACOPF problem in the Transmission-Distribution network co-optimization attempt. They devised a previously used approach where the whole net-work was divided by its feeders, where each distribution network had a subproblem. The novelty in their approach lies in a smoothing technique that allows gradient-based nonlinear solvers to be used, particularly the Primal-Dual-Interior Point Method, which is the most commonly used method for solving ACOPF. Similar two-stage stochastic algorithms were implemented to account for the uncertainties in Distributed Energy Resources (DER) at a simpler level [99,156].
S. Tu et al. used an augmented IEEE-118 network, adding distribution systems to all busses, resulting in 9206 buses. Their most extreme test produced 11,632,758 bus solutions (1280 scenarios). Compared to a generic sequential Primal Dual Interior Primal Method (PDIPM), the speedup of their parallelized approach increased linearly with the number of scenarios and scaled strongly by increasing the number of cores used in their cluster. In contrast, the serial solution time increased superlinearly and failed to converge within a reasonable time in a relatively trivial number of scenarios. While their approach proved to solve large-scale ACOPF much faster than a serial approach, it falls short in addressing Transmission-Distribution co-optimization because it merely considered the distribution network as sub-networks with the same objective as the transmission, which is unrealistic.

Unit Commitment
The UC problem goes back to the 1960s [157]. In restructured electricity markets, Security Constrained Unit Commitment (SCUC) is used to determine the generation schedule at each time point at the lowest cost possible while maintaining system security. A typical formulation used in today's industry can be found in [158]. Often, implementations use immensely detailed stochastic models involving N-1 or N-2 contingencies [159], ACOPF constraints, and incorporating RE resources DER [160] and distribution networks [161]. This leads to a large number of scenarios and a tremendously complex problem. Decomposition of the problem using Lagrangian Relaxation (LR) methods is very common [36,162] and many formulations are ready to segue for HPC parallel implementation. This includes global optimal solution methods for AC-SCUC, as in [163]. Recent literature on parallel UC is abundant, making this section the largest in this review. Simulations of parallel environments to implement parallel UC algorithms started appearing around 1994, modeling hydrothermal systems [164] and stochasticity [165] on supercomputers [166] workstation networks [167]. The earliest UC implementations on parallel hardware used embarrassingly parallel metaheuristics, such as simulated annealing [165] and other genetic algorithms [168,169]. However, the first mathematical programming approach might have been conducted by Misra in 1994 [166] using dynamic programming and vector processors. Three years later, K.K. Lau and M.J. Kumar also used dynamic programming to create a decomposable master-slave structure of the problem. It was then distributed over a network of workstations using PVM libraries, with each subproblem solved asynchronously [167]. These, however, were not networkconstrained problems.
In 2000, Murillo-Sanchez and Robert J. Thomas [170] attempted full non-linear AC-SCUC in parallel by decomposing the problem using APP but failed to produce any results, upholding the problem's complexity. In quite an interesting case, volunteer computing with the BOINC system was used to parallelize the MC simulations of stochastic load demand in UC problems with hydrothermal operation [171]. However, that did not include network constraints either. There has been some work conducted on decomposition algorithms for network-constrained UC, but this has rarely been applied in a practical parallel setup. Most parallel implementations happened in the last decade. Table 5 summarizes the latest contributions of MIMD-based UC studies of this section. Papavasillio et al. [172], set up the framework of a scenario-based two-stage stochastic framework of UC with reserve requirement for wind power integration, emphasizing wind forecasting and scenario selection methodology. In later work, Papavasiliou compared a benders decomposition approach that removes the second-stage bottlenecking and a Lagrangian Relaxation (LR) algorithm based on [162], where the impact of contingencies on decisions was implied in the constraints [160]. The LR approach proved to be more scalable for that formulation. As a result, Papavasillo chose the LR approach to solving the same formulation by adding DC network constraints [173]. Even though the wind scenarios were carefully selected, there existed instances in which specific subproblems took about double the time of the following most complex subproblem. In a follow-up work, Aravena and Papavasillio resorted to an asynchronous approach in [174] that allows time sharing. This work showed that the synchronous approach to solving subproblems could be highly inefficient, as the idle time of computational resources can reach up to 84% compared to the asynchronous algorithm.

State-of-the-Art
While many SCUC parallel formulations use DC networks, the real challenge is using ACOPF, as exhibited in earlier failed attempts [170]. In [175], the conic relaxation approach mentioned earlier [148] was used to turn the AC-SCUC into a Mixed Integer Second-Order Conic (MISOC) program. It allowed the decomposition of the problem to a master problem, where UC is determined, and a subproblem in which the ACOPF is solved iteratively. In their approach, the active power is variable if the master problem is fixed, and only the reactive power is solved to check if the commitment is feasible. Fixing the active power allows for time decomposition since ramping constraints no longer apply in the subproblems. They compared the computational efficiency of their approach against a coupled DC-SCUC and AC-SCUC, solved using commercial solvers such as GAMS, DICOPT, and SBB. Their approach took only 3.3% of the time taken by previous similar work [176] to find a solution at a 0.56% gap. However, their approach faced accuracy and feasibility issues, and the parallelism strategy was unclear since they created eight subproblems while using three threads.
Temporal decomposition was also used on a unique formulation, Voltage Stability Constrained SCUC (VSC-SCUC) [177]. The problem is an MINLP with AC power flow constraints and added voltage stability constraints that use an index borrowed from [178]. APP decomposition was used to decompose the model into 24 subproblems, and it was compared against conventional AC-SCUC on several test cases. It converged after 55 iterations compared to 44 by the AC-SCUC solution on the IEEE-118 case. The structure and goals achieved by VSC-SCUC make tractability challenging, deeming the approach itself promising. However, the study fell short of mentioning any details about the claimed parallel routine used.
Nevertheless, SCUC decentralization is valued for more than performance enhancement. It can help achieve privacy and security, and it fits the general future of the smart grid, IoT, and market decentralization. In a market in which the ISO sets prices of energy and generators are merely price takers, a decentralization framework called "Self-Commitment" can be created from the UC formulation [179]. Inspired by self-commitment, Feizollahi et al. decentralize the SCUC problem relevant to bidding markets, including temporal constraints [180]. They implement a "Release-and-Fix" process which consists of three ADMM stages of decomposing the network. The first stage finds a good starting point by solving a relaxed model. The second and third stages are iterative; a feasible binary solution is found, followed by a refinement of the continuous variables. They used two test cases (3012 and 3375-bus) and applied different partitions from 20 to 200 sub-networks with different root node (coordinator node) combinations. They also varied the level of constraints in three cases, from no network up to AC network and temporal constraints. A sub 1% gap was achieved in all cases, outperforming the centralized solution in the most complicated cases and showing scalability where the centralized solution was intractable. The scalability saturated, however, at 100 partitions, and one of the key conclusions was that the choice of the root node and partition topologies are crucial to achieving gains.
Multi-Area formulations often involve ED, but rarely UC, as seen in [181]. The UC formulation in this study includes wind generation. The wind uncertainty is incorporated using Adjustable Interval Robust Scheduling of wind power, a novel extension of Interval Optimization Algorithms, a chance-constrained algorithm similar to Robust Optimization. The resulting Mixed-Integer Quadratic Programming (MIQP) model is decentralized using an asynchronous consensus ADMM approach. They verified the solution quality on a threearea six-bus system (achieving a 0.06% gap) and then compared their model against Robust Optimization and Stochastic Optimization models on a three-area system composed of one IEEE-118 bus system each. For a lower CPU time, their model achieved a much higher level of security than the other mentioned models. The study mentions that the parallel procedure took half the time the sequential implementation did, promising scalability. However, no details were given regarding the parallel scheme used and implementation.
Similarly, Ramanan et al. employed an asynchronous consensus ADMM approach to achieve multi-area decentralization slightly differently, as their formulation is not a consensus problem [182]. Here, the algorithm is truly decentralized, as the balance in coupling variables needs only to be achieved between a region and its neighboring one. The solution approach is similar to that of the one from [172], where UC and ED are solved iteratively. They divided the IEEE 118 bus into ten regions; each region (subproblem) was solved by one intel Xeon 2.8 GHz processor. They ensure asynchronous operation by adding 0.2 s of delay for some subproblems. The mean results of the 50 runs demonstrated the time-saving and scaling potential of the asynchronous approach that was not evident in other similar studies [183]. However, the solution quality significantly varied and deteriorated, with an optimality gap reaching 10% for some runs, and no comparison was drawn against a centralized algorithm.
In later work, the authors improved the asynchronous approach by adding some mechanisms, such as solving local convex relaxations of subproblems while consensus is being established. This allowed the subproblem to move to the next solution phase if the binary solution was found to be consistent over several iterations [184]. In addition, they introduced a globally redundant constraint based on production and demand to improve privacy further. Moreover, they used point-to-point communication without compromising the decentralized structure. They implemented their improved approach on an IEEE-3012 bus divided into 75, 100, and 120 regions. A 2.8 GHz core was assigned to solve each region and controller subproblem. They compared their approach, this time against Feizollahis implementation from 2015 [180] and a centralized approach. The idle time of the synchronous approach was higher than the computation time of the Asynchronous approach, doubling the scalability for higher region subdivisions. The gap achieved in all cases was larger than that of the Centralized solution by around 1.5%, which is a huge improvement, considering the previous work and the 18x speedup achieved.
Consensus ADMM methods typically do not converge for MILP problems such as UC without a step size diminishing property [185]. Lagrangian methods, in general, are known to suffer from a zigzagging problem. To overcome the issue, the Surrogate Lagrangian Relaxation (SLR) algorithm was used in [186] to create a distributed asynchronous UC. In later work, their approach was compared against a Distributed Synchronous SLR and a sequential SLR [187] using four threads to parallelize the subproblems. With that, better scalability against the synchronous approach was demonstrated, and a significant speedup was achieved (12× speedup to achieve a 0.03% duality gap in one instance).
To avoid the same zigzagging issue, but for Multi-Area SCUC, Kargarian et al. opted for Analytical Target Cascading (ATC) since multiple options exist for choosing the penalty function in ATC [188]. They take the model from [189] and apply ATC from [190] to decompose the problem into a distributed bi-level formulation, with a central coordinator being the leader and subproblems followers. In this work, they switched the hierarchy by putting the coordinator first, making it the follower instead of the leader. This convexified the followers' problem, allowing the use of KKT conditions, turning it into a Mixed Complementarity Problem (MCP). Those steps turned the formulation into a decentralized one, as only neighboring subproblems became coupled. They numerically demonstrated that with their reformulation, the convergence properties of ATC were still upheld virtually and that the convex quadratic penalty functions act as local convexifiers of the subproblems. Moreover, they demonstrated numerically how the decentralized algorithm is less vulnerable to cyber attacks. Unfortunately, the approach was not implemented practically in parallel; rather, the parallel solution time was estimated based on the sequential execution of the longest subproblem.
In a similar work tackling Multi-Area SCUC, a variation of ACT is used [191] in which the master problem determines the daily transmission plan, and each area becomes an isolated SCUC subproblem. This problem is much more complicated, as it involves AC power flow equations, HVDC tie-lines, and wind generation. The power injection of the tie lines is treated as a pseudo generator with generation constraints that encapsulate the line flow constraints. This approach removes the need for consistency constraints used in traditional ATC-based distributed SCUC, such as the ones used in [190]. In the case study, they subdivide several systems into three-area networks and split their work into three threads. Comparisons were drawn against a centralized implementation, the traditional distributed form, and four different tie-line modes of operation with varying load and wind generation. Their approach consistently converged at lower times than the traditional ATC algorithm. It slightly surpassed the centralized formulation on the most extensive network of 354-bus, which means the approach has the potential for scalability.
Finding the solution of a UC formulation that involves the transmission network, active distribution network, microgrids, and DER is quite a leap in total network coordination. This challenge was assumed by [161] in a multi-level interactive UC (MLI-UC). The objective function of this problem contained three parts: the cost of UC at the transmission level, the cost of dispatch at the distribution level, and the microgrid level. The three levels' network constraints were decomposed using the ATC algorithm, turning it into a multilevel problem. A few reasonable assumptions were made to aid in the tractability of the problem. The scheme creates a fine-grained structure at the microgrid level and a coarser structure at the distribution level, both of which were parallelized. The distribution of calculation and information exchange between the three levels provides more information regarding costs at each level and the Distribution of Locational Marginal Pricing (DLMP).
Most of the previous work in power system problems-apart from UC-focuses on the solution process rather than the database operations involved. In [192], a parallel SCUC formulation for hydrothermal power systems is proposed, incorporating pumped hydro. This paper uses graph computing technologies and graph databases (NoSQL) rather than relational databases to parallelize the formulation of their MIP framework. Their framework involves Convex Hull Reformulation, a Special Ordered Set method to reduce the number of variables of the model, constrained relaxation techniques [193], and LU decomposition. The graph-based approach showed significant enhancement in speedup over a conventional MIP solution method on a Tigergraph v2.51 database. Similar applications of NoSQL were explored in other power system studies [194][195][196].
In real industrial applications, there is a lower emphasis on the accuracy of the solution, and a high-speed "good enough" policy is adopted, often using heuristics extensively. Midwest ISO (MISO) published a few papers showing the development of their Day-Ahead DC UC decision-making in 2016 [197], 2020 [35], and 2021 [198]. Their HPC parallel approaches introduce novel strategies as part of the HIPPO project [199], focusing on finding smart heuristics to speed up SCUC decomposition and distributed methods. MISO has been using CPLEX to solve day Ahead SCUC and SCED for 50,000 binary variables and 15,000 transmission constraints over 36 hourly intervals, and they limit the day ahead of SCUC to 1200 s. Figure 9 illustrates the processes run by the HIPPO system in parallel. They run several algorithms in parallel to ensure continuity. If the most accurate fails to converge within its time limit, the solution of the next most accurate algorithm is taken instead, such as SCED. The convergence criterion of the optimality gap is 0.1% which amounts to about $24,000. They use surrogate ADMM, Lazy transmission constraints from experience, and a Polishing-E method, which reduces the set of possible generators and an Uplift function to choose a good set of generators.
In their latest work [198], they introduce a neighboring search algorithm that improves their E-polishing algorithm and the selection of a set of lazy constraints. This HIPPO system uses 18 nodes and 24 2.3 GHz 64 GB RAM on the Pacific Northwest National Laboratory (PNNL) HPC cluster and uses Gurboi to solve their problems. In their study, they compared the time it takes to sequentially solve 110 different SCUC problems of different complexities against using HIPPO. They showed that problems that take a long time sequentially experience a more drastic speedup with HIPPO, meaning their system scales efficiently. One problem that took 2000 s in full sequential MIP was solved in 200 s with HIPPO. For all the tested cases, the highest solution time on the MISO network using HIPPO was 633 s, under the required standard time limit for finding solutions. In addition, they explore the possibility of solving at 15 min rather than hourly intervals, as this is more appropriate for RE generation. They solve the MIP for one hour and then feed that solution to the 15-min interval problem as an initial point. Compared to using root relaxation to solve 15-min intervals, a huge jump in speed is achieved once again for the more complex problems. Figure 9. HIPPO Concurrent Optimizer (vertically aligned processes run in parallel). Table 5. Unit Commitment MIMD based state-of-the-art studies.

Paper
Contribution [174] 84% idle time reduction with an Asynchronous approach and Time-Sharing [175] AC-SCUC converted to Mixed Integer Second Order Conic program [177] Voltage Stability Constrained SCUC with temporal decomposition using APP [180] Decentralized Market SCUC using ADMM for temporal decomposition [181] Multi-Area UC, a novel extension of Interval Optimization & Asynchronous ADMM [182] Non-consensus multi-area decentralization using iterative ED and UC [184] Asynchronus decentralized UC by convex relaxations and other techniques [186] Achieving faster convergence in distributed asynchronous UC by SLR [188] Multi-Area SCUC avoiding the zigzagging issue at convergence by using ATC [191] Multi-Area AC-SCUC distribution using a variation of ATC. [161] Multi-level interactive UC Involving ADN, micro-grids, and DER [192] Data parallelism focusing on graph databases (NoSQL) [35,197] Commercial HPC UC HIPPO project novel strategies [198] Neighboring search algorithm improving HIPPO from [35]

Power System Stability
Power system stability studies in this section include Static, Transient, and Dynamic Stability. A power system is considered stable if it can regain operating equilibrium with the entire system intact after being subjected to a disturbance. Depending on the type of study, the stability metric could be the line flows, the generator rotor angle, or bus voltage and frequency [200,201].
Static stability analysis involves solving the power flow or optimal power flow under contingency conditions. This type of study could involve either contingency screening (solving various PF problems in parallel) or solving the OPF or other power system problems monolithically under various contingencies when it comes to transient and dynamic simulations. These studies solve a set of differential and algebraic equations (DAE), which can be represented as: where x is the state vector of differential-algebraic variables besides voltages, V is the network's voltages vector, and Γ is the diagonal matrix where its elements are zero, where the equation is algebraic, and one where the equation is differential. Parallelism of those types of power system studies has been one of the most abundant and earliest to investigate, especially in transient stability. Most of these studies are performed offline, and the goal is to speed them up to solve them in real time.
7.1. MIMD Based Studies 7.1.1. Development Power system operators need to detect system states or schedules that carry the risk of steady-state emergency if single or multiple equipment failures occur. This is to alleviate that risk, either by changing the system state to avoid such an emergency (preventative measure) or by employing a form of control strategy that would mitigate that emergency if it occurs (corrective measure) [200]. Contingency screening, in particular, is an embarrassingly parallel task and one of the easiest to parallelize, as it involves splitting several conventional power flow problems equivalent to the number of possible contingencies over multiple job arrays.
In earlier studies, the parallel steady-state analysis involved parallelizing the different contingency cases and not the power flow algorithm. The first successful attempt at distributed contingency screening might have been in [202], where the process was distributed over four DN425 processors. By adding pre-filtering schemes and strategies to reduce the computational burden, real-time static security assessment was already achieved in the early 2000s using multi-processing [203,204]. Further studies in enhancing the allocation of resources and dynamic load balancing in order to reduce the idling time of processors were performed by PNNL on their local cluster with the aid of MPI [205,206]. The same research team followed up the work by applying parallel betweenness centrality to identify the high-impact transmission lines in the screening process.
Some work in the area emphasized processor load balancing to task scheduling for contingency screening, from master-slave scheduling to proactive task scheduling. Incorporating both multi-processing and multi-threading on various systems and using various concurrent programming languages such as D [207] and X10 [208]. Most of the work today for static security involves improvement in whole EMS systems and software, and most of the modern work involves the efficient allocation of cloud-based resources. However, more elaborate schemes are appearing that involve parallelizing the OPF within the contingency analysis, which offers room for improvement, especially when considering more complicated OPF and post-contingency network models.
Dynamic stability is another form of steady-state analysis that evaluates the system condition and oscillatory behavior after very small signals and disturbances that last for up to 30 s due to fluctuations in generation and load levels or controllers. It is an obvious parallelism candidate, since it is the most intensive computational task in power system studies, as it models the electromechanical interaction between system compo-nents and their controllers. It is also one of the most important and is directly related to secure operations. Figure 10 illustrates the two goal-posts of dynamic simulation during its development. Real-time simulation means the computational time matches the duration of the simulated interaction. Faster than real-time simulation means the computational time is lower than the duration of the simulated interaction. The goal for achieving practical real-time parallel dynamic simulations was already being set in the 1990s [209]. Pioneering applications used the Conjugate Gradient method on the Cray Y-MP, iPSC/860, and IBM 3090 mainframe [210,211]. Parallel dynamic simulation studies quickly moved to large system implementations emphasizing balanced network partitioning and computational load and creating parallel software tools in the 2000s [212]. Faster-than-real-time simulations became the default, and the first faster-than-real-time parallel dynamic simulation was achieved on the WECC system for the first time in 2013 [213]. Transient stability studies are made to ensure that after large disturbances such as circuit breaker trips or load loss, the system remains synchronized and can return to normal conditions. Parallel transient stability was explored in abundance, as they are critical, and the trapezoidal integration method makes them disposed to parallelism. Fernando L. Alvarado, in an impactful paper [214], demonstrated analytically that for time T and with some T 2 processors, transient stability differential equations could be solved in time order of log 2 T using the trapezoidal method with potentially better convergence properties than serial implementation. The Electric Power Research Institute made a report in 1977 exhibiting various works exploring potential parallel applications for power system analysis [42].
Most of the work until that point discussed potential parallel methods to apply on parallel machines such as Cray-1, CDC STAR-100, IBM 370-195, and ILLIAC. Moreover, much like power flow development, many proposed parallel architectures that would exploit the parallelism in using methods such as Chaotic Relaxation, BBDF Gauss-Seidel, and Newton's method [215,216]. Some simulated the performance of new microcomputer and processor architectures using networks of computers or existing supercomputers, such as the CDC 6600 [217,218].
The first parallel implementation of transient stability might have been the first parallel power system study implementation by F. Orem and W. It was solved using a CDC 6500 equipped with an AP-120B array processor hosted on a computer. Comparisons on different hardware, such as vector processors vs. array processors or Cray-1 vs. IBM-3081D, were made using the trapezoidal integration method with linear and non-linear loads [219,220]. Different decomposition methods started to be introduced in the literature, including parallel-in-space and parallel-in-time approaches. The combination of the two was achieved while using "Nested Iteration" and time-windowing to enhance convergence in [221][222][223].
The variety of algorithms used and the discovery of different parallelism paradigms in transient stability simulation were promising for parallelism in power system studies in general. SOR and The Maclaurin-Newton Method (MNM) were used on the Alient and IPSC machines in [224,225]. The waveform relaxation method (WRM) was proposed by [226] to decompose the non-linear system into several dynamical subsystems to be solved in parallel, followed by [221,227]. A Parallel-in-Frequency paradigm was introduced with a demonstration of the possibility of vector processing coarse-grained algorithm [228]. More complicated fine/coarse-grained frameworks combining general processing (CRAY Y-MP9/464) and vector processing [229].
The possibility for real-time simulation was first demonstrated on the NCube2 Hy-peCube [230]. HyperCube machines were popular in the early 1990s for transient studies exhibiting various techniques such as SOR [231], and LU factorization path trees with different communication scheduling techniques [91,232].
Message passing in heterogeneous clusters started appearing in the late 1990s. This included Real-time Contingency and Transient stability with PMV [233], and other formulations with MPI [234,235]. Faster than real-time transient stability simulation was achieved by combining MPI and multi-threading techniques with network and time-domain decomposition in [212,236,237]. Later algorithms were used [238]. A full rundown of most of the techniques used until that point for large-scale transient stability studies can be found in [239].
Electromagnetic Transient studies (EMT) are transient studies that assess systems' overcurrents and overvoltages due to fault conditions or large disturbances. EMT simulations are the most accurate tools used to describe fast dynamics of power systems, and hence are the most computationally intensive. Software such as PSCAD uses an EMTPtype program, or EMtp, which is the most widely used algorithm for this type of study. Parallel EMT studies were first proposed in the early 1990s on MIMD hardware [240] and implemented using network partitioning techniques on a hypercube machine with care to load balancing [241]. Since then, various parallel approaches have been used, parallelizing the problem in space and time. The Very Dishonest Newton method, as implemented on multi-computer setup [242], on FPGAs real-time simulation of EMT was aimed for [243]. Techniques such as system partitioning and solving short time-steps for more dynamic parts and long time-steps for others in combination with multi-processing have been proposed in [244], but with no actual implementation.
Most of the recent work today in power system stability studies, particularly transient stability, takes place on SIMD architecture (GPUs specifically). The traditional way to perform the study is similar to that of regular transient studies. Thus, in-space-in-time decomposition can be achieved with similar techniques such as LU factorization [235], forward and backward substitutions [242], and graph theory [245] to be solved in parallel. Table 6 summarizes the critical work of this section. When it comes to probabilistic studies performed in contingency analysis, not many studies apply the N-2 criterion. Duan et al. performed N-2 contingency analysis generating with PPF for each contingency case [246]. With an IEEE-300 test case, using AC power flow equations and the NR method, and 1000 MC system state scenarios, the solution was distributed on the Danzek HPC cluster at Manchester. The solution time was 168 h. The only thing that this work offers is a testimony to the complexity of full AC Power flow equations. N-2 criterion merely increases the number of embarrassingly parallel tasks. Any real improvement needs to be made either with a finer decomposition of the embedded power flow problems, or by incorporating complexity formulations used.

State-of-the-Art
In another N-2 reliability contingency analysis, transmission switching was incorporated for corrective action [247]. This work performed a dynamic stability analysis of the corrective transmission switching action to ensure its viability. AC power flow equations were used, and the analysis was conducted on the PJM interconnections (15.5 k bus system).
Both transmission and generation contingencies were considered, creating around 1.4 m contingencies. Heuristics were used to reduce the additional computation. The approach was very effective, as the solution time was reduced to 96 s using 128 threads (compared to 999 s with eight threads). This contrast between this result and the previous study demonstrates the importance of having a good parallel scheme.
One of the major challenges that face current implementations is the lack of standardization, as reflected in studies such as S. Jin et al. [248], which compared four different parallel implementations of the dynamic model. Some implementations were run on the IEEE-288 bus system and others on the WECC system, using different supercomputers with different hardware allocations for each implementation. This variety resulted in valuable but difficult-to-compare observations due to the unequal testing grounds. The work concludes that the direct integration method with a fast direct LU solver is the recommended approach for HPC dynamic simulation, as it enables faster than a real-time solver. Nevertheless, the recommendation is based on trials that are hard to compare fairly.
In an impactful series of works, P. Aristidou et al. implemented a parallel dynamic simulation on a transmission-distribution network using the Very Dishonest Newtons Method (VDHN) and Schur-Complement-based decomposition in 2014 [249] 2015 [250] and 2016 [251]. Their most significant test case included 15,226 buses, 21,765 branches, and detailed 3483 generator models, including their excitation systems and governors, voltage regulators, and power-system stabilizers. The software was written using standard Fortran language with OpenMP directives. The authors were very thorough in analyzing the performance, as their approach was compared against several, including fast and widely implemented sequential algorithms in terms of speedup and scalability. They tested their algorithm on various platforms, with the highest speedup achieved being ×4.7 on 24-core AMD processor-based systems. Faster than real-time dynamic simulation of the entire WECC system, interconnection (17,000 buses) was achieved in [109]. The simulation included machine models, exciter, governor, relay, and network models. With 16 core cores, they simulated a 0.05 s three-phase fault lasting for 20 s, with 0.005 s time steps, within 15.47 s. They achieved this feat using an open source HPC framework called GridPack, which they developed. GridPack is further explored in the discussion section of this review. Table 6. Power System Stability MIMD-based state-of-the-art studies. [246] N-2 contingency analysis with Probabilistic Power Flow [247] N-2 contingency analysis with corrective transmission switching [248] Comparison of four parallel dynamic simulation models [251] Dynamic simulation transmission-distribution network using VDHN and Schur-Complement. [109] Faster than the real-time dynamic simulation of the entire WECC system 7.2. SIMD-Based Studies 7.2.1. Development SIMD architecture has long been used for transient and dynamic simulations to solve the non-linear differential-algebraic equations (DAE) and parallelize the trapezoidal rule, and GPUs have been used in power system studies as early as 2007 [252].

Paper Contribution
The first paper using GPU for static stability was published the same year CUDA was released, where DC Power flow contingency analysis was performed to solve the Gauss-Jacobi algorithm [252]. The paper used a NIVIDIA 7800 GTX card and direct instructions as opposed to a CUDA interface. GPU-based contingency analysis studies have recently been enhanced with pre-conditioning methods such as the Krylov theory [253]. Pre-conditioned Conjugate Gradient method [254] and compensation method and FDPF to parallelize AC Power Flow (ACPF) within contingency analysis [255].
For transient stability, A parallel program was developed and used by engineers in Hydro-Quebec in 1995, which simulates transient stability by parallelizing the Very Dishonest Newtons Method (VDHN) with LU Decomposition on the shared memory machine Sequent Symmetry S81 [256]. Waveform-Relaxation method was used later on the same machine [257]. The first re-purposing of non-GPU image processing hardware for power system studies might have been in 2003 in [245], where PULSE IV image processor was used to achieve real-time transient stability simulation on WSCC 9 bus test case.
The first huge leap in performance was by Jalili-Marandi et al., in a hybrid approach achieving a speedup of up to ×344 for a 1248-bus system compared to CPU only approach [258]. Later Jalili-Marandi et al. refined the algorithm to perform all the calculations on GPU on an almost purely SIMD-based approach. Which proved to be more effective beyond 500-bus size systems [259]. In their last work, they showed the potential of GPU augmentation to enhance the inner solution performance. An instantaneous relaxation technique involving full newton iteration, implicit integration, and a sparse LU linear solver was tailored to run simultaneously on four T10 GPUs. A 9985 bus system generates a 22,272 × 22,272 matrix and 99.69% sparsity, which was solved within 5 min [260].
Yu et al. [261] performed another hybrid-based transient stability study by constructing a Jacobian Free Newton Generalized Minimal Residual method, which approximates the Jacobian vector products using the finite difference technique. While it eliminates a jacobian matrix step, it still requires heavy matrix-vector multiplication for a pre-conditioning step, making it suitable for GPU. This approach proved scalability starting from a 216-bus sized system, and it outperformed the newton based transient simulation solver PSAT. This approach showed better consistency performance enhancements for various sized systems compared to a similar work, where a pre-conditioning step was parallelized prior to the GMRES method [262]. Yet, using pre-conditioned GRMES in a combination of intime coarse-grained schemes and in-space fine-grained schemes with GPU acceleration as in [263] showed universal scalability whether the number of GPUs used or the problem size increased.
There have been a few GPU-based parallel EMTP-type simulators for EMT studies. Some integrating wind farms to the models [264]. Others added complicated controls to large-scale systems with PV, transformers, and reactive components [265]. GPUs were used primarily to solve the linear algebraic equations associated with the algorithm, while the CPU performed most of the other parts. However, "Full" GPU-based parallel solvers that parallelize numerous other steps of the algorithm were developed recently [266].

State-of-the-Art
The most relevant recent work in power system stability mainly involved SSA and EMT and it is summarized in Table 7. In [267], Zhou et al. continue their work in N-1 SSA, this time treating the DC power flow contingency screening part, where only branch thermal violation is accounted for. This study tries to apply the Critical Contingency list contingency screening on GPU. DCPF-based Contingency screening involves dense vector operations. This paper claims to be the first of its kind to present a novel GPU-accelerated algorithm for DC contingency screening, where they optimize the data transmission, parallelization, memory access, and CUDA streams in their algorithm. The presented algorithm was tested on 300-bus, 3012-bus, and 8503-bus systems. The hardware used was A Tesla K20C GPU, and the host was Intel Xeon E5-2620 2 GHz CPU. They compared the performance against a single-threaded CPU-based algorithm implemented on a higher-end Intel Core i7-3520M 2.90 GHz CPU and 4 GB memory notebook. Their largest test case exhibited a speedup of 47× over the sequential case, demonstrating the scalability of their approach. They achieved this by reducing the data transmission by ×50, further optimizing task allocation and thread/block redistribution, and using memory coalescing to enhance memory access. The last improvement is particularly important, as memory handling is often ignored in the field, yet it is very crucial. Figure 11 demonstrates the impact category of the coalescing strategy used. Single threads often access different chunks of memory locations at the same time; when GPUs are most efficient when multiple threads access contiguous memory locations at the same time, this is called coalesced memory access.
To tackle the same problem, Chen et al. [268] use a slightly different GPU implementation, which exhibits a pipelined fashion. They employ a two-layered parallel method.
In the first layer, they apply a hierarchical path tree parallel LU decomposition for all contingency cases. In the second layer, they solve the decomposed problems for every contingency in parallel. The first layer process is repeated for every contingency case in parallel, sending groups of threads for each contingency. This means that the same process will run for several contingencies simultaneously, subject to the GPU's number of thread blocks allowed to run simultaneously. They employed an asynchronous scheme in which the CPU performs convergence checks. It receives the output convergence of three cases simultaneously; if one contingency calculation converges, the next one in line is sent to that available thread block. In their work, they pay attention to data management and utilize the cache architecture of GPU to improve their process. They compare their GPU approach against a KLU-based commercial suit that solves the problem on the CPU. In the largest case and using 32 GPU thread groups, their algorithm shows a 9.22× speedup over a single-thread CPU solution and a 3× speedup on a four-threaded CPU solution. In this study, the importance of matching thread number to warp number is demonstrated because using 32-thread groups did not make a massive difference to the speedup against 16-thread groups. In [269], Zhou et al. expand on their previous work and attempt to skip the screening step, directly solving ACPF for all cases in a batch-ACPF solver. Their framework packages the batch-ACPF into a new problem through which a high degree of parallelism can be achieved. They created dependency graphs and used a QR left-looking algorithm for its numerical stability compared to LU decomposition. They compared their solver against commercial multi-threaded CPU-based SLS solvers KLU and PARADISO in several test cases. On an 8503-bus system, their solver took 2.5 s vs. 9.9 s (4× speedup) on a 12-threaded KLU solution and 144.8 s (57.6× speedup) on a sequential single-threaded solution. At face value, a GPU approach is superior. However, due to memory bandwidth limitations, adding more than ten CPU solvers would not have enhanced the commercial solver performance. Thus, performance-wise (core to core), this conclusion cannot be drawn. Economically speaking, the GPU is far superior, as adding a single GPU is much cheaper than several equivalents compute nodes.
Following the trend of resurrecting iterative solvers for PF applications mentioned earlier, Zhou et al. designed a GPU-accelerated Preconditioned Conjugate Gradient solver to solve the same ACPF for N-1 SSA in 2020. They tested their GPU-designed algorithm on 118, 1354, 2869, and 10,828 bus systems, the last system being the East China power grid. The number of contingencies was, respectively, 177, 1728, 4204, and 11,062. They compared their approach to two solution algorithms: 1-Complete LU SSA solution, which was implemented on a single core. 2-Rank-one update-based solution implemented on a single, four, and eight cores supported by a multi-threaded solver. The hardware specs were similar to previous studies. With their GPU implementation of this algorithm on the East China power grid, the solution speedup was 4.9× compared to the eight-core multithreaded CPU implementation. Zhou's works show that solvers for power system-related problems tailored for GPUs have considerable potential.
When it comes to EMT, the first record of GPU use is in [270], where co-processing of vector operations for 117 bus networks on GPU had double the speed of PSCAD, a CPU-based software. In later work, they reduced the communication between the CPU and GPU and derived a GPU-specific algorithm to achieve close to ×40 speedup on a 3861 bus network [271]. A similar implementation that aimed to reduce communication time and fully and efficiently exploit the SIMD architecture in EMT was also conducted in [272].
Earlier work by Zhou et al. was a GPU implementation of Electromagnetic Transient Simulation [273]. They utilize both SIMD and scalar operations and emphasize the importance of avoiding simultaneous memory access when parallel processing. The homogeneity of tasks was ensured where elements that are modeled similarly were grouped; for example, all RLC elements were processed in the same kernel with a unified lumped model. Separate Kernels were made for the Universal Line Model (ULM) in four stages, where in every stage, kernels were executed simultaneously. The Unified Machine Model (UMM) simulation, the third ubiquitous task, was also divided and managed in detail. The level of parallelism in this work is massive and attempts to squeeze every parallel structure in the problem and an ounce of performance from the GPU. The algorithm was tested on eight test cases from 40 buses up to 2458 bus systems. They used a NIVIDIA Tesla C2050 GPU with 448 cores and 3 GB memory and an AMD Phenom 4 core 3.2 GHz CPU. The total simulation time was 100 milliseconds at 20 microseconds. On that setup, a speedup of 5.63× was achieved compared to an optimized commercial software EMTP-RV.
Finally, in one of the most impactful papers by the same authors, a huge test case containing 240,000 buses was decomposed with propagation delay [274]. The system was divided into linear, non-linear, and control subsystems. Additionally, the Jacobian domain and the voltage calculations were parallelized, creating another decomposition layer, resulting in a highly fine-grained problem. Two GP104 GPUs were used in which all the iterations were processed, and convergence was checked. Using a single GPU against the EMTP-RV resulted in a 15× speedup. Moreover, GPU linear scaling was demonstrated by achieving double the speedup (30×) by adding the second GPU. However, it is important to note that the 240,000 bus system was an augmentation of the IEEE-39 bus system. Table 7. Power System Stability SIMD-based state-of-the-art studies. [267] N-1 SSA with DC power flow branch contingency screening [268] A two-layered parallel method which exhibits a pipelined fashion [269] Direct ACPF solve in a batch-ACPF skipping screening [270] First GPU parallel EMT [271] Improvement over EMT with reduced CPU-GPU communication [272] Fully and efficiently exploit the SIMD architecture in EMT [273] Utilize SIMD and scalar operations avoiding simultaneous memory access. [274] Fine-grained algorithm by parallelizing the Jacobian domain and the voltage calculations

System State Estimation
Power SSE is a centerpiece of control centers. Thousands of voltage measurements are collected from SCADA systems and Phasor Measurement Unit (PMU) and processed to understand the conditions of the grid better. SSE studies provide accurate and reliable estimates of the phase angle and bus voltages from incomplete system measurements.

Development
Parallelizing SSE, much like other studies, started with simulations. Y Wallach and E. Handschin proposed a distributed master-slave topology [275], showing that merely partitioning the network would achieve speed gains. Later in 1982, C. W. Brice and R. K. Cavin Simulated the potential performance of distributed and decentralized algorithms on parallel hardware for state estimation, where one is communication intensive, and the other is computationally intensive [276]. Different SSE decomposition techniques and parallel simulations were carried out over the 1980s and 1990s. Those include block partitioning to decompose the state estimation problem by the network simulating the performance on a MIMD machine [277], parallel forward-backward substitution [278], recursive quadratic programming [279], the Dantzig-Wolfe Decomposition Algorithm [280], and other simulations [281]. The first practical implementation was in the year 2000 using APP [282,283].
The Weighted Least Square (WLS) algorithm is the most commonly used in SSE. WLS contains a matrix inversion step, which can be solved using LU decomposition. The first parallel WLS solver implementation might have been in [284], where shared memory vs. MPI schemes solving the linear system of equations were compared. The exploitation of parallelism in the Khan Filtering Method only showed up in 2009 [285], although it has been in use since the 1970s to improve the prediction aspect of SSE.

State-of-the-Art
The most relevant work made in this area on MIMD is by Korres et al. They used an efficient distributed WLS algorithm to perform multi-area state estimation in parallel using MPI [286]. They tested the algorithm with several processors from 1 up to 60, solving problems using the scientific toolkit PETSc, which contains parallel optimization linear and non-linear solvers. They employed different communication/coordination strategies and control area partition numbers and sizes to estimate the state of an 1180 bus system (10× IEEE-118 system) in two cases, where case 2 exhibited more interconnections between "slave" areas. The algorithm was implemented on the National Technical University of Athens cluster consisting of 11 Intel Core 2 Duo E8200 PC nodes. However, in this work, scalability was demonstrated, but it lacked a speedup comparison against the fastest sequential algorithm.

Development
Most of the development of state estimation occurred over MIMD studies. The earliest significant SIMD application occurred in [245], where the PULSE IV, a scalable SIMD Chip, was designed to help achieve faster-than-real-time application in 2003. Table 8 summarizes the critical work of this section. Several uncommon SSE techniques were related to GPU, such as the Fast Decoupled State Estimation [287] and selective column identification in the numerical differentiation [288,289]. The WLS algorithm is the most commonly used for electrical SSE. WLS contains a matrix inversion step, which can be solved using LU decomposition. That is what Karimipour et al. did in [290], in which they implemented all the steps of the solution algorithm from the Admittance matrix formation to the convergence check on GPU. The GPU used was a 512-core NIVIDIA with double-precision peak performance and the Intel Xeon E5-2610 2GHz CPU as a host. The test system used was an IEEE-39-bus system, duplicated and interconnected to create larger systems of up to 4992 buses. They achieved a speedup of up to 38 for the largest system compared to a sequential CPU implementation. The algorithm exhibited strong scalability, and they estimated that the maximum theoretical speedup achievable by that GPU is 312× for this algorithm. Notably, they also address the issue of solution discrepancy due to the hardware architectural difference. They did this by considering both Correlated and Uncorrelated Gaussian Noise in the measurement samples (to consider bias) in small test cases, which is supposed to lead to larger errors in the final result. The errors that occurred in the GPU solution matched those in the CPU solution, which confirmed the robustness of their algorithm.

State-of-the-Art
Later Karimipour et al. produced a GPU parallelized Dynamic State Estimation based on Kalman Filters [291] on a Tesla S2050 GPU. Compared to a quad-core CPU, the approach achieved a speedup of ×10 for a 19,968 bus system with 5120 generators exhibiting close to zero error of estimation. They extended and refined the approach by increasing the GPU portion of work and utilizing PMUs, and SCADA measurements [292]. For a smaller system (4992-bus), they achieved a higher speedup (15×) with high voltage precision (0.002 p.u. and 0.05 rad error). Finally, they made the algorithm robust against coordinated false data injection, enabling its detection through the parallel algorithm and optimized, secure PMU installation [293].
In [294], the Dishonest Gauss Method in the WLS algorithm is used, where the Jacobian update is not executed at every iteration. The algorithm was implemented on a Tesla K20c GPU, fragmenting the original by vectorizing multiplication and multi-threading addition processes. They investigate the method's accuracy and show that to get an accuracy of 100%, the Jacobian needs to be updated every seven iterations. They also demonstrate the algorithm's robustness by applying different noise levels to determine its effect on accuracy, showing the method ranges between 98-100% at different levels of noise. A complete mathematical analysis of the convergence of their method was conducted in [295]. Real-Time Digital Simulators measurements were used while adding errors that vary from 1-15%, and 30 samples per second were taken, a typical PMU sampling Rate. On an IEEE 118 system and a three-second duration, their algorithm achieved a 15× speedup over what is claimed to be the best existing CPU implementation. Table 8. State Estimation SIMD-based state-of-the-art studies.

Paper
Contribution [287] Parallel fast decoupled state estimation [288,289] Parallel selective column identification in the numerical differentiation [290] Implemented all the steps of WLS algorithm on GPU [291] GPU parallelized Dynamic State Estimation based on Kalman Filters [292] Increasing the GPU portion of work and utilizing PMUs, and SCADA measurements [293] Enable PMU false data detection through a parallel algorithm [294] Fragmenting Dishonest Gauss method by parallelizing multiplication and addition [295] Mathematical analysis of the convergence of their parallel method [294]

SIMD-Based Studies
Almost all of the previous studies involve classical power system optimization problems, the ones showcased in Figure 12. These can be augmented and combined in various ways to achieve various objectives. Thus, a few unique and computationally expensive formulations in the literature sprouted with parallel implementation as summarized in Tables 9 and 10. A common one is an OPF problem that ensures the voltage stability in the solution, the Transient Stability Constrained OPF (TSCOPF). Adding such constraints complicates the problem, but it was shown that by using techniques such as Benders decomposition [296] and reduced space IPM [297], a remarkably faster solution can be achieved prior to parallelizing the subproblems.
In one formulation by [298], the TSCOPF is combined with UC to create a Transient Stability Constrained UC (TSCUC). The criterion of transient stability is given by the following inequality constraints: where COI refers to the moment of inertia of the rotor. The previous equations, along with others in the UC formulation, add temporal complexity. Thus, the TSCUC formulation was decomposed temporally using APP, and 24 cores were used to solve it on the IEEE-300 system. The model showed notable scalability; the solution time was reduced from 16 h to 1 h. Furthermore, using the same pre-defined contingency, the transient stability of the firsthour interval of the solution was examined and compared to a standard SCUC solution. The TSCUC solution maintained the whole system's stability without any alteration, whereas the SCUC solution failed to regain stability even by modifying the power output postsolution. This means that the proposed model guarantees stability as opposed to the conventional one. Frequency Figure 12. The occurrences of parallel classical power system studies. Table 9. Unique Formulations and Other SIMD-based state-of-the-art studies.
Paper Contribution [298] TSCUC (TSCOPF + UC) temporal decomposition using APP [299] ACOPF mitigating electromagnetic storm disturbances using APP Deviating slightly to the security side, interesting work by [299] investigates mitigating the disturbance effect of geomagnetic storms on power systems. Those manifest as low-frequency, quasi-DC, high-impact extreme events. Tools for large and complex power systems to mitigate this problem do not exist; thus, this paper proposes a parallel solution approach to conduct optimal secure operation planning considering geomagnetic storm disturbances (GMD). They include Geomagnetic Induced Current (GIC) into an ACOPF model and turn it into an MINLP by modeling GIC blocking components holding three states, bypass, resistor, and capacitor. APP is used to decompose the problem into two problems, an NLP, solving for ACOPF, and a MILP which determines the state of switchable network components and calculates the GIC flow. The two subproblems can be solved independently and in parallel, coupled by a third model. Tests were performed on a multi-core workstation showing the utility of this approach in mitigating GIC for a 150-bus network. However, the authors did address that their extensive experience with the method guarantees that their APP approach results in an effective solution.

MIMD-Based Studies
Table 10 summarizes the critical work of this section. There have been a few venturesome studies, such as in [300], which used GPUs to accelerate a stability analysis involving both transmission and distribution systems into the network, which achieved some performance improvement, but the speedup factor might not justify the use of power. In other formulations, the gas and thermal systems are modeled and coupled into the electrical system to calculate the energy flow using the Inexact Newtons method and GMRES pre-conditioning [301]. Contrary to previous studies on GPUs, this study showed that the larger the system, the more time is required. This is a testimony that care must be taken in routines and schemes that would scale with the hardware. Finally, the last notable work is in the AC TSC-OPF using GPU acceleration [302]. TSC-OPF is an extension to ACOPF with EMT dynamics and stability constraints. Below is the compact formulation of the problem: where u is the control variable, including active and reactive generation outputs. v 0 and v i (t) refer to steady and transient state variables of the ith contingency in the differential equation F i . n C is the number of contingencies. The constraints in the vector expressions above include the steady state and transient-state constraints, DAE, and initial conditions. The elaborate model can be found in [303]. The reduced-space IPM was used in this study, and it was the portion that was decomposed and parallelized on GPU using Schur's complement. They used a 12,951 bus system in the largest case and compared their GPUaccelerated algorithm to a single sequential core and parallel 16-core CPU implementations. Their algorithm achieved a speedup of ×24 and ×6.5, respectively. Table 10. Unique Formulations and Other MIMD-based state-of-the-art studies.

Paper Contribution
[300] Transmission and distribution system stability analysis on GPU [301] Thermal & electrical system coupling using Inexact NR and GMRES [302] AC TSC-OPF using GPU acceleration Table 11 summarizes the latest contributions in Grid and Cloud-based studies. The use of HPC facilities in the electrical power industry is not uncommon in various offline (and some online) applications, especially ones related to smart grids and microgrid planning [46]. For example, California Independent System Operator (CAISO) uses HPC to perform various real-time assessments of the network, such as reliability assessments [34]. Facing the huge computational load, ISO-NE installed an on-premise computer cluster in 2007 using EnFuzion as a job manager [304]. They later faced challenges in choosing the optimal size for clusters and investment in computational power, since the peak computing jobs and average ones were very different. Hence, it made sense to move some non-emergent applications to Cloud. In fact, ISO-NE had already initiated a project to adopt cloud computing with emphasis on achieving privacy and security [305,306]. When facilities begin to struggle to meet the increasing requirement of deployed power system applications, it makes sense to resort to cloud services. Cloud computing really expands the realm in which algorithms and systems can be parallelized and exhausted. Regulators and players will not have to worry about the availability of resources any more. Instead, they will squeeze out every inch of performance and manage the "rented" resources. Unneeded capital investment can be avoided, and real-time data can be shared with third parties.

Grid and Cloud-Computing-Based Studies
When it comes to Grid and Cloud computing studies, performance enhancement is often sought through scalability and resource availability rather than optimizing for specific hardware. While this works very well, the combination of fine optimization would be much more powerful. However, this might be only possible through the Grid rather than the Cloud model, since it is more controllable. Most of the work in this area is very recent, but it starts with a few studies on Grid Computing. In an application that is very similar to the Cloud computing paradigm, the work by Morante et al. [307] might have been the first modular and hardware scalable implementation of parallel contingency analysis on a grid of eight heterogeneous computers. A middleware called Hierarchical Metacomputers (HiMM) was used to allocate resources economically based on resource adequacy and a given budget value. By increasing the budget value, their middleware managed to lower the execution time by exploiting more expensive, more powerful resources. Other papers from the time explored the idea of monitoring and control of the power system using decentralized schemes on grid computing [308]. A few more studies explored Grid-based frameworks and applications, such as Huang et al. [309,310] and Ali M. [311], load flow on Grid by Al-Khannak et al. [312], and dynamic security assessment by Xingzhi Wang et al. [313].
There were a few recent studies that showcased large-scale cloud implementations. One study was part of a diverse paper showcasing the challenges and experiences gained by ISO-Newengland in moving to cloud services. In their move, they used Axceleon CloudFuzion [314] job balancer, which provides high failure tolerance and job monitoring. The work involves heuristics and operational decisions, providing a great insight into the methodologies and equations used to choose the number of instances and squeeze every bit out of the rented hour. An N-2 contingency analysis was performed on a test case that takes 470 h on a regular workstation; the case jobs were carried out in less than an hour with their scheme. CloudFuzaion was not flawless, however, as its workflow was often interrupted by manual steps (which meant it had to be monitored). Thus, ISO-NE started a project with Axceleon to develop an independent power system simulation platform for cloud computing that addresses that issue, fully automating processes after receiving the user input. In a 2019 study [304], they demonstrated that their platform managed to run multiple instances reaching near 100% CPU utilization of the instances launched for certain jobs and capable of many task computing and co-simulations.
Security becomes a major issue when using cloud facilities, and while the work above used a service-level security mechanism, Sarker and Wang [315] wanted to ensure security, assuming that the cloud in-house security infrastructure is compromised. They transform the ED problem into a Confidentiality-Preserving Linear Programming (CPLP) formulation [316,317] to achieve holistic security, such that all sensitive information remains unknown by competitors. The approach protects against attacks from passive and active entities on the Cloud (administrators and customers). It works by converting inequality to equality constraints and multiplying the coefficients by randomly generated positive real numbers twice (a mononomial matrix U then H), which are held privately. The resultant constraint matrix is sent to the Cloud, and the equipment information implied in the constraint coefficients remains obscure to any attackers. This work enhances the security matrix reduction of CPLP. Since the feasible region of the CPLP that is produced after those operations is the same as the original LP problem, solving for those new constraints, (the CPLP) yields the same solution as the LP. A test of the algorithm was performed on a 2383-bus Polish system, including 327 generators solved using CPLEX, comparing its performance on four different cloud instances. The method showed scalability, but it was not tested against a regular ED algorithm.
Another paradigm that Cloud facilitates is the Many Task Computing paradigm. It facilitates Co-Simulations, which involve solving many optimization problems and performing many studies apart and then connecting them. In [318], they perform a large co-simulation by decomposing a network into heterogeneous partitions that are unique to each other, creating different problems for each partition (e.g., generators, passive components, loads, etc.). The dynamic resource allocation ability fits well with large-scale co-simulations because 1-different components have different transient reactions. 2-They might require different timesteps depending on transient status. 3-Each problem could have a different formulation (NLP, MILP, etc.) and require different solution times. The paper demonstrates the achievable co-simulation performance and interfacing on the Cloud using existing commercial tools. For example, in one instance, the network was divided into multiple Simulink models, launching Matlab script simulations in different processes. In another trial, a compiled MPI C code was used, and Simulink executables were to run the simulation.
In [319], a fine scope was taken on task management of massive parallel contingency analysis using the Hadoop Distributed File System on the Cloud. They applied an N-1 transmission line contingency analysis and used the NR method to solve the power flows. First, the system distributes the contingency and other parameters to separate nodes such that each node solves a contingency case. What is unique about their job management scheme is that when the number of cores increases, the network bandwidth automatically increases as well, further increasing the performance. With this approach, they could perform a full AC contingency analysis for a real network in less than 40 s. Paper Contribution [307] First modular and hardware scalable implementation of parallel contingency analysis [314] Showcase the challenges and experiences by ISO-Newengland in moving to cloud [304] Many task computing management, near 100% utilization of CPU instances [315] ED specific cloud security algorithmic re-reinforcement [318] Large EMT co-simulation decomposing a network into heterogeneous partitions [319] Parallel contingency analysis using Cloud Hadoop Distributed File System

Smart Grid and Renewable Integration Applications
A summary of the contributions of studies covered in this section is availalbe in Table 12. In 2010, the literature almost completely shifted towards cloud computing and particularly an integrated framework combining smart grids with the Cloud, given the advent of AMI and big data at the dawn of that year. Several frameworks and models for smart grid co-ordination [320,321], and power system assessment [206] using Cloud appeared that year and later [322]. Ideas such as cloud-based demand response were being explored [323], and many papers suggested network architecture and control topologies that are realizable with cloud [324]. Concepts such as Cloud assisted IoT could help us achieve a much more efficient network of sensors for future power systems. An example of such a system is an architecture for RPL-based IoT application, which specifies the application of RPL focusing on reforming industrial operations through cutting-edge technologies [325]. The versatility of cloud infrastructure is an excellent complement to the smart grid future, and its applications are covered in this review as it involves distributed work, which is the core of parallel computation. The diagram in Figure 13 shows a typical multi-layer vision shared across the Cloud integrated smart grid literature.
Interesting network paradigms could be created given that computational resources can be flexible and scaled as cloud services provide. Sheikhi [326] explores the idea of an "Energy Hub" where customers can be active in Demand Response Management by reducing their direct electricity consumption and using the output of the combined heat and power from the energy hub that the gas supplier supplies. This does not change customers' electricity consumption level, but the demand has reduced from the electrical supplier's point of view. This Energy Hub + Smart meter is now a Smart Energy Hub, and single or multiple customers could share it. The problem was formulated in a game theory approach where the Smart Energy hub is a price anticipator, which tries to predict the consequences of its own action on the price and chooses the optimal load-shifting schedule to reduce the cost on customers on those bases. In a similar fashion to [327], the smart energy hubs read and control the outputs and send data to the Cloud to be aggregated and computed for decision-making, solving the game according to the cost function. They simulated their approach and showed that it resulted in a decrease in energy price compared to no Demand Side Management (DSM) game. They also compared the communication cost of direct messaging configuration vs. cloud configuration, where the Cloud showed lower cost, making the platform more suitable for such applications. For certain applications, such as DSM, fixed resources become an even greater issue as the amount of information processing and the computational requirement fluctuates based on the availability/flexibility of demand-side resources, which dictates the complexity of the problem at every instance. One of the options that are becoming more attractive is using cloud computing services, which could be much cheaper than expanding existing facilities. Using such services means an optimal allocation of the computational resource becomes much more crucial as cloud services are often billed based on the consumed resources, and pay-as-you-go terms [328]. In 2016 Z. Cao et al. [329] handled this issue with a source allocation algorithm that finds the optimal cloud computing resources for DSM instances. Commercial cloud computing resources differ from regular HPC clusters in the sense that there exists a greater variety in architecture, and the performance compactness might be lower than that of specialized HPCs used in research.
On a much more refined and more local scale, Wang et al. [330] attempt to decentralize the problem of Dynamic Economic Dispatch (DED), i.e., energy management in real time, by using inverter Digital Signal Processor (DSP) chips and cloud computing. The paper solves a multi-parametric quadratic programming optimization problem, which has been highly applied in the area of coordinated power system ED and TSO-DSO network coordinated dispatch. The solution involves two parts and is decomposed into two subproblems: 1-An offline calculation that Cloud carries out. 2-Real-time decision making that the DSP carries out. In the cloud computing part, distributed renewable generation and loads are forecasted to create piecewise expressions. Every 4 h, the expressions and information are sent to inverters so that the DSP chip can solve and optimize the output based on the real-time input of load and RE. The Cloud provides flexibility and handles the highest computation burden while simplifying the subproblem solved by the inverter.
Using a 14-node test case with PV, wind, Grid, diesel, and battery systems, the authors drew a comparison between their approach and a traditional implementation on an i7 regular laptop. Amazon Web Services (AWS) instances were created, and a real DSP chip was used. Their test showed that by moving offline computations to Cloud, it was solved within 34 µs compared to the traditional algorithm (372 ms). This is a colossal speedup, but it might not be fair, since the traditional algorithm creates a whole new deterministic problem every time it collects new values. The main gist is that it achieves the needed solution time for using DSP, since the calculations on the inverter must be lower than 100 µs so as not to cause issues and interruptions. Sharing the inverter's chip rather than adding local controllers lowers investment and maintenance costs, and the distributed nature makes it robust against single-point failure. However, care needs to be taken in the job timing such that the control functionality of the inverter is not interrupted.
Addressing the security concerns of the Cloud, F. Ma et al. also proposed a costoriented model to optimize computing resource allocation, specifically for demand-side management problems using simulated annealing and modified priority list algorithms [33]. The objective function parameters are based on actual Amazon cloud service pricing. This cost-oriented model was compared to a traditional O2O model, which allocates resources based on the peak computational load for the renting period. The proposed optimization method showed a significant cost reduction over the traditional source allocation method. There is a security concern that comes with outsourcing sensitive processes. For other players in a free market, there is an economic benefit to engaging in cyberattacks and accessing information from competitor rs' processes, such as ED, as it could help with their bidding strategies. In their study, they explain the use of the Virtual Private Cloud (VPC) scheme, which isolates their portion of the Cloud such that their resources are not shared with other organizations or applications, even if idle. It is supposed to increase the security of the outsourcing process. Yet, one can see how the spread of such a strategy would create an impediment to the scalability and efficiency of the Cloud. Paper Contribution [326] Explores the idea of an "Energy Hub" with active demand response [327] Energy hub cloud-based aggregated computation [329] Optimal cloud resource allocation based on available demand response [330] Decentralization of Dynamic Economic Dispatch using Digital Signal Processors on AMI [33] Demand side management using simulated annealing and modified priority list

Discussion
Parallel Applications for power systems started showing up around the late 1960s and early 1970s, around the same time when a commercial market for supercomputers and clusters was sprouting. At that stage, parallel computers were still experimental in nature, and their design often targeted a specific problem type or structure. Very few computers were suitable for power system studies, as most had low arithmetic precision that is equal to or less than 32-bit, which has been shown to be inadequate for direct solution methods [331].
At an abstract level, computer hardware architecture and its uses in power system studies are still the same. What used to be a "computer" or "host" is today's CPU, and SIMDs such as array processors were used just like today's GPUs would be used for power system studies. Algorithms that include diakoptics/tearing and tree graphs used to be a common theme at the start of vectorization and fine-grained parallelism, and it is still used in current GPU power system studies. Another example of similarity is that one of the issues faced at the time was that transient simulation timestep iterations sometimes required substantial logic and data to model for each node [332]. This means it would create a burden on the computer that hosts the array processes and could cause communication bottlenecks. This is very analogous to what happens today in GPU-CPU optimization algorithms. Ironically, S. Jose argued in 1982 [332] for the need for a general-purpose processor to tackle the previous issues since vector/array computers pose software hurdles and challenges that are too great to justify the enhancements achieved. Yet the same challenges are faced today, just at a different scale and magnitude (i.e., GPU-CPU interfacing/Cloud Implementations). A major shift in the field occurred around the 1990s; around the same time, general-purpose processors experienced significant innovation and cost reduction, and more parallel optimization algorithms started appearing. Studies in power system stability became abundant, and UC algorithms debuted with most papers using metaheuristics to solve the problem. While implementation would have been arguably doable, simulations of parallel hardware still existed because more care was placed on implementation optimization and ensuring the practicality/portability of the parallel algorithms. The meaning or extent of what is considered coarse-grained and fine-grained algorithms shifted over time.
The main direction for HPC incorporation in power system studies applications is moving towards real-time applications, much more so than offline applications. From the literature, it seems that renewable energy generation is the urgent driver for resorting to using AC formulations in real-time applications, followed by annual cost savings of replacing DC formulations. Benders decomposition and Lagrangian relaxation seem to be the most common combination in decomposing stochastic full AC problems. In larger systems, the application of parallel computation is clearly more advantageous, while in smaller systems, serial programming performs better or at least matches parallel computational approaches, mainly due to the communication overhead, as an increased number of processes means longer and more communication time between them. The extent of this effect depends highly on the strategies used in parallelization, as well as the cluster architecture and hardware used. GPUs, for example, exhibit extreme parallelism in processing architecture, yet have superior performances over more coarse CPU implementation shown previously [333]. Many organizations and research teams are developing public tools and frameworks to help incorporate HPC into power system studies. PNNL developed HIPPO [334], a tool to help grid operators tackle SCUC by leveraging optimization algorithms for HPC deployment. PNNL also initialized the development of another framework for power system simulations called GridPack, which falls under a larger suite called GridOPTICS [335]. While such tools facilitate the use of multi-processor parallelization, others such as Nividia CUDA [336] evolved GPUs-which have an immensely parallel architecture-to become easily programmable and sprouted the trend of using GPUs for scientific calculations showing a promising future. This trend can be observed in Figure 14. On-premise HPC is not future proof, as the grid organism keeps on evolving. A power system with n components, with each component having m states, can have m n over all possible states. The Grid is quickly adding more components in terms of quantity and variety, AMI, EVs, IoT, etc. All power studies will keep on growing, and control rooms and operators will also need immediate visualizations for easy information analysis. This means that power system operators will inevitably resort to Cloud services. However, cloud computing has many of its own challenges related to policy, security, and cooperation before any solid adaptation is made. The Optimal placement of data centers depends on various stochastic factors, and the lack of interoperability between providers of cloud services does not make this problem any easier. Regulatory compliance in terms of security and access is extremely hard to ensure. Data and process locations are unknown, and it becomes hard to investigate any dysfunction or intrusion. An efficient recovery mechanism needs to always be in place, and even if the host company's structure or ownership changes, long-term data viability must be in place. Moreover, the business case for moving to cloud computing needs to be established first, which is different for every entity, and it is difficult to predict the future costs of the services. Lots of preparation and tools need to be created locally to ensure stable operation and inseparability and security, such as handling software licensing issues and data coordination/processing. Computation aggregation evolved from a single processor to a processor and accelerator to a multi-processor system, Beowulf clusters, and multi-core processors, then a grid. Even at a small level, much like the way vector arrays and ALUs were added to processors, future CPUs and GPUs will be integrated into the same device, and the cycle continues. In the future, the Cloud will be an integral part of all operational entities, including the electrical industry. The future electrical Grid and Cloud will look very different from today; both will be dynamic and transactive and will have a reciprocal relationship in which the Cloud acts as the brain of the electrical network, and both will probably be driven by similar forces.

Software and Solvers
Commercial solver use can be traced back to the sixties with solvers such as the LP/90/94 [337] in conjunction with the development of the field of mathematical programming. Thus, today, there is an abundance of open-source and commercial solvers that race to employ the best techniques to solve standard problem formulations. This is evident in Figure 15, showing the variety of solvers used. To a large degree, commercial solvers simplified optimization for engineers, allowing them to focus on modeling, leading to the subfield of model decomposition. Nevertheless, a few challenges arise when using commercial solvers instead of employing a specific solution algorithm to the problem. The heuristics involved in solver design could create a vast disparity in performance, even for solvers within the same caliber solving the same type of problem. Additionally, the ability of a solver to identify and exploit the structure of the model heavily determines whether the model can be solved within a reasonable time. If the solver fails to accomplish this step, it might exhibit exponential growth in running time, as indicated by complexity analysis. Moreover, hidden bugs and issues with the source code of the solvers could exist. This is particularly true for commercial solvers.
Established commercial solvers with full-time development teams, such as CPLEX and Gurobi, exhibit a more comprehensive dictionary of identifiable problem structures to accommodate the large user base. They are robust, scalable, and capable of handling large search spaces with multithreading and HPC exploitation capabilities. Moreover, they are easy to install and interface with many programming languages. Figure 15 shows the hierarchy of occurrences of different solvers in the surveyed literature, and it can be observed that the previously mentioned solvers dominate the literature for the previously mentioned reasons. However, this should not deter us from experimenting with noncommercial solvers, as they may be superior for specific problems. It is also worth noting that all the well-established general solvers in the tier of CPLEX and Gurobi are CPU based, and none exploit GPUs in their processes, which is an area worth exploring [338].
Compiled and procedural languages, such as C and Fortran, dominate the literature due to their superior performance, as shown in Figure 16a. However, other multi-paradigm multi-paradigm and object-oriented languages (Matlab and python) started to infiltrate the literature due to their simplicity and convenient libraries. Other concurrent programmingoriented languages that might be of interest include Charm, Chapel, Cython, and Julia. Chapel has more advanced parallelism than Julia, while Julia has gained huge popularity since its recent release. Julia is expected to populate future literature due to its heavy emphasis on optimization and C-like performance. In terms of Parallel APIs, the fast adaptation of CUDA as shown in Figure 16b testifies the thirst for massive throughput and suggests that in terms of GPU-based power system optimization studies, there is more to come. There exist some integrated high-level frameworks designed to scale certain power system studies on HPC, such as BELTISTOS [129], which solves multi-period, securityconstrained, and stochastic OPF problems incorporating a multilevel solution strategy implemented in PARADISO. However, when compared to GridPack, this framework seems quite limited. As part of the HIPPO project mentioned earlier [199], PNNL developed the software framework GridPack TM that lowers the barrier for power system research and analysis in creating parallel models for HPC implementation [339].
Grid pack automates processes such as determining the Y-Bus of the network and solving PF equations, integrating algebraic differential equations, coupling simulation components, distributing network and matrix representations of the network, and employing linear and non-linear solvers. GridPack has a partitioner that partitions the network module buses into several processors, where it maximizes the interconnections between buses within the same processor and minimizes the ones between separate processors. It is based on "Parmetis" partitioning software, which achieves graph mesh partitioning, matrix reordering, etc. The matrices of the distributed matrices of the partitioned network are then distributed by mappers, which determine the contribution buses and branches from each processor by getting the dimensions and locations of elements. The math module generates those matrices and supplies linear and non-linear solvers built on the PETSc library. GridPack also has libraries of already developed, ready-to-use parallel applications. This includes different types of contingency analysis, initialization of dynamic simulation, power flow, and voltage stability analysis.

Challenges in the Literature
This review did not delve into deep comparisons between the different approaches due to the lack of standardization in various aspects of the studies, making it hard to draw meaningful comparisons. These challenges start with network topologies, sizes, a difference in hardware, and a mere lack of information. This is further discussed later in this subsection and can be observed in Table S1. One common point to note is the strong focus on simplifying power flow equations which the dominance of Gauss-Newton and LU-decomposition methods in the literature entails, as shown in Figure 17.  First, the variety of test cases between studies causes a solution universality issue. Many solution approaches exploit the structure and properties of the problem, such as sparsity and asymmetry, which vary with different network topologies and the number of bus interconnections. The effect of that was evident in [122] where the parallel FDPF scheme performed better on the Pan-European system than the Polish system because it had a more orderly topology. Some studies boast remarkable results on massive synthetic bus systems that are an augmentation of the same small bus system connected with tie lines, such as in [274], where the IEEE-39 case was copied and connected over 6000 times. This creates a level of symmetry that does not exist in natural systems, one that certainly affects the performance rendition. Moreover, some SIMD-based studies use made-up or modified power systems that are very dense, which suits what the hardware is designed for but presents a false or exaggerated sense of performance accomplishment, since real systems are generally sparse. Moreover, the speedup metric is often used unfairly. For example, in decomposition algorithms, solving subproblems is parallelized, then the performance comparison is drawn against the same serial algorithm (i.e., the scalability equation used but referred to as speedup). The parallel algorithm must be compared to the best serial algorithm that can achieve the same task to have a truly fair comparison. This alludes to the fact that there are several approaches for the speedup metric, which should be discussed in parallel algorithm studies for higher transparency, as shown in [256]. Additionally, there are cases in which superlinear speedup is achieved in some studies. This is often due to a playfield change and deep modifications in algorithms' features, leading to unfair comparisons. For example, when the parallel algorithm's cache memory usage is optimized or distributed, memory is used, allowing faster access to memory than serial processes.
The second challenge is the lack of details in the experimentation setup essential for replication. Some papers provide the model of the hardware used without the number of threads or processes used and vice versa. Others claim a parallel application without mentioning the communication scheme used or the number of subproblems created. Furthermore, some papers use or compare iterative algorithms such as "Traditional ADMM/LR" or other generic algorithms without providing the user-adjusted parameters/heuristics involved in tweaking such algorithms, making it impossible to replicate and verify the results. Additionally, barely any of the studies explicitly mention the number and type of constraints and variables generated by the formulation and test cases used. There is also no standardization in the metrics used to evaluate performance. Some studies use absolute speedup; some use relative speedups. Some compare their parallel approach to a different parallel approach, which is weak because the comparison loses its meaning if the proposed parallel approach is inferior to a sequential one. Some suffice by comparing their own parallelized approach to itself applied sequentially (scalability metric), which is problematic because a decomposed task could perform worse than a coupled one when applied sequentially.
The third challenge is the lack of parallel implementation of long-term grid planning models akin to Transmission System Planning or Generation System Planning or their combination. This type of study that helps us plan the transition to the future network struggles with a very small number of factors, accuracy, and uncertainty, and on small test cases that many studies started resorting to decomposition algorithms [340], mainly benders decomposition to decouple investment variables in the models. Yet, it seems that almost none of the studies use or resort to parallel and high-performance computing, which is a huge lost opportunity, as we need to add as many factors as possible to find the real optimal path of transitioning and investment given all the future policies technologies and scenarios that we can speculate at the moment.
The final significant challenge is the lack of standardization in the software and hardware used in the studies. The main issue with software is the variety of solvers used in papers that employ model decomposition schemes. Commercial solvers operate as black boxes that use different techniques, some of which are trade secrets. They are coded with different efficacies and have their bugs and problems, amplifying the confusion in interpretation.
The lack of hardware standardization in the literature has been highlighted since the 1990s [341]. Even very recently, within supposedly comparative work where four different parallel schemes were compared, each scheme was performed on a different supercomputer, and a different test case [248]. The single study that provided a meaningful cross-hardware comparison was [108]. They experimented with different CUDA routines on different but closely related NIVIDIA GPU models, showing that their approach was not superior on every model and proving the importance of hardware normalization.
One way to help tackle the challenge of hardware standardization is by using cloud service instances, such as AWS, as a benchmark, as they are easily accessible globally. Especially since Virtual CPUs (vCPUs) handle the standardization of the heterogeneous hardware and usage (an Elastic Compute Unit is equivalent to the computing power of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon Processor [342]. Moreover, it fits the industry's trend of shifting computational power to the Cloud. Additionally, these services come with metric tools that allow the user to look into the actual hardware usage and CPU and memory efficiency of their algorithm. This leads us to the last point: most previous studies merely glance over memory and treat memory resources as a bottleneck rather than a shared and finite resource. More focus on data and memory efficiency is needed. Future studies need to mention the maximum amount of data that needs to be processed and the actual memory usage of their approaches.

Future and Recommendations
At this point, the role cloud computation would play in power system HPC applications, and the future Grid is almost unquestionable due to its sheer scale and versatility. The Cloud acting as the brain of the Grid fits the notion of the living organism envisioned by many for the Future Grid. In a few studies, the Cloud has been recognized as the centerpiece for distributed computing paradigms such as fog computing and AMI resource leveraging. With that said, the increasing dependence on centralized cloud computing services is antithetical to the goal of energy decentralization/independence. Yet the trend could not be more natural, manifesting a cycle that almost recreates the onset of electrical generation monopolies. The decentralization of computational resources for power systems over micro users, however, does not seem to be that far-fetched of an idea, especially with currently existing applications such as blockchains and volunteer computing.
A lot of the earlier parallel computing studies for power systems modified or created the hardware around algorithms used [81]. This hardware manipulation to suit limited computational purposes is making a comeback due to moors law and other limitations. Analog computing hardware is making a comeback, as it is way more efficient in matrix multiplications. Rather than turning on and off, analog transistors encode a range of numbers based on the conductance magnitude, which is dictated by the gate. Their level of precision, however, makes them mainly suitable for AI chips and algorithms. Provided that their future precision becomes comparable to digital computers, they might be a contender for GPUs in mathematical optimization matrix operations.
The advancement in Quantum Computing research is creating a creeping disruptor of classical computation and algorithms as we know them. It has been shown that current Quantum Computers can solve combinatorial optimization problems that resemble ones related to energy system problems. Namely on the IBMs D-WAVE solving facility location problem [343]. The potential of applying quantum computation for dynamic stability simulations, OPF, and UC was discussed in the early 2000s [344]. In fact, the mixedinteger quadratic UC problem can be transformed into a Quadratic Unconstrained Binary Optimization (QUBO) by discretizing the problem space, a form that can be turned into a quantum program. In a merely experimental effort, this was actually implemented by [343]. The test systems were very small, from 3 to 12 units, and the solutions of the D-Wave were accurate for a smaller number of units, but quickly started deviating. The DC power flow was also implemented on the D-Wave with an HHL algorithm process on a 3-bus test case showing accuracy [345]. These might be the first experimental efforts employing quantum computation for operational power system studies.
Looking back at the studies, one can observe that our current parallel studies do not come anywhere near covering the potential variables of the future Grid. The models are highly simplified and filled with assumptions. The amount of detail, planning factors, and uncertainties are not close to what needs to be considered in grid modernization and future transition. Yet the accuracy and computational performance of the solutions are sometimes unimpressive. Even when decomposition techniques are used, and the created parallel structures are exploited with HPC, we are often faced with not-so-impressive outcomes, probably due to the lack of understanding and ingenuity in employing the parallel and decomposition and parallel techniques. The hardware that is used in many of the studies is often limited to a multi-core processor limiting the potential throughput. A complicated brain is needed to operate the complicated organism that is the future Grid. In the face of the Grid transformational changes, the power system community needs to start heavily adapting HPC techniques and utilization, incorporating them into future operational and planning studies. Moreover, a high level of transparency and collaboration is needed to accelerate the adaptation of parallel techniques, making such knowledge the norm in power system studies for the Future Grid.
The lack of standardization makes it very hard to replicate techniques from different works. Therefore, it is very important to have a standard framework and minimum information requirements in future power system study publications. This is especially important to ensure published models and techniques' validity, given the scientific reproducibility crisis [346][347][348]. The following points should serve as a guideline for future parallel studies in the field:

1.
A small validation test case, including any modifications, should be presented with all of the parameters and results. 2.
All the model expressions must be fully indexed without brevity or detail omissions. This includes both the model pre and post-decomposition. If possible, the full extended model specific to the validation test case should be provided.

3.
The pseudocode of the algorithm and flow chart demonstrating the parallel task splitting and synchronization should be included. The values of any tuning factors or heuristic parameters used should be provided.

5.
A test should be carried out on test cases incrementally increasing in size with a variety of network topologies to demonstrate the scalability and universality of the proposed method. An effort should be made to compare the speedup to the fastest known algorithm.

Conclusions
In this article, the beginnings of parallel computation and its appearances in power system studies were recounted, and the recent research and literature were reviewed. Several past reviews were cited, distinguishing this work from previously conducted research. The significance of hardware, paradigms and the history of parallel computing was then discussed. Studies of parallel power systems were summarized later, starting by reciting the development of studies up until the 21st century, with emphasis on the most impactful papers from the last decade. Studies included analyses of the stability of the power systems, state estimation and operation of the power systems, and market optimization. The state-of-the-art was also discussed, highlighting the need for standardization in the literature and showcasing the future of computation in power system studies. Given the grid modernization and transition towards net-zero emission, power systems are becoming increasingly complex, and resorting to high-performance and parallel computing and cloud computing has never been more important.
Supplementary Materials: The following supporting information can be downloaded at: www.mdpi. com/xxx/s1, Table S1: HPC in Power System Literature Data.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: