Exploiting Coarse-Grained Parallelism Using Cloud Computing in Massive Power Flow Computation

Abstract: We present a novel architecture for parallel contingency analysis that accelerates massive power flow computation using cloud computing. It leverages cloud computing to analyze large power systems under a wide range of potential contingencies. Contingency analysis is undertaken to assess the impact of failure of power system components; thus, extensive contingency analysis is required to ensure that power systems operate safely and reliably. Since many calculations are required to analyze possible contingencies under various conditions, the computation time of contingency analysis increases tremendously if either the power system is large or cascading outage analysis is needed. We also introduce a task management optimization to minimize load imbalances between computing resources while reducing communication and synchronization overheads. Our experiment shows that the proposed architecture exhibits a performance improvement of up to 35.32× on 256 cores in the contingency analysis of a real power system, i.e., KEPCO2015 (the Korean power system), by using a cloud computing system. According to our analysis of the task execution behaviors, we confirmed that the performance can be enhanced further by employing additional computing resources.


Introduction
As the complexity of power systems increases with the demand for electric power, the use of advanced power system equipment is becoming more widespread. Furthermore, power systems have become significantly more complicated following the integration of renewable energy resources, such as solar and wind [1]; the employment of high-speed operating devices, such as flexible AC transmission systems (FACTS) and high voltage direct current (HVDC) [2]; and the increasing prevalence of electric vehicles. In addition, although end users have only acted as electricity consumers in previous decades, smart grid environments enable them to take advantage of information technology (IT) and advanced power system facilities to become independent electric power suppliers [3]. In particular, the adoption of an energy storage system (ESS) provides opportunities for installing a variety of distributed power sources in the grid [4]. Hence, it is plausible that the complexity of power systems will increase continuously.
Contingency analysis, which is used to assess the effects of power system component failures in advance, is conducted to enable better power system management. Unfortunately, the number of computations required to calculate the power flow increases with the complexity of the power system [5]. There have been many studies on methods and applications of power flow computations,

• We propose a task management optimization to improve the performance and scalability of the parallel contingency analysis on a Hadoop-based cloud computing system.
• We obtained a performance improvement of up to 35.32× in the Amazon Web Services (AWS) environment.
The rest of the paper is organized as follows: In Section 2, we describe the background of our study and briefly explain cloud computing, parallelism granularity, and power flow computation. We present related works in Section 3. In Section 4, we introduce the overall architecture of our contingency analysis method, the Newton-Raphson (N-R) method, and the motivation for this study. In Section 5, we provide a detailed description of a massive cloud computing-based power flow computation. We evaluate the performance of our method in terms of speedup and present task execution behavior in Section 6. Finally, conclusions are provided in Section 7.

Cloud Computing
Cloud computing is an Internet-based computing technology that facilitates ubiquitous access to a shared and distributed pool of configurable computing resources such as storage, networks, servers, applications, and services [24-27]. Cloud computing allows users to process and store data in a privately owned cloud or on a third-party server located within a data center. It improves the performance of applications through enhanced manageability and reduced maintenance; thus, it also contributes to meeting additional user requirements. Figure 1 presents a brief conceptual diagram of cloud computing. The cloud computing environment primarily consists of infrastructure, platform, and software; on-demand computing resources are delivered to clients in the form of services such as Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS) [28,29].
• SaaS: Cloud-based applications that are run on remote computers in the cloud and are maintained by a third party. In general, users access these applications via their web browsers.
• PaaS: All of the resources are provided to deliver web-based cloud applications without the cost and complexity associated with establishing and maintaining the underlying hardware, software, provisioning, and hosting.

Parallelism Granularity
In parallel computing, parallelism granularity is a qualitative measure of the ratio of computation to communication [30]. Synchronization events usually separate periods of computation from periods of communication. The granularity of parallelism is mainly classified into two categories based on the amount of work that is performed by a parallel task: fine-grained and coarse-grained [31,32]. Figure 2 presents an example of obtaining fine- and coarse-grained parallelism from a single sequential program. In a sequential program, the tasks are executed from top to bottom in order over time.
As shown in Figure 2c, in fine-grained parallelism, a subprogram is divided into a large number of small tasks that are individually assigned to, and then executed by, multiple processors. As each of the tasks is small, and the tasks are distributed evenly across the processors, fine-grained parallelism significantly improves load balancing, but it also has the disadvantage of increasing communication and synchronization overheads. Hence, fine-grained parallelism is suitable for use with shared memory architectures, e.g., GPUs, that provide fast communication between multiple processors [33,34]. Programmers rarely make explicit use of this parallelism, so in general it is controlled by compiler technology [35].
Figure 2b illustrates that coarse-grained parallelism divides one program into large tasks. Since subprograms with a large number of tasks are assigned to different processors, this parallelism confers the benefits of low communication and synchronization overheads, but incurs load imbalance problems. A message-passing architecture is well suited to coarse-grained parallelism, since it requires a long time to transfer data between multiple processes [36]. Programmers need to separate and parallelize a program themselves based on a full understanding of its code; thus, it is somewhat challenging to exploit this parallelism [37].

Related Works
Typically, power system analysis is based on contingency analysis, which is in turn based on the results of power flow computations. As mentioned earlier, the time required to compute power flows increases with the size and complexity of the power system. In addition, the number of contingencies requiring analysis also tends to increase for stable power system operation.
As the performance of computers continues to improve, parallel processing is not necessary when conducting power flow computations for a small number of contingencies. However, when analyzing a large number of contingencies, it is advantageous to adopt parallel processing and thus reduce the computation time. For instance, it would be useful to perform N − k contingency analysis in parallel. The computational complexity of this analysis is very large, so it is appropriate to process the power flow computations in parallel.
The basic concepts related to the application of parallel processing to power flow computation are explained in [10,11]. Parallel contingency analysis in high performance computing environments is studied in [12], where an effective method for conducting contingency analysis using parallel processing is presented. In [13], a dynamic computational load balancing method is proposed using high performance computing machines for a large number of computations.
There have been attempts to improve the performance of parallel computing using multiple machines. Recently, several large-scale projects using parallel computing have been performed for power system analysis. In [16], contingency analysis is performed using a commercial program in a massively parallel environment. A total of 4127 contingencies were scaled to 4127 cores and the solution time is reduced significantly. In [17], a computational platform for dynamic security assessment of a large power system with a commercial analysis tool is presented as a part of the iTesla project; 10,000 cores are used in the case study in this project.
The need for high computing power to analyze a power system and its effects is demonstrated by the results of previous studies. However, there is a physical constraint whereby many large-scale connected computing machines are needed. In this paper, we propose a power system analysis platform using cloud computing as a solution to this limit, without the need for a commercial analysis tool, and we obtain reasonable results without physical constraints by using the cloud computing environment.
There have also been a number of previous studies on how to distribute calculations for parallel processing, and most previous research has focused on applying matrix calculation to parallel computing [10,18,19]. These studies are based on fine-grained parallelism that assigns some of the calculations. Another method for exploiting fine-grained parallelism is to perform power flow computations using GPUs. The GPU-based technique is an example of a parallel processing method that increases the speed of power flow computations. In [20], the basics of parallel computing using GPUs are established, and a method for utilizing GPUs is presented in [21]. A detailed implementation of power flow computation parallel processing done using GPUs is provided in [22,23], with reasonable results. Using GPUs is an effective method for reducing the time taken for power flow computations, but does not exploit the coarse-grained method of assigning all calculations. In addition, there is a limit to the increase in the speed because the overall effect of the parallel processing depends on the performance of the hardware. To overcome these weaknesses, we herein propose a method to perform power flow computations in a cloud computing environment without specific hardware configurations and exploit coarse-grained parallelism to perform entire computations in parallel.

Overall Architecture of Contingency Analysis
Algorithm 1 describes the structure of a conventional contingency analysis that performs the power flow computation of a power system for N different contingencies. The inputs of the algorithm are $Ctg$ and $D_{in}$, which denote a list of contingency data and cluster data of the power system, respectively. The output is $D_{out}$, which is a list of convergence results of the power system for all contingencies. $D_{tmp}$ indicates the result of applying one contingency to the cluster data and $D_{out_i}$ is the result of the i-th power flow computation.
This analysis algorithm mainly consists of the following two procedures. First, it updates the cluster data of a power system using the i-th contingency data, i.e., $Ctg_i$, at Line 5. Second, it obtains a convergence result of the updated cluster data by performing a power flow computation at Line 7. Since the algorithm repeats these two procedures for all contingency data, its execution time increases linearly with the number of contingencies, i.e., N.

Algorithm 1 Contingency analysis.
INPUT $Ctg$: list of contingency data, $D_{in}$: cluster data.
OUTPUT $D_{out}$: list of convergence results.
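The sequential loop that Algorithm 1 describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helpers `apply_contingency` and `toy_power_flow`, the representation of cluster data as a set of branch names, and the toy convergence rule are all hypothetical stand-ins for the real data structures and the N-R solver.

```python
from typing import Callable, FrozenSet, List

# Hypothetical stand-in: cluster data is just the set of in-service branch names.
ClusterData = FrozenSet[str]


def apply_contingency(d_in: ClusterData, ctg_i: str) -> ClusterData:
    """Return a copy of the cluster data with the outaged branch removed (D_tmp)."""
    return d_in - {ctg_i}


def toy_power_flow(d_tmp: ClusterData) -> bool:
    """Placeholder for the N-R power flow: report 'converged' only if the
    (assumed) critical branch L1 is still in service."""
    return "L1" in d_tmp


def contingency_analysis(
    ctg: List[str],
    d_in: ClusterData,
    power_flow: Callable[[ClusterData], bool],
) -> List[bool]:
    """Apply each contingency to the base case and collect convergence results."""
    d_out: List[bool] = []
    for ctg_i in ctg:                            # one pass per contingency, so O(N) runs
        d_tmp = apply_contingency(d_in, ctg_i)   # step 1: update the cluster data
        d_out.append(power_flow(d_tmp))          # step 2: run the power flow
    return d_out


d_in = frozenset({"L1", "L2", "L3"})
d_out = contingency_analysis(["L1", "L2", "L3"], d_in, toy_power_flow)
print(d_out)
```

Because each iteration starts from the unchanged base case `d_in`, the iterations do not depend on one another, which is the property the parallel version relies on.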

Newton-Raphson Based Power Flow Computation
The N-R method can be used to determine the magnitude and angle of the voltage at each bus in the power system. A brief description of the N-R method and its application is as follows.
The general form of the nodal equation for a power system network is:
$$I = Y_{bus} V \tag{1}$$
where $Y_{bus}$ is the admittance matrix of the network, $I$ is a current vector, and $V$ is a voltage vector. The $i$th equation, describing the flow at bus $i$ of an $n$-bus network, is:
$$I_i = \sum_{j=1}^{n} Y_{ij} V_j \tag{2}$$
A complex power equation can be expressed as follows:
$$S_i = P_i + jQ_i = V_i I_i^{*} \tag{3}$$
It is necessary to derive power balance equations before we can solve the power flow problem. The method for deriving power balance equations is discussed in detail in [38], so we only refer to the derived equations. At bus $i$, the power balance equations, Equations (4) and (5), express the active power $P_i$ and the reactive power $Q_i$ in terms of the bus voltages and the elements of $Y_{bus}$.
The N-R method is an iterative technique for solving a set of nonlinear equations. We can use Equations (4) and (5), which are nonlinear power balance equations, to calculate the power flow solutions. The N-R method is described in detail in [38]. The matrix form of the resulting linear set of equations, Equation (6), relates the power mismatches to the corrections of the voltage angles and magnitudes through $J$, the Jacobian. We can obtain the complete formulation of the power mismatch, Equation (7), using the power balance equations in Equations (4) and (5) and the N-R method from Equation (6).
The N-R method converges to the solution after fewer iterations than the Gauss-Seidel (G-S) method, and is more accurate. In addition, it exhibits better convergence during the computation, and is more efficient in large-scale systems. However, it usually requires more computation per iteration than other power flow computation methods, so the computation time per iteration is slightly longer. To improve the performance and feasible size of the computations by implementing parallel processing, we selected the N-R method for the power flow computations in this study.
Figure 3 shows the execution time of contingency analysis for a variety of power systems, including a real power system, i.e., KEPCO2015. For each power system, we performed the contingency analysis by applying the contingencies from each branch in turn. In other words, we computed the power flow of a power system with N − 1 branches, where N is the total number of branches of the system. Hence, we employed 17, 32, 64, 170, 306, and 2589 contingencies to evaluate the execution time of contingency analysis for the power systems IEEE14, IEEE24, IEEE57, IEEE118, IEEE300, and KEPCO2015, respectively. The specifications of the power systems are briefly outlined in Table 1, which lists how many buses, generators, and branches are involved in each system.
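For reference, one common textbook form of the power balance, Jacobian, and mismatch relations referred to above as Equations (4) to (7) is sketched below in polar coordinates. The notation ($V_i = |V_i|\angle\delta_i$, $Y_{ij} = |Y_{ij}|\angle\theta_{ij}$, scheduled injections $P_i^{sch}$ and $Q_i^{sch}$, and $n$ buses) is an assumption chosen for illustration and may differ from that of [38]; the equation numbers simply follow the references in the text.

```latex
% Requires amsmath. Assumed notation: V_i = |V_i| \angle \delta_i,
% Y_{ij} = |Y_{ij}| \angle \theta_{ij}, n = number of buses.
\begin{align}
P_i &= \sum_{j=1}^{n} |V_i|\,|V_j|\,|Y_{ij}| \cos(\theta_{ij} - \delta_i + \delta_j) \tag{4} \\
Q_i &= -\sum_{j=1}^{n} |V_i|\,|V_j|\,|Y_{ij}| \sin(\theta_{ij} - \delta_i + \delta_j) \tag{5} \\
\begin{bmatrix} \Delta P \\ \Delta Q \end{bmatrix}
  &= J \begin{bmatrix} \Delta \delta \\ \Delta |V| \end{bmatrix},
  \qquad
  J = \begin{bmatrix}
        \partial P / \partial \delta & \partial P / \partial |V| \\
        \partial Q / \partial \delta & \partial Q / \partial |V|
      \end{bmatrix} \tag{6} \\
\Delta P_i &= P_i^{sch} - P_i, \qquad \Delta Q_i = Q_i^{sch} - Q_i \tag{7}
\end{align}
```

Equations (4) and (5) follow from taking the real and imaginary parts of $P_i - jQ_i = V_i^{*} \sum_{j} Y_{ij} V_j$. Each N-R iteration evaluates the mismatches in (7), solves the linear system (6) for the corrections, and updates the voltage angles and magnitudes until the largest mismatch falls below a tolerance.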

Motivation
These results show that the contingency analysis takes less than 1 s in most of the IEEE systems, except for IEEE300, which requires up to 10.914 s. However, we obtained a completely different result for KEPCO2015, which represents the state of a real power system located in South Korea in 2015. Due to its large-scale cluster data including buses, branches, generators, and so on, the contingency analysis requires a considerably long execution time, i.e., 1636.589 s. Further, if we assume that contingencies from two or more different branches can take place at the same time, the execution time increases significantly because the number of contingencies to be analyzed grows combinatorially. The contingency analysis in most real power systems consumes a significantly large amount of execution time, and acceleration of this analysis should be studied.
In this paper, we propose a parallelized contingency analysis using cloud computing that exploits the coarse-grained parallelism arising from the fact that the contingencies are independent of each other.
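The growth in workload when simultaneous outages are considered can be illustrated with a short calculation. The sketch below rests on two assumptions that are not stated in the text: the 2589 single-branch contingencies of KEPCO2015 are treated as the candidate set for N − k combinations, and every power flow is assumed to cost the average time of the sequential run (1636.589 s over 2589 contingencies).

```python
from math import comb

N_SINGLE = 2589              # single-branch contingencies analyzed for KEPCO2015
SEQUENTIAL_TIME_S = 1636.589  # reported sequential execution time for those 2589 runs


def count_contingencies(n: int, k: int) -> int:
    """Number of distinct sets of k simultaneous outages among n candidates."""
    return comb(n, k)


# Assumption: every power flow costs the average time of the sequential run.
avg_time_s = SEQUENTIAL_TIME_S / N_SINGLE

counts = {k: count_contingencies(N_SINGLE, k) for k in (1, 2, 3)}
est_hours = {k: counts[k] * avg_time_s / 3600 for k in counts}

for k in counts:
    print(f"N-{k}: {counts[k]} contingencies, about {est_hours[k]:.1f} h sequentially")
```

Under these assumptions, even the N − 2 case is several orders of magnitude larger than the N − 1 case, which is why the sequential approach does not scale.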

Massive Power Flow Computation Using Cloud Computing
In this section, we describe how to perform massive power flow computations using cloud computing. We explain how to obtain coarse-grained parallelism of the power flow computation with the MapReduce framework. In addition, we present a case study that performs parallel contingency analysis on a Hadoop platform [39].

Exploiting Coarse-Grained Parallelism with MapReduce
Some previous works on power flow analysis parallelization have obtained fine-grained parallelism by parallelizing a conjugate gradient algorithm for solving a linear system, i.e., a system of the form Ax = b, which is the most time-consuming part of the power flow analysis. However, performance improvements derived from fine-grained parallelism are restricted by computing resource limitations if a large power flow analysis, e.g., contingency analysis, is performed. In addition, some other studies have proposed exploiting coarse-grained parallelism by independently performing multiple contingency analyses on multiple processors in parallel. Although they employ well-known parallel programming models such as OpenMP (Open Multi-Processing) and MPI (Message Passing Interface), it is still very challenging for users to obtain significant coarse-grained parallelism due to inevitable load imbalance and communication overhead.
We propose to resolve this problem by exploiting coarse-grained parallelism within a MapReduce framework on cloud computing. We can significantly improve the performance of a massive power flow analysis by simultaneously executing multiple independent tasks via the enormous computing capability of cloud computing. In particular, since most recent MapReduce-based cloud computing environments also provide GPU-based computing capabilities that make it easy to implement fine-grained parallelism [40], our method can be applied orthogonally to existing fine-grained parallelism to maximize performance.
MapReduce is one of the most prominent programming models used to produce and manipulate large datasets by executing parallel and distributed algorithms on computing clusters. In general, a MapReduce system is designed to be maintained on many distributed servers, to perform multiple tasks in parallel, to manage communication between the different components of the system, and to support redundancy and fault tolerance. Figure 4 presents a brief overview of a MapReduce program. A MapReduce program principally consists of a map function that filters and sorts the input data, and a reduce function that integrates the intermediate data from the map function into the output data. Each split input item is assigned to a mapper by a task scheduler, and the mapping results are output to intermediate files. The files are shuffled, partitioned, and then input to the reducers, which assemble the output data. To obtain significant coarse-grained parallelism, we need to divide the input data so as to minimize communication and synchronization overheads, so that each of the mappers performs a large number of tasks that are independent of each other.

Case Study: Hadoop-Based Parallel Contingency Analysis
Our parallel contingency analysis program, which uses Hadoop infrastructure to perform massive power flow computations, is outlined schematically in Figure 5. The input data, which are stored in the Hadoop distributed file system (HDFS), comprise two types of data: a set of contingency data and a set of cluster data. Each contingency is assigned to a different mapper, but the cluster data are shared by all of the mappers.
First, each mapper updates the cluster data with the given contingency data, and then performs power flow analysis to estimate the convergence of the power system specified by the cluster data. The mapper returns a key and value pair, i.e., a contingency ID (CID) and the maximum mismatch, for each iteration of the N-R algorithm. In our implementation, the N-R algorithm is implemented in C/C++ and is accessed using the Java Native Interface (JNI) in the mapper, i.e., the map method. Second, the intermediate results from the mappers are delivered to reducers after being shuffled and partitioned. Then, each reducer takes a key and a list of values, i.e., a CID and a list of mismatch values, and estimates whether the cluster converges under the contingency that is indicated by the CID. The reducer also returns a key and value pair, i.e., a CID and the result of the convergence analysis. Finally, the results of the convergence analysis for all contingencies are merged together and written to an output file.
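The data flow described above, from mapper through shuffle to reducer, can be mimicked in memory with a few lines of plain Python. This is only a sketch under stated assumptions and does not use Hadoop: the per-iteration mismatch values are supplied directly instead of coming from an N-R solver, and both the tolerance `TOLERANCE` and the convergence rule in `reducer` are hypothetical, since the text does not specify them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

TOLERANCE = 1e-6  # assumed convergence threshold; not given in the text


def mapper(cid: str, mismatches: Iterable[float]) -> Iterator[Tuple[str, float]]:
    """Emit one (CID, maximum mismatch) pair per N-R iteration.

    The real mapper would update the cluster data and run the N-R solver;
    here the per-iteration mismatches are passed in as a stand-in.
    """
    for mismatch in mismatches:
        yield cid, mismatch


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group intermediate values by key, as the framework does between phases."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for cid, mismatch in pairs:
        grouped[cid].append(mismatch)
    return dict(grouped)


def reducer(cid: str, mismatches: List[float]) -> Tuple[str, bool]:
    """Assumed rule: the case converged if some iteration fell below TOLERANCE."""
    return cid, min(mismatches) < TOLERANCE


# Toy mismatch histories for two hypothetical contingencies.
toy_input = {
    "CID-1": [0.5, 1e-2, 1e-8],  # mismatch shrinks: converges
    "CID-2": [0.5, 0.7, 1.2],    # mismatch grows: diverges
}

intermediate = [pair for cid, hist in toy_input.items() for pair in mapper(cid, hist)]
results = dict(reducer(cid, values) for cid, values in shuffle(intermediate).items())
print(results)
```

Using the minimum rather than the last mismatch keeps the reducer independent of the order in which values arrive after the shuffle, which a MapReduce framework does not guarantee by default.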

Performance Evaluation
In this section, we describe the experimental environment in detail and evaluate the performance of our Hadoop-based parallel contingency analysis.

Experimental Environment
In Table 2, we describe the organization of the Hadoop-based cloud computing system that we employed to assess the performance of our proposed parallel contingency analysis. We assembled the Hadoop system using remote computing resources from Amazon Elastic Compute Cloud (EC2), which is provided by AWS. We used Cloudera's open-source Apache Hadoop distribution, i.e., CDH, which provides Apache Hadoop components such as Hadoop YARN and HDFS.
In detail, the Hadoop system comprises a single NameNode and 16 DataNodes. The NameNode is designed to provide a unified namespace that the HDFS uses to manage access to the data files in the distributed file system. It also supports file system operations, such as open, close, and rename, and maps split data files, i.e., data blocks, to DataNodes. The DataNodes redundantly store the data blocks in their storage, execute read and write operations on the blocks, and run the map and reduce procedures.
For the evaluation, we designed each of the DataNodes to utilize 2, 4, 8, and 16 cores of an Intel Xeon E5-2686 v4. Hence, we performed the experiments with a large number of cores, i.e., 32, 64, 128, and 256, by using 16 DataNodes together. Further, since we adopted M4 instances of AWS EC2, we expected to use different network bandwidths of 450, 750, 1000, and 2000 Mbps when using 32, 64, 128, and 256 cores, respectively. We performed 16 reduce tasks, which we started after more than 90% of all map tasks were complete.

Task Management Optimization
As mentioned in Section 2.2, to maximize the performance improvement from coarse-grained parallelism in parallel contingency analysis, we need to minimize load imbalance among computing resources while reducing the overheads of communication and synchronization [41]. We propose a task management optimization that splits all contingencies in advance into as many chunks as there are available cores, and then assigns each chunk to a map task to be executed on a single core. By adopting this optimization instead of employing a single map task for each contingency, we can mitigate the performance degradation caused by task management overhead on a Hadoop platform.
Figure 6 shows the performance improvement of our parallel contingency analysis of KEPCO2015 on 32, 64, 128, and 256 cores with respect to conventional contingency analysis. The gray bar represents the speedup when adopting task management optimization and the white bar shows the results with no optimization. Without the optimization, we obtained speedups of 1.78×, 3.26×, 6.62×, and 12.85× on 32, 64, 128, and 256 cores, respectively. From 64 to 128 cores, the speedup improved by slightly more than a factor of two (3.26× to 6.62×) because the network bandwidth increased from 750 to 1000 Mbps.
Furthermore, the task management optimization strategy significantly enhanced the performance of the parallel contingency analysis. We improved the performance by up to 35.32× on 256 cores and achieved speedups of 8.00×, 17.98×, and 27.28× on 32, 64, and 128 cores, respectively. These results confirm that our proposed architecture yields significant performance improvements by exploiting the coarse-grained parallelism of contingency analysis using cloud computing. In particular, as the speedup continues to increase with the number of cores used, we expect that additional computing resources will lead to further performance enhancements.
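The chunking step of the task management optimization described above can be sketched as follows. The text only states that the contingencies are split into as many chunks as there are cores; the contiguous, near-equal split used here and the function name `split_into_chunks` are assumptions made for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(items: Sequence[T], n_chunks: int) -> List[List[T]]:
    """Split items into n_chunks contiguous chunks whose sizes differ by at most one."""
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(len(items), n_chunks)
    chunks: List[List[T]] = []
    start = 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


# 2589 contingency IDs spread over 256 cores, one chunk (map task) per core.
contingency_ids = list(range(2589))
chunks = split_into_chunks(contingency_ids, 256)
sizes = [len(chunk) for chunk in chunks]
print(len(chunks), min(sizes), max(sizes))
```

With one map task per chunk, the framework schedules 256 map tasks instead of 2589, so the per-task start-up and bookkeeping cost is paid far less often, while chunk sizes that differ by at most one contingency keep the load roughly balanced.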
Figure 7 shows the task execution behavior of our parallel contingency analysis of KEPCO2015, both with and without applying the task management optimization. Each line indicates the total completed tasks over time for the different numbers of cores, i.e., 32, 64, 128, and 256 cores. The completed tasks involve both map and reduce tasks. As shown in Figure 7a, when the optimization was not adopted, the number of completed tasks became identical to the number of contingencies in KEPCO2015, i.e., 2589, and the total execution time decreased in proportion to the number of employed cores, which increased from 32 to 256.
Figure 7b shows that the number of completed tasks becomes equal to the number of occupied cores through task management optimization. This is because we created and executed as many map tasks as the number of cores, as mentioned in Section 6.2. We can also confirm that the optimization reduced the overall execution time for completing all tasks significantly. When using 256 cores, the number of completed tasks increased relatively sharply in the first round, i.e., between 19 s and 29 s, and then rose gradually over the next few seconds of the second round. In detail, after 90% of all the map tasks were completed in the first round, 16 reduce tasks were created and executed in the second round. Finally, from 33 s onwards, the number of completed tasks started to rise quickly again as the reduce tasks began to finish. Although different numbers of cores were used, similar behaviors are shown in each graph.
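The speedups reported for Figure 6 can also be read as parallel efficiency (speedup divided by core count) and as the gain of the optimized run over the unoptimized one. The short script below derives both from the published numbers; the derived quantities are a reading aid computed here and are not results reported by the authors.

```python
CORES = [32, 64, 128, 256]
SPEEDUP_PLAIN = [1.78, 3.26, 6.62, 12.85]   # without task management optimization
SPEEDUP_OPT = [8.00, 17.98, 27.28, 35.32]   # with task management optimization


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup achieved on the given number of cores."""
    return speedup / cores


rows = []
for cores, plain, opt in zip(CORES, SPEEDUP_PLAIN, SPEEDUP_OPT):
    rows.append(
        {
            "cores": cores,
            "efficiency_plain": parallel_efficiency(plain, cores),
            "efficiency_opt": parallel_efficiency(opt, cores),
            "gain": opt / plain,  # how much the optimization adds at this core count
        }
    )

for row in rows:
    print(
        f"{row['cores']:>3} cores: "
        f"efficiency {row['efficiency_plain']:.3f} -> {row['efficiency_opt']:.3f}, "
        f"gain {row['gain']:.2f}x"
    )
```

Read this way, the optimization helps at every core count, while the optimized efficiency is lower at 256 cores than at 32 cores, so the speedup grows with core count but less than linearly.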

Conclusions
In this paper, we propose a novel parallel architecture for accelerating contingency analysis in real power systems. To the best of our knowledge, our study is the first to suggest a cloud computing platform for massive power flow computations by exploiting coarse-grained parallelism. We also propose a task management optimization that divides all contingencies in advance into as many chunks as there are available cores, and then assigns each of the chunks to a different core. This optimization minimizes load imbalances between computing resources while reducing the overheads imposed by communication and synchronization.
As a result, we obtained a significant performance improvement in the contingency analysis of a real power system, i.e., KEPCO2015, by using a Hadoop-based cloud computing system with 1 NameNode and 16 DataNodes on Amazon EC2. The experimental results show that our proposed contingency analysis achieved speedups of 8.00×, 17.98×, 27.28×, and 35.32× on 32, 64, 128, and 256 cores, respectively, when adopting task management optimization. Furthermore, from task execution behaviors, we can expect to enhance the performance further by employing additional computing resources, since the total execution time of tasks is decreased in proportion to the number of cores used.
Unlike conventional fine- and coarse-grained parallelization technologies, which are ultimately limited by hardware resources such as GPUs and local servers, our proposed architecture can tremendously accelerate the analysis of large power systems that require massive power flow computations by drawing on the elastically scalable resources of cloud computing.