Fast and Accurate Solution of Integral Formulations of Large MQS Problems Based on Hybrid OpenMP–MPI Parallelization

Featured Application: Optimized parallel solution of large magneto-quasi-static problems. Abstract: This paper proposes an optimal strategy to parallelize the solution of large 3D magneto-quasi-static (MQS) problems, by combining the MPI and OpenMP approaches. The studied numerical problem comes from a weak-form integral formulation of a MQS problem and is ﬁnally cast in terms of a large linear system to be solved by means of a direct method. For this purpose, two main tasks are identiﬁed: the assembly and the inversion of the matrices. The paper focuses on the optimization of the resources required for assembling the matrices, by exploiting the feature of a hybrid OpenMP–MPI approach. Speciﬁcally, the job is shared between clusters of nodes in parallel by adopting an OpenMP paradigm at the node level and a MPI one at the process level between nodes. Compared with other solutions, such as pure MPI, this hybrid parallelization optimizes the available resources, with respect to the speed, allocated memory, and the communication between nodes. These advantages are clearly observed in the case studies analyzed in this paper, coming from the study of large plasma fusion machines, such as the fusion reactor ITER. Indeed, the MQS problems associated with such applications are characterized by a huge computational cost that requires parallel computing approaches.


Introduction
Low-frequency electromagnetic problems in the so-called magneto-quasi-static (MQS) limit (such as, for instance, eddy current problems) can be conveniently solved through numerical models coming from integral formulations, with the advantage of limiting the meshing to the conducting regions only. For real-world applications, however, the final formulation likely leads to large linear systems of equations, and suitable strategies are required to lower the computational cost of the numerical solution, such as those based on fast Fourier transform [1], fast multipole method [2], and H-matrix [3]. However, there are cases of interest where the dimensions are not so large to require the use of such techniques. For such cases, a direct solution method is preferable, since it provides intrinsic robustness and accuracy, with a modest increase in computational cost.
The final goal of this work is the optimization of the solution of integral formulations of eddy current problems solved with direct methods, in applications where long transients must be studied, and where an effective pre-conditioning is not possible. The applications targeted here come from the world of the nuclear fusion machines, and specifically refer to the evaluation of the current density induced in the conducting structures as a consequence of an electromagnetic transient associated with plasma disruptive events. The proposed optimization is based on the parallelization of the tasks associated with the numerical implementation of the integral formulation, based on a hybrid OpenMP-MPI approach.
The classical approach in parallel computing was based on the assumption that the systems were homogeneous, with each single node sharing the same characteristics: in this case, the computation time required by each node would be the same, if the same tasks were assigned to each single node. The present approach is based on the use of computational resources with high performance, made by cluster multicore systems [4], where variability and heterogeneity between the cores and between nodes introduces major issues, such as load imbalance, that strongly affect the effectiveness of parallel computation [5]. Optimizing the resources is of a primary importance for an efficient use of supercomputers, that are usually shared among several users. Traditional load-balancing techniques, such as those based on global re-partitioning of the threads, require a priori estimations on the time needed for completing each single task or, alternatively, require a dynamical reallocation of the loads based on a real-time check of the status of the nodes [6].
More recently, this problem has been addressed by using hybrid parallel programming paradigms, like those that adopt both a parametrized communication paradigm, such as the message passing interface, MPI [7], and an application programming interface, such as OpenMP [8,9]. Such a hybrid programming approach has been used in several fields, such as computational fluid dynamics [10,11], chemical simulations [12,13], geotechnics [14], high efficiency video coding [15], and electromagnetic simulations [16][17][18], due to its very interesting performance [19]. As pointed out in [19,20], the advantages of using a hybrid OpenMP-MPI programming approach are many, as follows: (i) suitability to the architecture of modern supercomputers, with interconnected shared memory nodes; (ii) improved scalability; (iii) optimization of the total consumed memory; (iv) reduction in the number of MPI processes. On the other hand, a few disadvantages can be observed, as follows: (i) complexity of the implementation; (ii) total gains in performance. These two issues require special care in the implementation of a hybrid approach, to handle the added extra costs to the existing code.
The main contribution of this paper is that of defining and assessing the implementation of such an approach when solving the integral formulation of eddy current problems by means of a Galerkin-based finite elements method, such as CARIDDI code, which is widely used in the fusion community [21]. The proposed solution fully exploits the main advantages of both paradigms: the MPI approach is more efficient in the process-level parallelization, whereas the OpenMP approach optimizes the lower level parallelization. The performance improvement is not only related to memory saving, but also to the reduction in the communication [22], with a consequent dramatic lowering of the latency time for the job to go in running state. The effectiveness of the proposed approach is witnessed by the obtained speed-up values for the analyzed real-world applications, about ×30 and ×50.
The paper is organized as it follows: Section 2 provides a short description of the numerical formulation of the MQS problem, and the tasks associated with building and solving the final numerical system are discussed. In Section 3, two different approaches for parallelizing the assembly of the system's matrices are presented and compared: a pure MPI approach and the proposed hybrid OpenMP-MPI one. In Section 4, first a benchmark test to validate the hybrid approach is provided. Then, the analysis of case studies referred to large nuclear fusion machines is carried out, comparing the computational performance of pure MPI and OpenMP-MPI approaches.

Numerical Formulation of the Magneto-Quasi-Static Problem
An electromagnetic problem in the so-called magneto-quasi-static problem (MQS) limit is studied here; namely, a low frequency eddy currents problem in non-magnetic conductors: these are typical conditions encountered in the applications analyzed in this paper, that are related to the plasma fusion machines, such as the fusion reactor ITER [23]. The mathematical model may be cast in integral form by introducing a vector potential, A, and a scalar one, φ, to define the electrical field, E, and the magnetic one, B, as follows: In this way, Faraday's law is implicitly imposed. The vector potential is calculated from the current density, J, flowing inside the conductors' region, V C , and from the density, J S , of the currents located outside that region, as follows: The integral formulation of the problem comes from the imposition of the constitutive equation J = σE in the ohmic conductors by means of the weighted residual approach, as follows: The subspace S contains the functions solenoidal in V C , satisfying the interface conditions for the current at the boundary: S = J ∈ L 2 div , ∇ · J = 0 in V C , J ·n = 0 on ∂V C . By replacing (1) and (2) in (3), the final weak form is obtained, as follows: The above formulation is implemented in the code CARIDDI [21], that is here adopted to perform the numerical analysis of the considered case studies. The numerical model implemented in CARIDDI solves Equation (4) by applying Galerkin's method, starting from the following decomposition in edge elements, N k , of the unknown current density, as follows: By using (5) in (4), under the assumption of time-harmonic signals, the vector, I, of the unknown coefficients, I k , can be calculated by solving the following linear system: The resistance R and inductance L matrices in (6) are calculated by means of the following integrals in the conducting regions, V C : whereas V 0 is a known vector that is computed from the external imposed sources, as follows: Once the numerical problem (6) is solved, the flux density B is obtained at any given position r as B = Q·I, where the generic entry of the matrix Q is defined as follows: The solution of the linear system (6) via a direct method requires a first step of matrix assembly, followed by a final step of matrix inversion. Specifically, the following four tasks may be identified: (a) assembly of the known-terms vector, V 0 ; (b) assembly of the resistance and inductance matrices, R and L; (c) assembly of the flux density matrix, Q; (d) inversion of the impedance matrix, Z, defined as in (6), via factorization and back substitution.
In real-world applications such as those analyzed in this paper, related to fusion machines, the high computational burden requires the use of efficient parallel computing techniques, as discussed in the next section. In this paper, we focus on the parallelization of the assembly phase; whereas, the matrix inversion is here performed by using the best available routines, such as the Scalapack one [24]. The "inversion phase" (task (d)) is usually the most time consuming when using a direct solver, but it is worth noticing that in many practical cases also tasks (a)-(c) in the "assembly phase" can be very demanding of both CPU and RAM resources. Indeed, if we denote with N do f the degrees of freedom of the numerical problem, the time required for the inversion (t i ) is proportional to N 3 do f , t i = k i N 3 do f ; whereas, the assembly time (t a ) is proportional to N 2 do f , i.e., t a = k a N 2 do f . Following this consideration, one can be brought to affirm that the assembly time is lower than the inversion one. However, in many applications, the actual CPU-times for the two phases are comparable, since k i depends only on the computational resources, whereas k a depends on the accuracy. Indeed, there are some cases of practical interest among the MQS problems where the required accuracy imposes values of k a that may lead to values of t a comparable to, or even greater than, t i . This happens, for instance, when the density of the electrical current varies rapidly in space and in time and consequently a high accuracy in the evaluation of the inductance matrix is required. Let us note that a higher accuracy may be achieved in two ways, as follows: (i) by refining the meshing, hence increasing N do f ; (ii) by using the same mesh (hence the same N do f ), but increasing the number of integration Gauss points, n. It is evident that solution (i) is not preferable when direct methods are used, given the cost of inversion. Let us then study the sensitivity of the assembly cost to increasing the values of n. The simulation times required for the inversion of (6) and for assembly of the matrix, L, are reported in Figure 1, as a function of N do f . For a lower accuracy (n = 1), t i is always greater than t a , but for a better accuracy (n > 1), the assembly time is longer than the inversion one for N do f < N * do f , where N * do f increases with n-it is 6500 for n = 270,000 for n = 3, and 351,000 for n = 4.

Parallelization Strategy Based on MPI Approach: Description and Limits
The numerical model of the MQS problem, described in the previous section, has been implemented in a parallel-computing scheme, based on a pure MPI approach, which is here briefly summarized (further details are given in [26]).
As for task (a), when assembling the known-terms vector, , the sources are the impressed currents, given in an external mesh (the sources mesh). The computation of the entries of requires a double loop: the first on the mesh elements, the second one on the external mesh (as shown in Algorithm 1). Actually, is a small size vector, and hence, although each MPI process allocates the whole vector in V0_loc, only a part of its memory is used. Then, a reduction operation is required in order to gather the final vector.
Algorithm 1 Evaluation of V0, pure MPI approach // Initialization Split Source Points between MPI processes // Parallel pure MPI computation for each MPI process do for each iel mesh element point do for each iel0 source mesh element do compute V0 contribution, between iel and iel0 end for end for end for // All reduce of the local contribute on the global V0 mpi_allreduce(V0_loc,V0_global) The assembly cost for task (b) is mainly related to the inductance matrix L, being R a sparse matrix [26]. The choice of the most suitable parallelization approach for computing the L matrix comes from a trade-off between the needs of optimizing the costs of computation, memory writing, and communications. For this purpose, two main strategies have been in the past investigated by the authors [26]: in the first one, the element by elements interactions needed to compute the integrals in (8) are distributed all over available MPI processes; in the second one, each process takes care of building a sub-block of the L matrix. Overall, numerical experiments in [26] show that the first approach proved to be the Another source of computational burden is the calculation of the matrix L, i.e., task (b): the double integration in (8) requires a proper handling of the singularities in the kernels, to retain the property of L to be positive definite. An adaptive procedure for integrating these singularities is given in [25].

Parallelization Strategy Based on MPI Approach: Description and Limits
The numerical model of the MQS problem, described in the previous section, has been implemented in a parallel-computing scheme, based on a pure MPI approach, which is here briefly summarized (further details are given in [26]).
As for task (a), when assembling the known-terms vector, V 0 , the sources are the impressed currents, given in an external mesh (the sources mesh). The computation of the entries of V 0 requires a double loop: the first on the mesh elements, the second one on the external mesh (as shown in Algorithm 1). Actually, V 0 is a small size vector, and hence, although each MPI process allocates the whole vector in V0_loc, only a part of its memory is used. Then, a reduction operation is required in order to gather the final vector. The assembly cost for task (b) is mainly related to the inductance matrix L, being R a sparse matrix [26]. The choice of the most suitable parallelization approach for computing the L matrix comes from a trade-off between the needs of optimizing the costs of computa-tion, memory writing, and communications. For this purpose, two main strategies have been in the past investigated by the authors [26]: in the first one, the element by elements interactions needed to compute the integrals in (8) are distributed all over available MPI processes; in the second one, each process takes care of building a sub-block of the L matrix. Overall, numerical experiments in [26] show that the first approach proved to be the most effective one, given the characteristics of the transient electromagnetic problems of interest in the study of nuclear fusion machines. For this reason, we have adopted such an approach, that is here implemented by means of a hybrid OpenMP-MPI paradigm.
Looking at its definition in (8), the assembly of L requires two nested loops on the mesh elements, to compute the element-element interactions, as shown in Algorithm 2. To this end, a critical point is the need to accumulate and store these interactions in a correct way. Three dummies' memories are defined: -M DMESH , used to store the geometrical mesh information. These data have a dimension of the order of magnitude of the number of the mesh elements; -M DL , used to temporarily store the entries L ij produced during the main loop of element-element interactions. This memory has the same dimension of M LOC , that is the chunk memory required for matrix storage at each node; -M DEE , used to carry on the main loop of element-element interactions: such a memory is of the order of N 2 do f ,el , being N do f ,el the number of degrees of freedom per element. As for task (c), the assembly of the flux density matrix, Q, is obtained through a double loop, the first on the mesh elements, and the second one on the requested field points, as shown in Algorithm 3. In a pure MPI approach, the set of field points are equally distributed among MPI processes. Memory is then allocated in each MPI process. As a general comment, we stress that with the pure MPI paradigm, due to the limited size of per node memory and to the need of each MPI process to allocate not only the memory required to store the local portion of the matrices but also the dummy memory required to carry on the computation, it may not be possible to launch a number of MPI process equal to the number of the physical cores available at the node. For the above reasons, many of these cores are forced to stay idle during the job execution because there is no more space to allocate. This fact limits the maximum achievable speed-up.
For example, to compute the L matrix, the mesh information should be available at each MPI process in the dummy memory MdMesh defined above: in practical applications, this memory likely reaches the dimensions of some GBs. In the following, we refer to the typical case of homogeneous supercomputing systems, made by nodes that are equal in terms of core number, memory, and operating frequency. In order to assess the performance, let us here define the following quantities: In the ideal case, the maximum computation speed-up would be equal to the total number of MPI processes, N MPI . At each node, the number of running processes is N MPI /N N , and thus the available memory is given as M LOC = M/N N . Note that the available memory must be used not only the matrix storage (chunk memory) but for any other data required by the process. Unfortunately, the dummy memory, M D , is replicated N MPI /N N times, due to the distributed memory approach adopted here. Therefore, the actual available memory at each node, M AV , is reduced with respect to M N , and must be larger than the requested chunk memory, M LOC . These conditions are summarized in (11), that sets the limit for the pure MPI approach by imposing a maximum value to N MPI :

Parallelization Strategy Based on Hybrid OpenMP-MPI Approach
A hybrid OpenMP-MPI approach is here proposed to overcome the abovementioned limits of the pure MPI approach. The underlying idea is to allocate in each node only one instance of the M D memory, and then to use all the physical cores only for computation. This means that we must logically separate allocation from computation, that is forbidden in a pure MPI paradigm, which maps one-to-one cores and processes. The solution could be provided by the use of the OpenMP environment directives: the node computation is broken down among the physical cores (threads) present in the node, that share the same memory on the node thanks to the shared memory approach implemented by OpenMP.
A hybrid OpenMP-MPI approach may be then implemented, based on the two following steps: (1) as in the pure MPI paradigm, the overall computation is partitioned in MPI processes, limiting the number of the processes at node level (in the ideal case this number would be 1); (2) the computational burden of each MPI process at node level is divided (again) in several threads, in accordance with the characteristics of the OpenMP paradigm.
In order to investigate the memory allocation scheme of the hybrid approach, we firstly recall that in OpenMP environment, each thread should allocate its local memory in order to face its own computation. Note that the dimension of the dummy memory per thread, M DT , is much lower than M D . In order to speed up the inter-thread computation and to fully gain from the modern CPU architecture, M DT should match the local core cache size (cache coherence, e.g., [27]). Let us now define as N T the number of threads per node: if we use one MPI process per node, and after considering that M D is allocated only once, Condition (11) on the available memory for the pure MPI approach is replaced for the hybrid OpenMP-MPI one by the following: In view of comparing the two approaches, it is convenient to introduce the following speed-up parameters: S MPI = N MPI (for pure MPI approach) and S H = N T N N (for hybrid OpenMP-MPI). They can be expressed from (11) and (12), as follows: The above relations set the upper bounds for the two speed-up parameters, that linearly depend on the number of nodes, N N : Figure 2 shows the corresponding straight lines, whose intersection occurs at: it is always S H > S MPI and the distance between the two bounds increases monotonically, as shown in Figure 2.  Indeed, this result can be demonstrated by considering the ratio between the slopes of the straight lines-from (13), as follows: Indeed, this result can be demonstrated by considering the ratio between the slopes of the straight lines-from (13), as follows: where the inequality holds since M DT < M D , and M D M N , hence M N /(M N − M D ) ≈ 1. As clearly shown in Figure 2, this hybrid OpenMP-MPI approach provides two advantages over the pure MPI approach, as follows: (i) resources saving-for a fixed speed-up S, the required number of nodes N N is lower; (ii) speed-up advantage-for fixed resources (N N ), the speed-up, S, is higher.
In addition, as pointed out in the Introduction, the approach also reduces the communication burden, because the parallel routines impose a fast communication between each thread in the node.
It is worth noticing that it is not trivial to accomplish OpenMP-MPI paradigm, since three critical issues arise, as follows: (i) Need for local thread memory-the main loop is distributed among all available threads, each of them requesting its local memory to work properly. This memory is related to mesh element information (e.g., the curls of the elements, the shape functions, geometrical information, element local output, and so on). Anyway, it is usually very small (few GBs) and, in cases of practical interest, is much smaller than the required output global node memory. (ii) Global matrix memory update-once the single thread made its own job, the thread local output should be accommodated into the global memory of the node. This is a non-trivial operation, and it is a bottleneck for the method, because the local update must be carried out by each thread in a sequential access, so to guarantee consistency. To this end, the CRITICAL OpenMP directive is used [6]. (iii) Global input memory access-this access is critical being shared between the threads.
It may cause "cache missing".
Of course, all of the above issues cause a degradation of the performances; moreover, as we can see in Section 4, they do not significantly affect the obtained speed-up parameters.
Let us now describe the changes introduced by the proposed hybrid approach in the algorithms for assembling the matrices and vectors of the Numerical Problem (6). In building the known-terms vector V 0 , a hybrid implementation of the external loop (over the mesh elements) is handled by all threads. The local memory V0_loc is automatically obtained by the means of OpenMP reduction operations (Algorithm 4).
The new algorithm to be used to assembly the matrix L is described in Algorithm 5. In the MQS applications, such as those analysed in this paper, the fully populated matrix L is by far the largest quantity. The entries of this matrix are arranged in a 2D cyclic block size fashion, in view of using the Scalapack inversion routine [24,26]. To this end, the entries MdL are reorganized in M LOC in the last step (final communication step). Once again, we stress the advantage of the hybrid approach over the pure MPI approach: since it is usually N N ≈ N MPI , the memories MdMesh and MdL (by far the largest ones) are allocated only once in the node, instead of being allocated for each process in the pure MPI approach.
Finally, the Q matrix can be very large if the computation of the flux density is required in a huge number of points and/or in each element of the mesh. With the hybrid approach, the external element loop is distributed among the threads in the current MPI process, see Algorithm 6.

Case Studies and Discussion
The hybrid OpenMP-MPI approach presented in this paper has been implemented in two parallel computing systems with different characteristics: a proprietary system (SUNCUDA Cluster) and a supercomputing facility (MARCONI Cluster) [28]. The details of such systems are given in Table 1. Note that the proposed hybrid paradigm is general and can be applied to computing systems with different features.

A Benchmark Case Study
A preliminary case, that is studied as a benchmark, refers to the simple MQS problem described in Figure 3, with a pancake coil that is inducing currents into a nearby conducting plate. The numerical model of the MQS problem is characterized by a mesh with 13,500 elements, 15,376 points, and 25,650 DoFs. Although the size is not so large to require a parallel computation approach, we analyse this case study with the only scope of comparing the performance of OpenMP only and OpenMP-MPI approaches.
The assembly of V0 and L (tasks (a) and (b) defined in Section 2) have been performed by using both OpenMP only and OpenMP-MPI approaches, and the results are provided in Table 2. Specifically, the speed-up obtained has been evaluated with respect to the use of 1 OpenMP thread and 1 MPI node, that is assumed as the reference. The computation times for the reference case are 8.93 s and 2693 s for tasks (a) and (b), respectively. Table 2 shows an almost linear increase in the speed-up values versus the number of threads, with a slight degradation obtained for task (b). The degradation comes from two major critical issues, as follows: (i) a larger thread memory is needed for evaluating integral (8), that is characterized by element-element interactions; (ii) the output of this calculation has to be stored at the node level in a matrix that is shared among all the threads. To guarantee consistency, the update of such a matrix is performed by a sequential access of the threads. The advantage of using the proposed hybrid OpenMP-MPI approach is the possibility to optimize the memory and to obtain the same speed-up for both the assembly tasks.
issues, as follows: (i) a larger thread memory is needed for evaluating integral (8), that is characterized by element-element interactions; (ii) the output of this calculation has to be stored at the node level in a matrix that is shared among all the threads. To guarantee consistency, the update of such a matrix is performed by a sequential access of the threads. The advantage of using the proposed hybrid OpenMP-MPI approach is the possibility to optimize the memory and to obtain the same speed-up for both the assembly tasks.

Case Study 1: A Plasma Ring
Case study 1 refers to the study of the interaction of a plasma ring subjected to an asymmetric kink instability in the presence of a conducting vacuum vessel and a set of toroidal field coils, as shown in Figure 4. The kink is described by assuming the plasma as a single tilted and shifted filament, with respect to its axisymmetric equilibrium position. In Figure 4a, we show a sector of 90 • of the total mesh, together with the surface enclosing the plasma rings necessary for integrating the Maxwell stress tensor. In Figure 4b, we show the time history of the x component of the resultant force due to the interaction between the plasma ring and the toroidal field, computed by means of the j × B expression and by using the Maxwell stress tensor. We notice that this method is very efficient in presence of a 3D plasma current distribution with time-varying shape and position. For this case, we adopted a mesh with 44,200 elements, 88,400 nodes, 12,600 points, and N do f = 44,202. For eddy current problems in nuclear fusion machines, such as in this case study and in case study 2, discussed later on, a satisfactory estimation of the current density, j, in the conductors, comes from the choice of linear basis functions and a number of Gauss points equal to 2, in the numerical computation of the integrals-see [21].

Case Study 2: Fusion Reactor Eddy Currents Analysis
Case study 2 is taken from a typical eddy current problem in nuclear fusion machines, associated with the fast electromagnetic transients produced by disruption of the confined plasma. The reference machine is the fusion reactor ITER [18]. By exploiting the symmetry and periodicity of the problem, the computational domain can be limited to a sector of 40° of the tokamak (Figure 5a). The main conducting components falling into such a sector have been modelled, such as the vacuum vessel, the thermal shield, the coils, and the supports, etc. The 3D mesh consists of 80,573 elements, 138,762 nodes, and = 106,544 DoFs. As pointed out, the boundary conditions impose cyclic symmetry at ±20°. For this case study, the computational effort for task (c) has been evaluated-evaluation of the flux density matrix Q. Here, the reference time is given by the solution evaluated with 10 nodes, 2 MPI tasks per node, and 1 OpenMP thread: such a time is 1932 s on the Marconi cluster. Table 3 shows the speed-up values as functions of the total nodes, of the OpenMP threads and of the MPI tasks. The speed-up value depends almost linearly by the number of MPI tasks and OpenMP threads. It is noteworthy that this performance is obtained with a proper handling of the memory burden, since the memory is equally partitioned between the nodes involved in the computation. From Table 3, we can see that using 40 nodes and 1 MPI task per node provides a lower speed-up with respect to using 20 nodes and 2 MPI tasks per node. This is due to the reduction in the amount of communication needed in the second case, as this amount depends on the granularity of the test case.

Case Study 2: Fusion Reactor Eddy Currents Analysis
Case study 2 is taken from a typical eddy current problem in nuclear fusion machines, associated with the fast electromagnetic transients produced by disruption of the confined plasma. The reference machine is the fusion reactor ITER [18]. By exploiting the symmetry and periodicity of the problem, the computational domain can be limited to a sector of 40 • of the tokamak (Figure 5a). The main conducting components falling into such a sector have been modelled, such as the vacuum vessel, the thermal shield, the coils, and the supports, etc. The 3D mesh consists of 80,573 elements, 138,762 nodes, and N do f = 106,544 DoFs. As pointed out, the boundary conditions impose cyclic symmetry at ±20 • .  Table 4 provides the computed costs of the assembly of L (task b). As a reference case, here we have assumed the choice of 22 OpenMP threads and 1 MPI task per node. The reference time is 16,064 s. Table 5 reports the speed up values for the inversion of Z (task d), with the reference solution given by 22 OpenMP threads and 2 MPI tasks per node, with 2 nodes. The reference time is here 907 s. Once again, the behavior of the speed-up scale is quite linear when increasing the number of nodes, with a limited degradation. Comparing the reference time for task b (assembly of L) and task d (inversion), we observe that this case study is another example where the inversion time is much smaller than the assembly one, as already pointed out in Section 2 and in discussing the benchmark case study. In addition, the numerical solution of problems such as those in case studies 1 and 2 exploits the symmetry of the machines, thus limiting the solution domain to a sector. Moreover, the periodic symmetric conditions require considering n replicas of the DoF during the assembly phase, thus increasing the corresponding time, without affecting the inversion time.   Table 4 provides the computed costs of the assembly of L (task b). As a reference case, here we have assumed the choice of 22 OpenMP threads and 1 MPI task per node. The reference time is 16,064 s. Table 5 reports the speed up values for the inversion of Z (task d), with the reference solution given by 22 OpenMP threads and 2 MPI tasks per node, with 2 nodes. The reference time is here 907 s. Once again, the behavior of the speed-up scale is quite linear when increasing the number of nodes, with a limited degradation. Comparing the reference time for task b (assembly of L) and task d (inversion), we observe that this case study is another example where the inversion time is much smaller than the assembly one, as already pointed out in Section 2 and in discussing the benchmark case study. In addition, the numerical solution of problems such as those in case studies 1 and 2 exploits the symmetry of the machines, thus limiting the solution domain to a sector. Moreover, the periodic symmetric conditions require considering n replicas of the DoF during the assembly phase, thus increasing the corresponding time, without affecting the inversion time.

Conclusions
A hybrid parallel computation approach implementing the joint use of MPI and OpenMP paradigms is here implemented and applied to challenging numerical magnetoquasi-static (MQS) problems. The approach has been used to parallelize the assembly of the inductance matrix: it has been shown that this task may be cost demanding as well as involving the task of matrix inversion, when high accuracy is required in the solution of MQS problems (for instance, when using 2 or more integration Gauss points). The pure MPI and the hybrid OpenMP-MPI approaches have been theoretically compared in terms of their computational performance in assembling such a matrix. The hybrid OpenMP-MPI approach saves resources (since it requires a lower number of nodes for a fixed computational time) and provides better performance (it requires lower times for fixed number of nodes). The analyzed case studies demonstrate speed-up values up to about 50×.
The proposed approach has a potential great impact on the efficient use of parallel computational resources, suggesting an efficient handling of tasks and threads, to avoid resources (cores and/or ram) being idle while the other processes are running elsewhere.