Parallel Multiphysics Simulation of Package Systems Using an Efficient Domain Decomposition Method

Abstract: With the continuing downscaling in feature sizes, the thermal impact on material properties and geometrical deformations can no longer be ignored in the analysis of the electromagnetic compatibility or electromagnetic interference of package systems, including System-in-Package and antenna arrays. We present a high-performance numerical simulation program intended to perform large-scale multiphysics simulations using the finite element method. An efficient domain decomposition method was developed to accelerate the multiphysics loops of electromagnetic–thermal stress simulations by exploiting the fact that the electromagnetic field perturbations caused by geometrical deformation are small and confined to one or a few subdomains. The multi-level parallelism of the algorithm was also obtained based on an in-house developed parallel infrastructure. Numerical examples showed that our algorithm enables simulation with multiple processors in parallel and, more importantly, achieves a significant reduction in computation time compared with traditional methods.


Introduction
Multifunctional integration and miniaturization are the developing directions of modern electronic systems [1][2][3]. These electronic systems are typically integrated with high-density passive devices or interconnects, where signal delay, reflection, attenuation, and crosstalk have become the dominant factors limiting overall performance as feature sizes continue to shrink and operating frequencies increase [4][5][6][7]. When multiphysics field effects emerge, traditional simulations cannot adequately handle the new challenges of coupled electromagnetic, thermal, and mechanical effects, especially the Joule heating caused by increased input power. Electromagnetically induced heat generation and the resulting elevated temperatures adversely affect the performance and reliability of these devices. All these challenges are pushing the transition from traditional electromagnetic simulations to multiphysics simulations of package systems. Multiphysics analysis tools in commercial software, like ANSYS WORKBENCH and COMSOL, can handle the multiphysics simulation of electronic elements like antennas or waveguides but not of entire electronic systems. Related literature on efficient system-level multiphysics models [8,9], where electromagnetic models and thermal resistance models were developed, has been published. However, neither the size effects of package systems operating at high frequency nor the mechanical effects have been considered in those multiphysics simulations.
Full-wave electromagnetic simulations of such systems have been conducted with numerical methods and their parallel algorithms, such as the finite element method (FEM) [10], the method of moments (MoM) [11], and the finite difference time domain (FDTD) method [12]. However, practical package systems pose immense challenges for existing numerical methods like MoM and FDTD. Many papers have adopted the FEM thanks to its flexibility in handling geometry and materials. Considering the ill-conditioned nature of the FEM matrix in electromagnetic simulation, the recently developed domain decomposition method (DDM) [13][14][15], iterative and hierarchical matrix methods [16], and direct methods are regarded as effective ways to solve the FEM matrix equations. A previous study [14] proposed a double-level parallel scheme of the FEM that solved over three billion unknowns on 9600 CPU cores, though only for electromagnetic simulation. Additionally, many investigations have numerically analyzed multiphysics effects on the operating state of package systems [17,18], including the present authors' recent work [19]. However, that work [19] did not consider the thermal or mechanical impact on electrical performance. Furthermore, even with the newly developed methods, hundreds of iterations are usually needed for large-scale problems, which makes the multiphysics simulation of real-life package systems difficult: the electromagnetic solve alone can take several hours, which is unaffordable when repeated many times.
Considering that similar problems are solved several times during multiphysics simulations, it is necessary to improve the efficiency of the repeated computations, for which mesh deformation/morphing techniques exist [20][21][22]. In this approach, an existing mesh can be smoothly transformed to conform to the geometrical modifications caused by thermal strains. However, a linear system must be solved to update the mesh. Considering that a simulated system is electrically small within the frequency band of interest, the authors of [23] suggested recycling the Krylov subspace to reduce the total number of iterations. A novel embedded DDM was also presented for repeated electromagnetic modeling in design flows while considering the geometrical and material differences between subdomains [24]. In fact, the embedded DDM employs the background field as the initial guess for the new linear system, and a large number of unknowns are introduced because there are overlapping areas between the subdomains and the background domain.
In this paper, we propose an efficient parallel multiphysics simulation strategy. A nonoverlapping DDM is used to compute the electromagnetic field, and then thermal and stress computations are conducted to update the temperature, material properties, and geometrical deformations, thus forming a multiphysics simulation loop. Here, instead of a background electromagnetic field, an efficient interpolation by means of subdomain boundary values is used to obtain much faster convergence of the electromagnetic computation, which finally accelerates the whole multiphysics simulation by several times. Mathematically, this method is an efficient implementation of model order reduction (MOR) by means of the DDM, and the motivation for our work came from the fact that the change of the solving/searching space is quite small during repeated computing. Physically, the geometrical deformation of package systems has little influence on the overall electromagnetic distribution but does influence the regions surrounding the deformation, since high-order electromagnetic modes can only propagate across a few adjacent subdomains. Therefore, subdomain boundary values give a good estimate of the change. Furthermore, the multi-level parallelism of the algorithm was also obtained based on the in-house developed infrastructure JAUMIN (J Parallel Adaptive Unstructured Mesh Applications Infrastructure) [25]. Numerical examples with millions of unknowns are presented to validate our approach and show the performance of the proposed parallel algorithm.
The remainder of the paper is organized as follows: Section 2 describes the mathematical model of our multiphysics problems. Section 3 presents the parallel implementation including an in-house developed parallel infrastructure and a parallel scheme of the multiphysics simulation method. A validation example and two real-life numerical examples of the proposed work are discussed in Section 4. Section 5 concludes the paper.

Boundary Value Problem of Electromagnetic Field
We start by reviewing the Helmholtz equation defined in an open region Ω. A Silver-Müller radiation condition is used to truncate the infinite domain. This leads to the following boundary value problem:

∇ × (µ_r⁻¹ ∇ × E) − k² ε_r E = −jkη J in Ω,
lim_{r→∞} r [(∇ × E) × r̂ + jk E] = 0,

where E is the electric field with a unit of V/m, J is the current density with a unit of A/m², ε_r = ε/ε_0 is the relative permittivity, and µ_r = µ/µ_0 is the relative permeability, with ε_0 = 8.85 × 10⁻¹² F/m and µ_0 = 4π × 10⁻⁷ H/m. ω is the angular frequency with a unit of rad/s, k = ω√(µ_0 ε_0) is the free-space wavenumber, and j = √−1 is the imaginary unit. The free-space intrinsic impedance is denoted as η = √(µ_0/ε_0). The tangential trace and the tangential component trace operators for any vector v are given by γ_t(v) = n × v and γ_T(v) = (n × v) × n, and [v] refers to the associated field jump across a surface. An efficient DDM solver is used to solve the electromagnetic problems, and its implementation is described in the following section.

Electromagnetic-Thermal Stress Coupling
Consider a domain Ω in which an electromagnetic field is calculated; the governing equations for the steady thermal system are given by

−∇ · (κ∇T) = Q_e in Ω, κ ∂T/∂n = h(T_a − T) on the convection boundaries,

where T is the temperature with a unit of K, κ is the thermal conductivity with a unit of W/(K · m), h is the thermal convection coefficient with a unit of W/(K · m²), and T_a is the ambient temperature. The heat source Q_e (W/m³) represents the electromagnetic losses and is given by Q_e = Q_rh + Q_ml, where the resistive losses are Q_rh = (1/2) Re(J · E*) and the magnetic losses are Q_ml = (1/2) Re(jωB · H*), in which H is the magnetic field with a unit of A/m and B is the magnetic flux density with a unit of T. In addition, the surface loss Q_sh = (1/2) Re(Z_S) |J_S|² maps the electromagnetic surface losses of the conductors as a heat source on the boundary, where J_S represents the surface electric current and Z_S represents the surface impedance. The governing equation of the thermal stress simulation was derived according to both Hooke's and Newton's laws, as given by

[∂]ᵀ σ + f = 0,

in which σ = [σ_x σ_y σ_z σ_xy σ_yz σ_zx]ᵀ is the stress vector and f is the volume force. U = [u_x u_y u_z]ᵀ, where u_x, u_y, and u_z are the structural displacements in the x, y, and z directions, respectively.
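The loss terms above translate directly into code. The following is a minimal NumPy sketch (not the paper's FEM implementation) of the time-averaged resistive and magnetic loss densities for phasor fields sampled per mesh cell; the function names and array layout are our own assumptions.

```python
import numpy as np

def resistive_losses(J, E):
    """Time-averaged resistive loss density Q_rh = 0.5 * Re(J . conj(E)).

    J, E: complex arrays of shape (n_cells, 3) holding the phasor current
    density (A/m^2) and electric field (V/m) in each mesh cell.
    Returns the loss density in W/m^3 per cell.
    """
    return 0.5 * np.real(np.einsum("ij,ij->i", J, np.conj(E)))

def magnetic_losses(B, H, omega):
    """Time-averaged magnetic loss density Q_ml = 0.5 * Re(j*omega*B . conj(H))."""
    return 0.5 * np.real(1j * omega * np.einsum("ij,ij->i", B, np.conj(H)))

# For a lossy dielectric J = sigma * E, so Q_rh reduces to 0.5 * sigma * |E|^2.
sigma = 2.0
E = np.array([[1.0 + 1.0j, 0.0, 0.0]])
Q = resistive_losses(sigma * E, E)  # 0.5 * 2 * |1+1j|^2 = 2.0 W/m^3
```

When B and H are in phase (a lossless magnetic material), Q_ml vanishes, as the real part of a purely imaginary product is zero.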
[∂] is a differential operator matrix mapping displacements to strains, defined by

[∂]ᵀ = [ ∂/∂x 0 0 ∂/∂y 0 ∂/∂z ; 0 ∂/∂y 0 ∂/∂x ∂/∂z 0 ; 0 0 ∂/∂z 0 ∂/∂y ∂/∂x ],

and D is the elastic constant tensor, whose elements are determined by the Young's modulus and Poisson's ratio of the material. The thermal strain in Equation (7) is ε_th = αΔT [1 1 1 0 0 0]ᵀ, in which α is the coefficient of thermal expansion with a unit of 1/K, and ∆T is the temperature variation obtained by solving the heat conduction equation.

DDM Solver and Multiphysics Simulations
To directly implement the DDM, a generalized Schwarz algorithm for Equation (1) is used for two subdomains Ω = Ω₁ ∪ Ω₂; the subdomain problems are then solved in an iterative way for n = 1, 2, · · · , N:

∇ × (µ_r⁻¹ ∇ × E_j^n) − k² ε_r E_j^n = −jkη J in Ω_j (j = 1, 2),
T_{n₁}(E₁^n) = T_{n₁}(E₂^{n−1}) on Γ₁₂, T_{n₂}(E₂^n) = T_{n₂}(E₁^{n−1}) on Γ₂₁,

where Γ₁₂ = ∂Ω₁ ∩ Ω₂, Γ₂₁ = ∂Ω₂ ∩ Ω₁, T_{n_j} represents the transmission condition, and n_j denotes the unit outward normal vector of the subdomain Ω_j. The transmission condition can be written as a combination of first- and second-order operators weighted by the parameters δ_l (l = 1, 2, 3, and 4), where S_TM = ∇_τ ∇_τ ·, S_TE = ∇_τ × ∇_τ × ·, and τ denotes the tangential direction. The DDM with non-overlapping subdomains requires a matching grid between two adjacent subdomains. The computational domain is first meshed and decomposed into M equally sized subdomains automatically by the in-house developed parallel infrastructure JAUMIN. Each subdomain has a ghost region that extends several grid layers into its neighboring domains; the ghost region facilitates the implementation of the DDM. On each subdomain, we define discrete trial and test functions u_h^m, v_h^m ∈ X_h^m ⊂ H(curl; Ω_m); for order p = 0, there are 6 vector basis functions within each tetrahedron. The final weak form of the DDM with the Robin transmission condition, with the absorbing boundary condition implementation omitted, reads: find u_h^m such that

(µ_r⁻¹ ∇ × u_h^m, ∇ × v_h^m)_{Ω_m} − k² (ε_r u_h^m, v_h^m)_{Ω_m} + jk ⟨n × u_h^m, n × v_h^m⟩_{Γ_mj} = −jkη (J, v_h^m)_{Ω_m} + ⟨g_jm, n × v_h^m⟩_{Γ_mj},

where g_jm collects the Robin data transferred from the neighboring subdomain Ω_j. Here, we define the volume and surface integral forms by (u, v)_Ω := ∫_Ω u · v̄ dv and ⟨u, v⟩_∂Ω := ∫_∂Ω u · v̄ ds, where the overbar denotes conjugation. The parameters in Equation (13) are set to δ_l = 0 (l = 1, 2, 3, and 4). The unit normal vector n of the interface boundary Γ_ij points away from the subdomain Ω_i. The matrix equation resulting from the above formulas can be written in a compact form; for the case of two subdomains, we have

[ A₁ C₁₂ ; C₂₁ A₂ ] [ x₁ ; x₂ ] = [ b₁ ; b₂ ],

where A₁ and A₂ are the subdomain matrices and C₁₂ and C₂₁ are the two interface coupling matrices.
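The compact two-subdomain system above lends itself to a simple block iteration. Below is a hedged NumPy sketch (dense stand-ins, not the paper's sparse MUMPS-based subdomain solver) of a block Gauss-Seidel sweep over [[A1, C12], [C21, A2]]: each subdomain solve moves the neighbor's latest interface contribution to the right-hand side, mirroring the Schwarz exchange of transmission data.

```python
import numpy as np

def ddm_block_iteration(A1, A2, C12, C21, b1, b2, x1, x2, n_iter=50, tol=1e-10):
    """Block Gauss-Seidel sketch for the two-subdomain DDM system
    [[A1, C12], [C21, A2]] [x1; x2] = [b1; b2].

    Each sweep solves one subdomain problem with the neighbor's current
    interface contribution on the right-hand side. Dense solves stand in
    for the sparse direct subdomain factorizations used in practice.
    """
    for _ in range(n_iter):
        x1 = np.linalg.solve(A1, b1 - C12 @ x2)
        x2 = np.linalg.solve(A2, b2 - C21 @ x1)
        res = np.linalg.norm(b1 - A1 @ x1 - C12 @ x2) + \
              np.linalg.norm(b2 - C21 @ x1 - A2 @ x2)
        if res < tol:
            break
    return x1, x2
```

For weakly coupled, diagonally dominant blocks the iteration converges in a handful of sweeps; in the actual DDM the convergence rate is governed by the transmission conditions.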
It is instructive to first consider the thermal strain in Equation (10) for the material with the maximum thermal expansion, which in package systems is usually copper with α = 1.65 × 10⁻⁵ 1/K. An estimate of 1.0 × 10⁻³ m for the largest geometrical deformation during the operating state of a system is therefore reasonable. This geometrical deformation is far smaller than the wavelength of interest. Therefore, geometrical deformation has little influence on the electromagnetic distribution of the whole region but does influence the subdomains surrounding the deformation. This is mainly because the high-order electromagnetic modes can only propagate across a few adjacent subdomains. As a result, the DDM can achieve very fast convergence when initialized with the interpolation of subdomain boundary values. The strategy of the DDM-accelerated multiphysics simulation is illustrated in Algorithm 1.
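Algorithm 1 is not reproduced here, but its loop structure, as we read it from the text, can be sketched with scalar stand-ins for the three field solvers. The coefficients below are illustrative only (the copper expansion coefficient is the one quoted above); the real solvers are FEM-based.

```python
def multiphysics_loop(power_w, max_loops=16, temp_tol=0.1):
    """Toy fixed-point sketch of the multiphysics loop of Algorithm 1.

    Scalar stand-ins for the coupled physics: the EM loss heats the
    structure, the temperature expands it, and the deformation slightly
    perturbs the EM loss, so the loop iterates until the temperature
    change between two adjacent loops falls below 0.1 K (as in the paper).
    """
    T_amb, T_prev = 298.15, None
    deformation = 0.0
    for loop in range(1, max_loops + 1):
        loss = power_w * (1.0 + 0.5 * deformation)   # EM solve stand-in
        T = T_amb + 0.8 * loss                       # thermal solve stand-in
        deformation = 1.65e-5 * (T - T_amb)          # stress solve stand-in
        if T_prev is not None and abs(T - T_prev) < temp_tol:
            break                                    # converged
        T_prev = T
    return T, loop
```

Because the feedback from deformation to the EM loss is tiny, the toy loop converges in two or three passes, which is exactly the behavior the acceleration strategy exploits: after the first loop, the electromagnetic solve only needs to correct a small perturbation.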

Parallel Infrastructure and Parallel Algorithm
JAUMIN is an in-house developed programming framework for massively parallel numerical simulations on unstructured meshes. It facilitates the development of large-scale parallel unstructured-mesh applications and supports applications scaling to tens of thousands of CPU cores. As shown in Figure 1, the geometry is discretized into a mesh level and then subdivided into subdomains. These subdomains are assigned to the computing nodes of the supercomputer and exchange data through explicit MPI calls. Within each node, the domain is further divided into smaller sub-domains called regions. The regions are bound to CPU sockets and communicate with each other through non-uniform memory access (NUMA)-aware message passing or multi-threading. Within each CPU socket, the region is finally divided into patches; usually, one patch is assigned to one core. The data communications among patches are implemented by multi-threading. In the CPU core, the mesh cells are transferred into the private cache, where SIMD parallelization can be applied.

Figure 1. Nested mesh hierarchy model.
The proposed multi-level parallel scheme of Algorithm 1 is presented in Figure 2. As indicated above, after reading the mesh, JAUMIN generates patch hierarchies and distributes them to every process according to the inherent workload distribution scheme. To make sure that parallel processes do equal amounts of work, spatial locality can also be addressed with dynamic scheduling by defining a minimum number of successive subdomains assigned to each process. The DDMSolver in Algorithm 1 is marked with a red line. The subsystem in a patch is solved individually by a direct method and parallelized among the threads. Once the electromagnetic solving is finished, the thermal stress coupling system is constructed and solved by calling the parallel linear solver.
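The idea of assigning a minimum number of successive subdomains per process can be sketched as follows. This is our assumption of how such a locality-preserving distribution behaves, not JAUMIN's actual API.

```python
def distribute_subdomains(n_subdomains, n_procs, min_chunk=4):
    """Hand out subdomains in contiguous chunks of at least `min_chunk`
    so spatial locality is preserved, while round-robin assignment over
    the processes keeps the per-process counts balanced."""
    assignment = [[] for _ in range(n_procs)]
    chunk_start, proc = 0, 0
    while chunk_start < n_subdomains:
        chunk = list(range(chunk_start,
                           min(chunk_start + min_chunk, n_subdomains)))
        assignment[proc % n_procs].extend(chunk)
        chunk_start += min_chunk
        proc += 1
    return assignment
```

A larger `min_chunk` improves locality (neighboring subdomains share ghost data on one process) at the cost of coarser load balancing; dynamic scheduling would additionally reassign chunks at run time.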

Numerical Examples
We now examine the performance of the proposed multiphysics algorithm by validating its efficiency. In all the numerical experiments, the stopping criterion of the electromagnetic DDMSolver was a reduction of the residual by a factor of 1.0 × 10⁻³ unless otherwise indicated. A maximum of 16 multiphysics loops was performed in our program, and convergence of the multiphysics simulation was declared once the temperature change between two adjacent loops was less than 0.1 K. Furthermore, the multiphysics system was built in parallel, and the parallel Portable, Extensible Toolkit for Scientific Computation (PETSc) library [26] was employed to solve the linear systems. Specifically, the sub-linear systems of the DDM were solved by the PETSc library together with the MUltifrontal Massively Parallel sparse direct Solver (MUMPS) [27]. The linear systems of the thermal stress coupling were solved by a GMRES solver with an AMG preconditioner from the HYPRE library [28].
A computer platform with NUMA architecture, the TS10K cluster designed and manufactured by INSPUR Co., Ltd. (Jinan, China), was used to acquire the simulated results. The platform is equipped with 57 compute nodes, each with two Intel Xeon E5-2692 v2 CPUs (2.20 GHz, 12 cores, 30 MB L3 cache, 8.0 GT/s, 1866 MHz), 32 GB of ECC registered DDR3-1600 RAM, and Linux Enterprise 6.3 x64 with kernel 2.6.32-279.el6.x86_64.


Numerical Validation
Our first simulation validated the accuracy of our program using an antenna element with an operating frequency of 3.0 GHz and an input power of 50 W. A mesh of 503,370 grids was generated, and 12 CPU cores were used for 12 subdomains to process the DDM iterations. Absorbing boundary conditions were applied for the electromagnetic simulation, and a wave port was used to excite the antenna. For the thermal simulation, all boundaries exterior to the air were set as thermal convection boundaries with h = 5 W/(K · m²) and T_a = 298.15 K. For the stress simulation, the bottoms of the antennas were fixed with U = [0 0 0]ᵀ. The calculated highest temperature agreed well with that of the commercial software ANSYS, as listed in Table 1. Figure 3 shows the geometrical details and the calculated temperature distribution.



Multiphysics Simulation of Real-Life System-In-Package
As shown in Figure 4, we then considered a real-life System-in-Package. It was stacked with 12 boards consisting of high-density interconnects and a total of 4800 entities, with a minimum feature size of 10 µm and an overall size of over 10 mm. A mesh of 5,830,000 grids was generated, and 140 CPU cores were used to perform the multiphysics simulations. Absorbing boundary conditions were applied for the electromagnetic simulation, and a lumped port was used to excite the system. For the thermal simulation, all boundaries exterior to the air were set as thermal convection boundaries with h = 5 W/(K · m²) and T_a = 298.15 K. For the stress simulation, the bottom of the substrate was fixed with U = [0 0 0]ᵀ.
The calculated results are summarized in Table 2. The multiphysics simulation procedure and the calculated maximum temperature values of the two simulation cases are demonstrated in Figures 5 and 6, respectively. The convergence of the S parameter is shown in Figure 7; this parameter was defined as S = (V − V_in)/V_in, where V is the voltage of the port and V_in is the input voltage. It is worth noting that this real-life System-in-Package cost the DDMSolver 468 iterations to converge at the first multiphysics loop. Fortunately, the DDM iteration number dropped quickly for both cases with different microwave input power values. The results in Figures 5 and 6 show that the algorithm performance did not degrade with larger geometrical deformation; the total computation time is given in Table 2, with the last 15 multiphysics loops taking only about 35% of the total computing time. It is reasonable to note that slower convergence always accompanied larger input power, i.e., larger temperature variation and larger geometrical deformations.
A deviation of 0.1 dB between the two cases can be seen when comparing Figures 5 and 6. It was necessary to increase the input power gradually, e.g., from 10.0 to 50.0 W with a step of 10.0 W, during the multiphysics simulations. The distributions of the electric, temperature, and strain fields are illustrated in Figure 8.
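The S-parameter definition used above can be evaluated and reported in decibels with a few lines. The dB conversion 20·log10|S| is the usual convention and is our addition; the text only defines S itself.

```python
import math

def s_param_db(V, V_in):
    """S parameter as defined in the text, S = (V - V_in)/V_in,
    reported in decibels as 20*log10(|S|). V and V_in may be complex
    phasor voltages; abs() handles both real and complex inputs."""
    s = (V - V_in) / V_in
    return 20.0 * math.log10(abs(s))
```

A port voltage close to the incident voltage gives a deeply negative S parameter in dB, i.e., little reflection at that port; for example, a 10% voltage deviation corresponds to −20 dB.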

Multiphysics Simulation of Antenna Array and Radome
Our last numerical example was the multiphysics simulation of a real-life antenna array, shown in Figure 9a, which consisted of 33 antenna elements and a multilayer radome with an attached frequency selective surface. Each antenna element operated at a frequency of 2.6 GHz with an input power of 100.0 W. The geometrical scale spanned three orders of magnitude, leading to a total of 96,800,000 grids and 125,846,630 unknowns for the electromagnetic simulation, 12,883,936 unknowns for the thermal simulation, and 38,651,808 unknowns for the stress computation. We used 196 CPU cores for 3072 subdomains to perform the electromagnetic DDM iterations. Absorbing boundary conditions were applied for the electromagnetic simulation, and 33 wave ports were used to excite the array. For the thermal simulation, all boundaries exterior to the air were set as thermal convection boundaries with h = 5 W/(K · m²) and T_a = 298.15 K.

The multiphysics simulation information is summarized in Table 3, and the simulation procedure is demonstrated in Figure 10. The simulation cost the DDMSolver 304 iterations to converge at the first multiphysics loop, after which the number of DDM iteration steps dropped quickly to around 5. The total computation time was 64,965 s, with the first loop taking 58,468 s; in contrast, 935,468 s were consumed for 16 multiphysics loops when traditional methods were used. Thus, it is reasonable to regard our method as being roughly 10 times faster than the traditional method. A final maximum temperature of 499.0 K and an S parameter of −8.66 dB for the first element were achieved. Additionally, the distributions of the electric, temperature, and strain fields of the antenna array, together with the radome, are illustrated in Figure 9b-d, respectively.


Conclusions and Future Works
In this paper, we developed an efficient algorithm that can dramatically speed up multiphysics simulations by reusing the solving space of the DDM. The multi-level parallelism of the proposed method was obtained based on an in-house developed parallel infrastructure called JAUMIN, which is able to handle large-scale problems with millions of unknowns. The simulation efficiency was characterized by several numerical experiments, which showed a dramatic reduction in computation time compared with traditional methods.
Simulations of large-scale problems with over 100 million unknowns are needed to further verify the advantages of the proposed method, and more research is needed to reduce the execution time for larger problems, especially that of the first multiphysics loop. To this end, higher-order transmission conditions for the DDM can be developed to accelerate the convergence of the electromagnetic solving. Moreover, investigations into extreme thermal conditions must be conducted for cases where the temperature rise is significant and the geometrical deformation can be as large as 10 × 10⁻² m. With this strategy, more comprehensive design methods, especially geometrical optimization and frequency sweeping, can also be developed.