Diagonalized Macromodels in Finite Element Method for Fast Electromagnetic Analysis of Waveguide Components

A new technique of local model-order reduction (MOR) in 3-D finite element method (FEM) for frequency-domain electromagnetic analysis of waveguide components is proposed in this paper. It resolves the problem of increasing solution time of the reduced-order system assembled from macromodels created in the subdomains, into which an analyzed structure is partitioned. This problem becomes particularly relevant for growing size and count of the macromodels, and when they are cloned in multiple locations of the structures or are used repeatedly in a tuning and optimization process. To significantly reduce the solution time, the diagonalized macromodels are created by means of the simultaneous diagonalization and subsequently assembled in the global system. For the resulting partially diagonal matrix, an efficient dedicated solver based on the Schur complement technique is proposed. The employed MOR method preserves frequency independence of the macromodels, which is essential for efficient diagonalization, as it can be performed once for the whole analysis bandwidth. The numerical validation of the proposed procedures with respect to accuracy and speed was carried out for varying size and count of macromodels. An exemplary finite periodical waveguide structure was chosen to investigate the influence of macromodel cloning on the resultant efficiency. The results show that the use of the diagonalized macromodels provided a significant solution speedup without any loss of accuracy.


Introduction
Waveguide components, such as filters, diplexers, power dividers/combiners, junctions, resonators, etc. have always been used in microwave technique.Although being gradually replaced by monolithic and hybrid integrated circuits, they are still indispensable in space applications or in millimeter-wave bands due to their low losses and high power handling capabilities.Since waveguide components have the size comparable to the wavelength and cannot be described by mens of currents and voltages, they can be accurately characterized only in rigorous full-wave electromagnetic analysis.The finite element method (FEM) is widely recognized for its ability to accurately solve frequency domain electromagnetic problems governed by the Maxwell equations in structures having arbitrary geometry, such as waveguide components [1].It is also known that in complex 3-D domains, especially comprising small geometric features, the discretization mesh may grow dramatically, leading to very large systems of linear equations [2].With rapidly increasing number of unknowns, they become very costly to solve in terms of both memory demand and computational time.An extensive mesh refinement is needed to conform the discretization mesh to both complex shapes and strong field variations within the structure.However, the resulting number of unknowns, regarded as the degrees of freedom (DOF) or the states of the FEM model, exceeds largely the necessary number of states of a model, which could be created to capture accordingly the electromagnetic behavior of the structure with respect to its external ports only.Such a reduced-order black-box model, meant to surrogate the original FEM problem, is constructed in a process called model-order reduction (MOR).It adopts the approach, formalism and selected techniques used in MOR for the linear time-invariant systems (LTI) [3] that have been widely employed in the analysis of large RLC networks (consisting of resistors, inductors and capacitors) [4][5][6].
This paper deals with MOR performed locally in subregions, into which the analyzed structure is partitioned.The resulting reduced-order models are called macromodels and this technique is referred to as macromodeling [7,8].The macromodels for each of the subregions are subsequently combined in the final system representing the whole structure.The advantage of macromodeling over the model-order reduction in the entire domain is that the matrix problems solved during the generation of macromodels are smaller and the reduction orders can be decreased selectively in each subdomain, providing an essential reduction of memory demand and computation time.In this respect, macromodeling can be considered as a combination of MOR and domain decomposition [9] and it shares the benefits of both.
One of the earliest works on macromodeling in FEM is presented in [10], in which macromodels, also called macro-elements, are developed in selected regions only in order to cover small geometric features requiring strong mesh refinement.Each macromodel represents a transfer functions between the electric and magnetic fields on its boundary and is built in the form of general impedance matrix (GIM), order reduced by means of PRIMA (passive reduced-order interconnect macromodeling algorithm) [11].A formulation of macromodels for fully segmented structures is proposed in [12] as a technique called SFELP (segmentation approach/finite elements/Lanczos-Pade).The Pade approximation via the Lanczos process (PVL) [13] is employed to build the macromodels defined as general admittance matrices (GAM).Based on the same MOR framework, a method of direct decomposition is developed in [14].Although in the aforementioned methods the most time consuming computations are performed once in an analysis bandwidth, allowing for fast frequency sweeping (FFS), the GIM and GAM macromodels are frequency dependent.Consequently, when used in eigenproblems, these methods require computationally demanding nonlinear eigensolvers.
The MOR employed in this paper is based on the technique that alleviates this problem.It is proposed in [15] for 2-D problems and extended to a 3-D formulation in [16].Unlike the reduction of GIM and GAM, this MOR procedure is applied to a transfer function between fields on the boundary of macromodel region and in its interior.Each macromodel is represented by a couple of matrices related to the mass and stiffness matrix of the FEM system.The whole MOR process is performed for just one expansion frequency and, what is more, does not introduce any frequency dependent terms in the macromodel matrices.Besides facilitating the solution of eigenproblems, this approach enables very efficient FFS, especially if a subsequent diagonalization of macromodels is intended, because it can be carried out entirely outside the frequency loop of the analysis bandwidth.
A very important advantage of macromodeling arises when optimization is employed in a design process.It involves a sequence of repeated simulations preceded by some structure modifications, which are usually carried out within small subregions that may be represented by macromodels.As presented in [14,17,18], in each step, only a single macromodel that is in the subregion being currently modified needs to be re-generated, while the rest remain unchanged.Since the generation of macromodels is the most costly part of the simulations, the total optimization time can be significantly reduced.Regarding this as a temporal macromodel reuse, one may also consider a similarly beneficial spatial reuse.That strategy is called macromodel cloning and is proposed in [10].If the analyzed structure comprises repeated subregions of the same geometry and materials, the macromodels need to be created only once for each group of them and they will be multiply copied in the final system.
The macromodels are represented by dense matrices being significantly smaller than their corresponding sparse FEM matrix blocks.Therefore, the overall time of the solution obtained by means of MOR is much shorter than in the original FEM problem, so that most of the computation cost is shifted to the preparation of macromodels itself.However, the residual solution cost of the final reduced-order system becomes significant in larger structures with increasing number of macromodels.Moreover, if the subdomains are large, the macromodels require higher reduction orders, which immediately increases their size and additionally slows down the solution.Consequently, the solution time becomes comparable to the reduction time and may even largely exceed it, particularly if any form of macromodel reuse is applied, as in the case of cloning or optimization.Another situation when the solution time of the reduced system may be comparable to the time of MOR is an automatic selection of reduction order q [19], in which the system is solved repeatedly at each try of increasing value of q.
To mitigate these effects, a new approach to the model-order reduction in FEM problems is proposed in this paper, which leads to the final system with a diagonal matrix, and thus significantly accelerates its solution.The existing diagonalization procedures are not satisfactory for the reasons explained below.A direct diagonalization of the original global FEM system involving orthogonal decomposition of large sparse matrices is possible but practically useless, because it would produce intermediate matrices that are equally large but dense, causing huge memory demand.Among the techniques to overcome this problem, the mass lumping [20] is most popular, but originally restricted to FEM in the time domain.As performed element-wise, the diagonalization is very efficient but limited to the mass matrix only.This is sufficient, however, in the time domain, because it is only the mass matrix that needs to be solved at every time step.An attempt to adopt the mass lumping technique in the frequency domain has been very recently reported in [21], where the time-domain FEM formulation with diagonalized mass matrix is transformed directly to the frequency domain only in order to improve convergence of the iterative solution to the FEM system by efficiently limiting its spectral radius to less than one.It results in good convergence speedup and accuracy, but has been demonstrated for a very narrow bandwidth, limited to less than 1% only.
The novel procedure proposed in this paper is intended for direct solution of the system of equations which is more suitable for the reduced system consisting of small and dense matrices.It brings a substantial advantage of diagonalization to the frequency-domain FEM analysis by adopting the simultaneous diagonalization of mass and stiffness matrix in the model-order reduction framework.To maintain the diagonalization cost as low as possible, it is carried out in the subdomains after the local MOR.As the macromodel matrices are already dense and small, the problem of excess memory demand does not occur.Owing to the accordingly chosen MOR technique, they are also frequency independent, and therefore the diagonalization is needed only once in the whole bandwidth.The resulting diagonalized macromodels are combined in the final system in a way that allows for cloning.Although only a part of the resulting system matrix becomes diagonal, it dominates so that a dedicated solver, efficiently adopting the Schur complement technique, is proposed to achieve additional solution speedup.Apart of the above mentioned main novelty aspects, a new simultaneous diagonalization algorithm, alternative to that of Laub [22], is proposed.Moreover, three different methods of matrix orthogonal decomposition are considered for the simultaneous diagonalization and compared to propose an optimal combination of them with respect to diagonalization time.
The outline of the rest of the paper, which also reflects the sequence of steps in the proposed procedure, is as follows.Section 2 starts with a brief formulation of FEM for the wave equation as a background of the presented analysis.Then, the procedure of domain partitioning is presented in the perspective of the subsequent MOR and diagonalization in separate subdomains.The final assembly of the global system is not included, as it will be done only after diagonalization.Section 3 presents the port compression and MOR techniques adopted from Fotyga et al. [16] in such a way that the macromodels diagonalized in the next step will be disconnected and thus ready for cloning.In the end of this section, it is depicted how the diagonalized macromodels are created by means of the simultaneous diagonalization algorithms and subsequently assembled in the global reduced system.A dedicated solver based on the Schur complement technique for the resulting partially diagonal system is also proposed.The results of numerical validation with respect to accuracy and speed are presented and discussed in Section 4.

FEM Formulation
Consider a waveguide component being a source-free arbitrarily shaped region Ω bounded by a surface Γ. Ω is loaded with lossless media of arbitrary distribution of electric permittivity ε and magnetic permeability µ.The time-harmonic electric and magnetic fields E and H at angular frequency ω in Ω are governed by the Maxwell equations: For the finite element method used in the presented analysis, they are transformed to the vector wave equation with respect to E: In the boundary-value problems with excitations, which is the case in this analysis, Γ comprises N Pn input/output ports denoted as P n .They are referred to as external ports, on which the fields related to excitations and loads are defined as the boundary conditions.On the remaining parts of Γ, denoted as Γ 0 , perfect electric conductor (PEC) is assumed.
In the finite element method based on Galerkin approach, the following weak formulation is derived from Equation (2) and the boundary condition in the external ports [12,23]: where n n is a normal vector on the nth port, H tn is the distribution of the tangential magnetic field on the nth port, µ r , ε r are relative permeability and permittivity, k 0 = ω √ µ 0 ε 0 is the wavenumber, and w is a vector testing function.The right-hand side represents excitation of the analyzed structure.Distribution of the electric field E is sought as an approximate solution to Equation (3) in a finite dimensional subspace of vector basis functions w i .The testing functions w are chosen to be the same as these basis functions, and thus the electric field E is approximated by the following general expansion: where e i are the coefficients being the unknowns in the linear system of N equations to which the FEM procedure eventually leads.The basis functions w i are defined piecewise in finite elements, into which the entire domain is divided.Usually tetrahedrons are used as they are the simplest shapes able to approximate arbitrarily complex 3-D geometries.Further steps of the FEM procedure begin thus with discretization of Ω by means of a 3-D tetrahedral mesh.At the same time, a 2-D triangular mesh is also defined as a collection of faces of the tetrahedrons adjacent to all surfaces present in the domain.They include the surfaces of all physical features and also any fictitious boundaries between subdomains into which Ω will be partitioned.The first-order finite elements are assumed in this analysis, however, the proposed technique with diagonalized macromodels is also applicable to higher-order FEM.In a single tetrahedron, six first-order vector basis functions are associated with the six edges and are defined as: where m is an edge number, α i and α j are the simplex coordinates associated with nodes i and j of the edge m, and L m is the length of this edge [24].Equation (4) splits into separate tetrahedrons Ω (t) and 3-D electric field vectors in each of their volumes are approximated as: This expansion can also represent 2-D distribution of the tangential field components on surfaces if three-element subsets of w m for the corresponding triangular faces of tetrahedrons are considered.
Substituting the field expansion in Equation ( 6) into Equation (3) yields the set of linear systems of equations for a collection of tetrahedrons Ω (t) , which can be rewritten in matrix form: where e is the element vector of unknown coefficients e m in Equation (6).The element stiffness and mass matrices K (t) and M (t) , respectively, and the right-hand side vector of excitation b (t) are given by the following integrals: where i, j = 1, 2, . . .6 are the matrix indices corresponding to tetrahedron edges, w i , w j are the testing functions as defined in Equation (5), n is a unit normal vector, and S n are the faces of tetrahedrons that belong to the external port P n in which the distribution of tangential magnetic vector H tn is defined.For the elements not adjoining any of the external ports, b (t) = 0.

Domain Partitioning
To create M macromodels in the process of model-order reduction, Ω is divided into M non-overlapping subdomains Ω k , as shown in Figure 1a for M = 3.They are separated by the interface surfaces P m , which we call internal ports.Although P m are internal in Ω, their edges may connect to Γ 0 .The internal ports are not physical boundaries but only fictitious geometrical object introduced in order to partition the domain.Therefore, the only boundary conditions to be considered on the P m express the field continuity, exactly as it is on the interfaces between individual finite elements.The mesh can be generated according to two different strategies: in a single run for the whole structure or separately in the subregions.In the former, the triangular mesh on the internal ports as well as the FEM basis functions on their tetrahedron faces are naturally shared by the adjacent subdomains.Consequently, the corresponding expansion coefficients of the fields are the same on P m , which guarantees continuity of the tangential electric vectors E t .In addition, the normal components of the electric induction D n = ε E n remain continuous on the internal ports, which is due to the natural boundary conditions at the interfaces between tetrahedrons inherently present in the FEM weak formulation.To proceed with FEM and MOR separately in each subdomain, the internal ports have to be split, as shown in Figure 1b, in such a way that two identical copies of the surface mesh are assigned to the disconnected subdomains.Despite this, the aforementioned boundary conditions for the continuity of the fields will be fulfilled also on the separated interfaces without the need to impose them explicitly, because the internal ports will eventually reconnect during the assembly of the final system of equations.
The second meshing approach is preferred when cloning of macromodels is intended.The subdomains are separated at the internal ports before mesh generation.However, entirely independent 3-D meshing in each subregion would result in non-conforming 2-D meshes to meet when the ports are reconnected.To avoid such situations and have the boundary conditions fulfilled as straightforwardly as in the first strategy, the procedure is modified and divided into two steps.In the initial step, only 2-D mesh is created on all surfaces in the entire structure.If needed, the meshes on the selected internal ports are copied into the places where the cloned macromodels are to be embedded later.The mesh on the internal ports is duplicated to build separate boundary of 2-D mesh enclosing each unique subregion.The other subregions are regarded as non-unique, and do not require 3-D meshing, because the macromodels will be cloned into them.Next, these surface meshes are used to initiate generation of 3-D mesh in the enclosed volumes.This procedure can be easily applied using the advancing front technique of mesh generation implemented in the NETGEN [25] software that is chosen for this analysis.

Local Matrix Assembly
The element matrices and vectors in Equation ( 7) are assembled over each disconnected subdomain Ω k separately, leading to the following M local FEM matrix equations with large sparse matrices: for k = 1, 2, . . .M. The term local refers here to the subdomains, not just individual finite elements, and is used to oppose to a global system of equations in standard FEM procedure for an unpartitioned domain.Although not necessary in the proposed macromodeling technique, the FEM global system can be assembled from Equation ( 9) as well.
To proceed with subsequent model-order reduction, the local FEM matrices in Equation ( 9) are reordered in such a way that the unknowns associated with all subdomain's port are grouped together.As a result, the matrices in Equation ( 9) split into the following blocks: The matrices K Pk , S Kk , M Pk , and S Mk , being a common representation of all ports on the boundary of Ω k (∂Ω k ) combined together, split further into the blocks corresponding to each separate port P ik in the following way: where the indices ik correspond to a port whose global number is i and which belongs to the boundary ∂Ω k .The port numbers i represent a consecutive global numbering of the external and internal ports P n and P m such that i ∈ {n, m}, where n = 1, 2, . . .P 0 and m = P 0 +1, . . ., P. For i = m, the internal port P m splits into two ports P ik1 and P ik2 associated with the adjacent subdomains Ω k1 and Ω k2 , respectively.For i = n, an external port P n remains assigned to a single subdomain so that P ik ≡ P n .Using the definitions of matrix blocks in Equation (11), Equation ( 10) can be rewritten in the final form: Let us illustrate this procedure with the exemplary domain partitioning scheme presented in Figure 1a.There are two external ports P n ∈ {P 1 , P 2 } and two internal ports P m ∈ {P 3 , P 4 }, which connect three subdomains.Split ports P m on the separated subdomains are shown in Figure 1b.The mass matrices K k derived for each subdomain take the following form: For the stiffness matrices M k , the above formulas apply if the letter K is replaced by M.

Model-Order Reduction with Diagonalized Macromodels
The entire model-order reduction process is performed locally in each subdomain.As a result, the reduced-order macromodels are created to replace the corresponding matrix blocks in the local FEM system of equations in Equation (12).Consider a single subdomain, as defined in Section 2.2, enclosed by physical boundary Γ 0 and the ports, both external and internal.For the sake of clarity, we temporarily, until the final matrix assembly, omit the global numbering, and thus any subdomain under consideration is now denoted as Ω instead of Ω k .The local ports in each Ω is denoted as P l , for l = 1, 2, . . .L, while P represents a collection of all them.In this perspective, Equation (10) from which subscripts k are removed is a starting point for the presented procedure.

Port Compression
The size of a macromodel and also the computational cost of MOR procedure grow with the size of its all ports p counted as their overall number of degrees of freedom (DOF).It is, therefore, desirable to decrease p before the subsequent steps of MOR.A geometrical reduction of DOF by local mesh coarsening at the ports is presented in [15] for 2-D problems.That technique is simple but inefficient in 3-D FEM analysis, for which a different approach based on orthogonal projection, referred to as port compression, is proposed in [16].The 2-D FEM basis on each port surface is projected onto a new subspace defined by a basis of orthogonal functions F l .As they span entire port surface, much fewer functions in F l are needed than those in FEM basis W l .To simplify definition of the new basis functions, the ports are selected as surfaces conforming to basic coordinate systems, such as Cartesian, cylindrical or spherical.Each port may use its own 2-D local coordinates (q 1 , q 2 ) being a subset of a locally chosen 3-D system.
The tangential electric vector in port P l is expanded into the series of the orthogonal functions from the basis F l = { f 1 , . . ., f j , . ..}.For f j ∈ { i 1 f 1j (q 1 , q 2 ), i 2 f 2j (q 1 , q 2 )}, where ( i 1 , i 2 ) are unit vectors of the coordinates (q 1 , q 2 ), j = 1, . . ., N Pl and N Pl = N 1 + N 2 , this expansion is as follows: where e Pl is the vector of coefficients {e 1i , e 2i } corresponding to DOF in the compressed port P l .Similarly, the tangential fields have also been expanded by means of the FEM basis W l = { w 1 , . . ., w i , . ..},where w i are 2-D FEM basis functions on the surface of port P l as defined in Equation ( 5), i = 1, . . ., N Pl and vector e Pl represents DOF in the original FEM port: Then, the port compression can be written as the following projection: The projection basis F l is a N Pl × N Pl matrix of N Pl functions in the continuous basis F l discretized on N Pl -element FEM basis W l : M Pl is a local mass matrix defined as M Pk in Equations ( 8) and (11).For a collection of all ports P on ∂Ω, the projection in Equation ( 16) reads as: and is applied to all matrix blocks in Equation ( 10), yielding the system with compressed ports P: where Although subscripts k have been omitted for the sake of clarity, this equation applies to each separate subdomain Ω k .The overall number of DOF in the original and compressed ports of a single subdomain is N P = ∑ l N Pl and p = N P = ∑ l N Pl , respectively, where N P N P .Assuming that each subdomain has k p ports compressed to p 0 DOF each, p = k p • p 0 .
On the ports whose surfaces are constrained by physical boundaries of the structure, the trigonometric expansion with sine and cosine functions is possible as Equation (14).A typical example of such case is the cross section of a rectangular waveguide where Equation ( 14) is equivalent to the modal expansion by means of p 0 waveguide modes.The sine and cosine functions are also a natural choice on a cylinder or sphere along full turns of angular coordinates.For the surfaces without prescribed boundary conditions, such as walls of a floating cube box, one may resort to Lagrange polynomials.More details on the port compression by means of the aforementioned expansions are in [16,26].

Model-Order Reduction
The FEM model-order reduction procedure used in the proposed macromodeling technique was previously proposed by Fotyga et al. [15] and Fotyga et al. [16] for 2-D and 3-D problems, respectively.It utilizes the techniques developed for the second-order linear time-invariant systems (LTI) [3], where the response is defined as a matrix transfer function H(s) between inputs and internal states in the Laplace domain.To adopt this approach in FEM for the local MOR, this transfer function is derived from the system in Equation (19) for each subdomain as follows: for the complex frequency parameter s = jk 0 = jω/c.The objective of MOR is to create the reduced-order model representing the following transfer function: H(s) is supposed to approximate the transfer function in Equation ( 21) in a limited frequency range in order to capture the behavior of the fields in Ω with respect to its ports P by significantly fewer unknowns e Ω .To this end, the original vector of unknowns e Ω , whose length is N Ω , is projected onto a new N Ω -dimensional solution space by means of an appropriately constructed basis V being a N Ω × N Ω matrix of orthonormal vectors such that N Ω N Ω .The N Ω × N Ω matrices K Ω and M Ω become dense but are significantly smaller than K Ω and M Ω .They are regarded as the macromodel, which represents the transfer function H(s) when considered together with their respective coupling matrices S K and S M .The macromodel size is denoted as N m ≡ N Ω .Equation ( 22) is transformed back to the matrix form (Equation ( 19)) yielding the reduced system: where the projection by means of V is performed as: To generate the reduction basis V, we employ the Efficient Nodal Order Reduction (ENOR) algorithm proposed in [27] for multiport RLC circuits, which refers to the transfer function of the form: This algorithm can by applied to the FEM transfer function in Equation ( 21) for the following substitution: The columns in V are combined as first q block moments of e Ω expanded around a frequency s 0 , each of them being a N Ω × p matrix.By increasing the reduction order q, the accuracy of the macromodel improves but it is achieved at the price of higher numerical cost of its generation and larger macromodel size N m = qp.Higher orders q are needed for larger subregion and stronger field variations, which is usually due to discontinuities and small geometric features within the subdomain.
As the projection basis, V is computed for a single expansion frequency s = s 0 ; it is frequency independent and valid within a certain bandwidth depending on the required accuracy.The projection in Equation ( 24) is also performed only once in this bandwidth and does not introduce any frequency dependent terms in the macromodel matrices.For this reason, and since they are very small, the high cost of macromodel generation can be compensated by fast frequency sweeps during the solution of the final system of equations, which eventually leads to a significant analysis speedup.
In structures which comprise repeating subregions, cloning of macromodels is possible.It means that the reduction is not performed in the subdomains represented by the macromodels that have already been created elsewhere and can be copied to different locations.The matrices and vectors corresponding to the cloned macromodels are added in the global system after the diagonalization.This allows for substantial reduction of the overall computational cost of MOR and diagonalization.

Diagonalization
In the time-domain FEM, only mass matrix needs to be diagonalized in order to accelerate the solution time stepping.This technique, known as mass lumping [20], involves an efficient diagonalization performed on the element level.For the diagonalization to accelerate the system solution in case of the frequency-domain FEM, one must resort to more complex algorithms, which diagonalize both mass and stiffness matrices.They are known in the literature as simultaneous diagonalization [22,28] and usually refer to generalized eigenvalue problems.As they all involve orthogonal decomposition of matrices in the global system, which produces dense intermediate matrices, their numerical cost in case of large sparse FEM systems would exceed the solution cost of the original FEM problem.Based on the same generic framework, a new approach is proposed, which brings a substantial advantage of diagonalization to the frequency-domain FEM analysis by combining it with local model-order reduction.To this end, the diagonalization is performed locally in separate subdomains, prior the global matrix assembly, and applies only to the macromodel matrix blocks K Ω and M Ω in Equation (23).The remaining blocks (K P , S K , M P , S M ) can be omitted, because they are much smaller than the macromodel blocks.Although they are present in the system matrix after diagonalization, their influence on the solution time is rather small and will be even further reduced by appropriate decomposition of the final system with the Schur complement technique.It should be noted that, due to the frequency independence of the macromodels, the diagonalization is extremely efficient, as it is performed only once for the whole analysis bandwidth.
As a result of the diagonalization, Equation ( 23) transforms into the following system: where D K and D M are diagonal matrices representing the diagonalized macromodel.Along with other blocks denoted by subscript D, which are new in this system, they are computed in one of the following algorithms. Diag-I Ω and compute and make the following substitutions in Equation ( 26): where I P and I M are unit matrices of the size of M P and M Ω , respectively.

Compute
4. Compute Q K and Λ K as the orthogonal decomposition Derive and calculate the following matrices as a result of left multiplication of Equation (26) by diag(I P , Q T K ): Diag-II

Perform the orthogonal decomposition M
= I M as a result of left multiplication of Equation ( 23) by diag(I P , X) and make the following substitutions in Equation ( 26): X S M → S MD , X S K → S KD , X e Ω → e D , where I P and I M are unit matrices of the size of M P and M Ω , respectively.

Compute K
Derive and calculate the following matrices as a result of left multiplication of Equation ( 26) by diag(I P , Q T K ): Diag-I is an original algorithm, while Diag-II adopts an approach similar to those presented in [22,28] for generalized eigenvalue problems.Both consist of the same Steps (3)-( 5), which provide diagonalization of stiffness matrix K Ω .The main difference between these algorithms lays in Steps ( 1) and ( 2) where in Diag-I there is an inverse of square root of the mass matrix M Ω instead of its orthogonal decomposition.It should be noticed, however, that usually the algorithms for matrix square root involve orthogonal decomposition.In this respect, both procedures should have comparable computational costs to which the orthogonal decompositions contribute most.To perform orthogonal decomposition, the following algorithms are possible: • eigendecomposition (EVD) QΛQ T , where diagonal matrix of eigenvalues Λ and orthogonal matrix of eigenvectors Q are directly used in the presented algorithms; • singular value decomposition (SVD) UΣV T with the substitution Λ = Σ and Q = U; and • Schur decomposition (SchD) QUQ −1 with Q being used directly and the diagonal of triangular matrix U being substituted for Λ.

Assembly and Solution of the Global System
Now, the matrices of the global system of equations are being finally assembled from Equation (26) rewritten for each subdomain Ω k by restoring subscripts k and i to indicate subdomain and its ports in the same form as in Equation ( 12).The resulting system takes the same form as Equation (26), in which all terms are denoted with superscript (a).They are defined as the following matrices: P = diag(. . ., K Pi , . ..), M P = diag(. . ., M Pi , . ..), where i = 1, . . ., P, k = 1, . . ., M and S KDik = S MDik = 0 for P i ⊂ Ω k .The same assembly procedure can be applied to the local FEM matrices in Equation (12).Although not required, this would provide a standard FEM system useful as a reference for investigation of the efficiency improvement and accuracy of the proposed procedures of MOR and diagonalization.
If there is a group of identical subdomains, macromodel cloning is recommended.In only one of them, 3-D mesh, FEM matrices and the diagonalized macromodel need to be created.The remaining subdomains are represented by multiple copies of this macromodel to be introduced as their respective submatrices in Equations ( 27)-(29).For example, if in the structure in Figure 1 all subdomains are assumed identical, Ω 1 may be selected as the one where the primary diagonalized macromodel is to be created.Subdomains Ω 2 and Ω 3 are represented by the matrices which are copied from Ω 1 in the following scheme depicted for the stiffness matrix K (it is the same for the mass matrix The resulting assembled system is solved in the following compact form: where e P = e MD .It is worth mentioning that this is the first time the frequency sweeping is involved, which means that computationally demanding operations of MOR and diagonalization have been performed just once for the whole frequency range.For the same reason, it also means that the subsequent procedures of solving Equation (30) are inside the frequency loop, and therefore should be thoroughly optimized.
The size of square blocks P and D is N To take full advantage of the fact that the dominating block in the system is just a diagonal matrix, we propose to partition Equation (30) and solve it with respect to e P and e D separately by means of the Schur complement technique [28] in the following procedure: Schur-complement based solver (Schur solver).2) is as trivial as a scalar division by its entries.Thus, time complexity of the Schur complement creation reduces to approximately one matrix multiplication S T S D involving very narrow matrices.Due to significantly smaller size of P and e P compared to D and e D , the solution of the system in Step (3) is even less costly.Step ( 4) is denoted as optional, because in most cases only the response with respect to the ports of analyzed structure is sought after.Therefore, computation of e D can be omitted to provide additional savings, if only the field distribution inside the domain volume is not needed.
It should be noted that the presented algorithm of the Schur solver does not involve direct domain partitioning and is performed in the global system at once.This is because the diagonal matrix D dominates in the system and its inversion scales perfectly with the number of subdomains, while P is small enough not to be worth partitioning.The only operation where some kind of partitioning can be successfully applied is the matrix multiplication S T S D .Since the coupling block S consists of sparsely distributed small dense matrices S ik = S KDik − k 2 0 S MDik (S D has the same structure), a dedicated block multiplication is proposed instead of the generic MATLAB multiplication.Unlike the MATLAB routine, this procedure takes advantage of the fact that there are many empty blocks in S and omits them during the multiplication.The occupancy ratio of nonzero blocks in S for cascaded macromodels is 2M/(M(M + 1)), which for M = 10 equals 18%.It is therefore expected to provide additional time savings, especially for large macromodels.

Numerical Results
All numerical experiments were performed on a PC with i7-4510U CPU @ 2.6 GHz and 16 GB RAM (model PORTEGE Z30-A-1E1, Toshiba, Tokyo, Japan).The proposed procedures were implemented in MATLAB (R2016a, MathWorks Inc., Natick, MA, USA).The built-in matrix operations and the linear algebra procedures from MATLAB libraries were utilized, the most important of which are: mldivide (generic direct solver for sparse systems of linear equations which for the systems being investigated uses block LDL factorization), eig (eigendecomposition), svd (singular value decomposition), schur (Schur decomposition), and lu (LU factorization).
As a test structure, the rectangular waveguide loaded with periodically distributed pairs of metallic cylindrical posts in E-plane was chosen (Figure 2).It is divided into M cascaded subdomains in such a way that Ω 1 and Ω M are empty sections of the waveguide and each of Ω The posts create M − 3 resonators so that the structure behaves as a band pass filter, which provides large variations in the frequency response.This makes the structure suitable for an accuracy analysis of the presented methods.Moreover, it allows for easy changing of the number of macromodels and cloning.For maximal cloning, Ω 2 . . .Ω M−1 are kept the same and so are Ω 1 and Ω M .In this case, there are M 0 = 2 unique subdomains Ω 1 and Ω 2 , so the mesh as well as the diagonalized macromodels need to be created only therein (see Figure 2b).The remaining subdomains are represented by the macromodels that are copied from these unique ones and introduced in the global system matrix during the final assembly.To this end, 2-D meshes on the ports of Ω 1 and Ω 2 are the same.Denoting the macromodel in Ω i as MM i , the cloning scheme reads as follows: MM 1 → MM M and MM 2 → {MM 3 . . .MM M−1 }.The test structure for varying number of macromodels (M = 4 . . .12) was analyzed towards S-parameters for the excitation mode TE 01 in the frequency band covering all fundamental resonances-from 7 GHz, which is right above the waveguide cut-off, to 16 GHz.Frequency sweeping with 201 frequency points and 45 MHz step was chosen to capture correctly the S-characteristics with sharp resonances.The 3-D mesh consists of 4457 and 12,071 tetrahedrons in Ω 1 and Ω 2 , respectively, while the 2-D mesh has 238 triangles in each port.The port compression for the model-order reduction of the order q was performed using modal expansion by means of the first p 0 analytically defined waveguide TE modes.
To demonstrate accuracy of the model-order reduction, the plots of S 11 and S 21 for FEM with MOR (FEM-MOR) are compared with the plain FEM in Figure 3a.For this test, the structure with six pairs of posts and M = 8 subdomains was chosen.A more detailed insight into the accuracy of MOR is given in Figure 3b as S-parameter error plots for the error defined as follow: It expresses the distance in dB on a complex plane between S ij for the analyses being compared.For the reduction order q = 10 and the port size p 0 = 10, the errors are below −45 dB in the full frequency band (mostly below −55 dB), which is sufficient to make the S-parameter plots indistinguishable.The number of unknowns in the FEM system is N = 80,326 and N = 1690 for FEM-MOR, which results in the reduction ratio N/ N = 47.5.To investigate the accuracy of diagonalization, the S-parameters for FEM-MOR with diagonalization (FEM-MOR-Diag) were compared with FEM-MOR.The error defined similarly to Equation (32) was below −240 dB for both diagonalization algorithms and each of the possible orthogonal decompositions (EVD, SVD, and SchD), meaning the diagonalization contributes only a negligible addition to the error introduced by MOR itself, thus the plots in Figure 3 would look exactly the same if FEM-MOR were replaced by FEM-MOR-Diag.
The subsequent tests focused on the performance of macromodel diagonalization in terms of computational time.The diagram in Figure 4 defines the execution times of the steps leading from the original FEM system to the solution of the systems of equations at different stages of the procedure involving MOR and diagonalization.The blocks MM and MM-Sol represent the standard MOR procedure presented in [16] to which the proposed MOR with diagonalized macromodels was compared in the following tests.
In the initial approach, only the solution times t RS , t DS and t DSS were considered to show direct effects of diagonalization and the Schur solver.The block matrix multiplication was assumed in the Schur solver if not stated otherwise.The times t RS and t DSS are compared for M = 8 macromodels in Figure 5 as the plots versus port size p 0 and reduction order q.They both grew with the problem size N (dependent on q and p 0 ) similarly, however the diagonalization combined with the Schur solver clearly brought a substantial decrease of the solution time.For better insight into these effects, we analyzed solution speedup (speedup ratio), referred to as gain, which is defined as follows: diagonalization gain G DS = t RS /t DSS , Schur solver gain G S = t DS /t DSS .The latter is a speedup component in G DS contributed by the Schur solver used instead of the MATLAB solver.Since the diagonalization gain depends on two factors determining the problem size N-macromodel size 2qp 0 and the number of macromodels Mit is depicted as a pair of plots in Figure 6, versus p 0 and M, both with q as a parameter.The G DS plots show a significant speedup, which grows almost proportionally with q and p 0 , however, the influence of q on G DS is roughly twice as large as that of p 0 .It makes the proposed diagonalization procedure particularly attractive for problems with strong field variations within subdomains.What is more, G DS was almost independent of M-it deteriorateD very slowly with M increasing from 4 to 12, and thus diagonalization is suitable even for analysis with large number of macromodels.
The Schur solver gain plots versus p 0 are shown in Figure 7 for two options of matrix multiplication.The dedicated block matrix multiplication improved the Schur solver performance more for larger values of p 0 (Figure 7a), which makes it more appealing than the standard MATLAB matrix multiplication (Figure 7b).Although the latter brought higher gain for small p 0 , it is much less relevant, as the corresponding time savings expressed in absolute values were much smaller that those for large p 0 .When comparing G DS with the gain of diagonalization involving the standard MATLAB solver equal G DSS /G S , one may notice that the Schur solver brought an important contribution to the overall speedup.For instance, for the largest values of q and p 0 , the Schur solver increased the diagonalization gain from 9.2 to 40.5.For a more complete and practical view of performance of the presented procedures, the computational costs of MOR t R and diagonalization t D have to be taken into account.To this end, the following definitions of effective gains are introduced: When referring to the diagram in Figure 4, these effective gains correspond to different routes leading from different starting systems to all possible solutions.G DSeff compared the times needed to reach MM-Sol and DiagMM-SchurSol from the box MM (DiagMM-Sol was disregarded as always t DSS < t DS ).When starting from the original FEM system, two routes were compared with t FS : one leading via MM and the other one via MM and DiagMM.They defined G Reff and G RDSeff , respectively, and expressed the overall performance improvement brought by MOR and MOR with diagonalized macromodels as compared to the plain FEM solution.
The solution time of the original FEM system by means of the MATLAB solver mldivide t FS was 104, 372 and 638 for 4, 8 and 12 subdomains, respectively.All times are given in seconds if not stated otherwise.The reduction time for the unique subdomains Ω 1 and Ω 2 is t R1 = 0.51 and t R2 = 1.82, respectively, for q = 10, p 0 = 10.Since the diagonalization time depended only on the macromodel size N m = 2qp 0 regardless of the physical properties of its corresponding subdomain and was the same for all macromodels, it is presented for just one of them and denoted as t D1 .Figure 8a compares t D1 for all possible combinations of the orthogonal decompositions-SVD, SchD, and EVD-used in Steps ( 1) and ( 4) of the diagonalization algorithm Diag-I.As the plots in Figure 8b show for the best two options-SchD-SVD and EVD-SVD-both algorithms Diag-I and Diag-II had almost the same performance.For further analysis, the configuration SchD-SVD-Diag-II was chosen.The overall MOR and diagonalization times t R and t D depended on how much macromodel cloning was involved in the analysis, which eventually influenced the aforementioned effective gains.To demonstrate the role of cloning, the following two options were considered: • Maximal cloning possible in the analyzed structure: only M 0 = 2 original macromodels were generated in Ω 1 and Ω 2 out of all M macromodels used: t D = 2t D1 , t R = t R1 + t R2 .• No cloning: MOR and diagonalization was performed in all M subdomains: The computation times defined in Figure 4 and the derived gains are summarized in Table 1 for small, medium and large macromodels (N m = 72, 200 and 360, respectively) and the above-mentioned cloning options.The influence of the macromodel count (M = 8 and 12) is presented for their medium size, which corresponds to the S-parameters characteristics in Figure 3.More detailed profiles of the resultant gain of the proposed analysis G RDSeff are presented in Figure 9 for maximal cloning and Figure 10 for the case without cloning.
Table 1.The times of MOR, diagonalization and solution of the systems of equations (in seconds) along with MOR and diagonalization gains for selected combinations of q, p 0 , M and two options of cloning.The influence of diagonalization cost on the effective diagonalization gain depended mostly on cloning-G DSeff decreaseD with respect to G DS less than 1.2 times with cloning and not more than 2 times without it, thus G RDSeff = 20.1 for the largest macromodels.The most appealing result is that the overall effective speedup of MOR with diagonalization G RDSeff could reach extremely high rates for small macromodels, i.e. as large as 369.5 times for N m = 2 q p 0 = 72 (q = 6, p 0 = 6), if the maximal possible cloning was involved.It increased from 183.2 times for MOR only, owing to the diagonalization.Without cloning, G RDSeff reduceD down to 81.4; however, this was still a very high speedup rate.This effect was mostly related to the loss of G Reff itself, because, although the lack of cloning affected equally the resultant costs per macromodel of both MOR and diagonalization, the influence of t R prevailed as it is much larger than t D .

Cloning
More details regarding the influence of cloning on the resultant effective speedup can be seen by comparing the plots in Figures 9 and 10.Although cloning apparently improved the overall performance, it nearly did not change the degree of influence of the macromodel size on G RDSeff .The plots in Figures 9a and 10a look almost the same and they are scaled by the factor approximately equal M/M 0 .The analogous relations regarding the number of macromodels M were rather different.In the case of cloning, G RDSeff increased largely with M, because the cost of MOR and diagonalization per macromodel decreased (Figure 10b).Without cloning, this did not occur and G RDSeff remained almost constant with respect to M with only slight tendency to increase (Figure 10b).This is an important property, as it shows that, even without cloning, which was the worst case, the resultant performance was not affected by the number of macromodels, and thus the proposed approach is advantageous even for the structures partitioned into an increasing number of subdomains.
Another important observation can be made concerning the dependence of the effective gains on macromodel size.G RDSeff decreased severely due to the growth of the macromodel generation cost t R and the solution time t RS .This effect was inherited after G Reff but was partly compensated by the diagonalization, owing to two factors:G DSeff was large enough that t DSS + t D t RS and it grew with N m .As a result, for N m changing from 72 to 360, the initial decrease rate of G Reff equalled 21 times (with cloning) or 11.3 times (without cloning), reducing to 6 times for G RDSeff .Favoring larger macromodels, the diagonalization brought larger relative speedup where it implied more absolute time savings, and thus was needed most.For instance, in the case of the results presented in Figure 3 (q = 10, p 0 = 10, M = 8, cloning with M 0 = 2), which involved medium size macromodels, the initial time of the FEM solution was reduced from over 6 min (t FS ) to less than 3 s for MOR with diagonalization (t RDS = t R + t D + t DSS ).It may be concluded that the diagonalization was particularly beneficial if macromodels represented large subdomains having complex shapes.This effect was even stronger if cloning was applied: the improvement ratio of the effective gain due to the diagonalization G RDSeff /G Reff for the largest macromodels (N m = 360) and M = 8 was 2.3 and 7.1 without and with cloning, respectively.
To show the performance of the proposed methods in the case of tuning, the following scenarios were investigated.The structure presented in Figure 2 for M = 8 subdomains was regarded as an initial design of a waveguide bandpass filter, which may be tuned by modifying the subdomain lengths L k and the distances between the posts g k .The structure was assumed symmetrical with respect to its external ports, thus the following subdomains and their macromodels were identical: Ω 2 = Ω 7 , Ω 3 = Ω 6 , Ω 4 = Ω 5 .Consequently, there were six possible independent tuning parameters: L 2 = L 7 , g 2 = g 7 , L 3 = L 6 , g 3 = g 6 , L 4 = L 5 , g 4 = g 5 .Due to the symmetry, half of all M macromodels were cloned, thus M 0 = 4.The MOR parameters were p 0 = 10 and q = 10, corresponding to the S-parameter characteristics depicted in Figure 3a.Two scenarios of tuning were considered: (A) with all N v = 6 tuning parameters in all M v = 3 modified subdomains; and (B) with N v = 2 in a single subdomain (M v = 1).The tuning procedure involved a simplified gradient optimization algorithm.In each iteration (out of K), two steps were performed: (1) estimation of the goal function gradient by small perturbations of all N v tuning parameters; and (2) calculation of a new parameter vector.In the case of MOR, Step (1) required N v solutions performed in just one subdomain, whereas, in Step (2), all M v macromodels were recalculated.
The results are presented in Table 2 for two different iteration counts K.The computation times and gains were defined as in previous tests, but the times were cumulated throughout all K tuning iterations.Consequently, the resultant cost of MOR, diagonalization and solution of the final system was: t FS for the plain FEM, t RSS = t R + t RS for FEM-MOR and t RDS = t R + t D + t DSS for FEM-MOR with diagonalized macromodels.The corresponding effective gains show that the speedup brought by MOR and MOR with diagonalization was significant and comparable to that of a single simulation for maximal cloning (M 0 = 2) with the same p 0 , q, M, as presented in Table 1.What is also important, the speedup was not or hardly dependent on the iteration count and the number of tuned parameters, which makes the proposed methods robust and thus suitable for CAD applications.
Table 2.The cumulated times of MOR, diagonalization and solution of the systems of equations (in seconds) along with MOR and diagonalization gains for two tuning scenarios, q = 10, p 0 = 10, M = 8, M 0 = 4, and for varying numbers of tuning parameters N v , modified subdomains M v and iterations K.In the final test, the structure depicted in Figure 2 for M = 8 was simulated by means of two commercial software packages-FEKO (version 2018-309, Altair Engineering, Troy, MI, USA) and EMPro (version 2019, Keysight Technologies, Santa Rosa, CA, USA)-which belong to the leading CAD tools for electromagnetic analysis based on FEM.Similar mesh size, first-order FEM basis function and 201-point frequency sweeping were set to make the results comparable with the analysis presented in this paper.

Scenario
In Figure 11, the S-parameter characteristics are compared and a view of the structure meshed in FEKO is depicted.The plots are almost indistinguishable, which additionally proves very good accuracy of the proposed method.The solution times for the commercial software and this analysis are compared in Table 3.For the FEM-MOR-Diag, the solution time was t R + t D + t DSS .The results show that even an in-house FEM software based on the proposed method of MOR with diagonalized macromodels could significantly outperform commercial FEM software packages.

Conclusions
A new technique of local model-order reduction (MOR) in 3-D finite element method (FEM) for electromagnetic analysis of waveguide components has been proposed to resolve the problem of increasing solution time of the reduced-order system combined from the macromodels in a decomposed domain.To this end, the diagonalized macromodels created by means of the simultaneous diagonalization are used to build the global system.To the best of the author's knowledge, diagonalized macromodels have not been used previously in FEM analysis to speed up the solution of the system obtained in MOR.The proposed approach is very efficient for two reasons: diagonalization is performed on small macromodel matrices and can be carried out just once in the whole bandwidth, which is owing to the frequency independency of macromodels.Although the resulting matrix is only partially diagonal, it can be solved very efficiently by a dedicated solver based on the Schur complement technique, which has also been proposed.The numerical validation of the proposed procedures with respect to accuracy and speed was carried out for an exemplary finite periodical waveguide structure partitioned into the macromodels of varying size and count and for different options of macromodel cloning.The results show that the introduction of diagonalized macromodels in this work provided a significant solution speedup without accuracy degradation.This makes an essential performance improvement in comparison to the work in [16], where similar methods of MOR and port compression are used.The solution time reduces to such extent that the resultant efficiency of the entire analysis becomes determined almost solely by the cost of MOR.It means that proposed technique eventually improves robustness of the model-order reduction with respect to macromodel size.Although the overall speedup of MOR with diagonalized macromodels still decreases with the growth of macromodels, the decrease rate is lower than that without diagonalization.The proposed technique is particularly beneficial, when the system solution time becomes comparable to the reduction time, which occurs for growing size and count of the macromodels.It also takes place when they are cloned in multiple locations of the structures or are used repeatedly in a tuning and optimization process, which makes the proposed technique desirable in CAD applications.

DP
= ∑ M k=1 N Ωk , respectively.The total number of unknowns after diagonalization is equal to those after MOR and for structures represented by cascaded macromodels (k p = 2) it is:= 2Mqp 0 + (M + 1)p 0 .(31)Diagonalmatrix D is the largest block in the system and usually N the factor 2q. If, for instance, the analyzed structure consists of M = 10 macromodels of order q = 10 and 11 ports compressed to p 0 = 10 DOF each, then N (a) P = 110, N (a) D = 2000 and N = 2110.

Figure 2 .
Figure 2. The test structure-rectangular waveguide loaded with pairs of metallic cylindrical posts in E-plane and divided into M subdomains Ω 1 . . .Ω M .The dimensions are shown in x-z plane view (a).The waveguide height is b.The perspective view (b) for M = 6 shows that the mesh is generated in Ω 1 and Ω 2 only.

Figure 3 .
Figure 3. S-parameters (a); and MOR error S 11err and S 21err (b) of the structure in Figure2for M = 8 subdomains and macromodels, port size p 0 = 10 and reduction order q = 10.

Figure 4 .
Figure 4. Diagram defining the time cost of MOR, diagonalization and solution of the systems of equations.Abbreviations: FEM, initial FEM system; MM, system with macromodels; DiagMM, system with diagonalized macromodels; -Sol, solution by means of the MATLAB solver mldivide; -SchurSol, solution by means of the Schur solver.

Figure 5 .
Figure 5. Solution time of the final system for M = 8 macromodels: original macromodels (a); and diagonalized macromodels and the Schur solver (b).

Figure 6 .
Figure 6.Gain of diagonalization with the Schur solver G DS : versus port size p 0 for M = 8 (a); and versus number of macromodels M for p 0 = 10 (b).

Figure 7 .
Figure 7. Gain of the Schur solver over the standard MATLAB solver for the system comprising diagonalized macromodels.The Schur solver is in two options: with (a) or without (b) the dedicated block matrix multiplication.

Figure 8 .
Figure 8. Diagonalization time t D1 for one macromodel of the size N m = 2qp 0 for the following configurations: algorithm Diag-I with all possible combinations of orthogonal decompositions (a); and both algorithms Diag-I and Diag-II with the best performing orthogonal decompositions (b).

Figure 9 .
Figure 9. Effective gain of MOR with diagonalization G RDSeff , with cloning: versus port size p 0 for M = 8 (a); and versus number of macromodels M for p 0 = 10 (b).

Figure 10 .
Figure 10.Effective gain of MOR and diagonalization G RDSeff , without cloning: versus port size p 0 for M = 8 (a); and versus number of macromodels M for p 0 = 10 (b).

Figure 11 .
Figure 11.S-parameters obtained in the commercial software and this analysis (a); and a view of the tested structure meshed in FEKO (b).

Table 3 .
The solution times for the commercial software and this analysis.