A Preconditioned Iterative Approach for Efficient Full Chip Thermal Analysis on Massively Parallel Platforms †

Efficient full-chip thermal simulation is among the most challenging problems facing the EDA industry today, especially for modern 3D integrated circuits, due to the huge linear systems resulting from thermal modeling approaches that require unreasonably long computational times. While the formulation problem, by applying a thermal equivalent circuit, is prevalent and can be easily constructed, the corresponding 3D equations network has an undesirable time-consuming numerical simulation. Direct linear solvers are not capable of handling such huge problems, and iterative methods are the only feasible approach. In this paper, we propose a computationally-efficient iterative method with a parallel preconditioned technique that exploits the resources of massively-parallel architectures such as Graphic Processor Units (GPUs). Experimental results demonstrate that the proposed method achieves a speedup of 2.2× in CPU execution and a 26.93× speedup in GPU execution over the state-of-the-art iterative method.


Introduction
The evolution of the manufacturing technology of Integrated Circuits (ICs) has continued unabated over the past fifty years, according to the predictions of Moore's law and has led to extremely complex circuits (modern processors contain several billion transistors and are easily the most complex human construction), but also to analogous escalation of the problems related to the analysis and simulation of such circuits.Therefore, thermal analysis is one of the most critical challenges arising from the technological evolution.The continuous effort for smaller sizes, in the sub-45-nm era, and greater performance, as well as the new 3D structures have begun to outpace the ability of heat sinks to dissipate the on-chip power.
In particular, aggravation of thermal effects is an inevitable consequence of the continuous scaling trend.High temperature has a significant impact on chip performance and functionality, leading to slower transistor speed, more leakage power, higher interconnect resistance, and reduced reliability [1].The problem becomes more pronounced in modern technologies due to the multilayer 3D stacking, and the use of new device technologies, like FinFETsand Silicon on Insulator (SOI), which are more sensitive to the self-heating effect [2].Furthermore, as heat generation is nonuniform, local hotspots and spatial gradients appear.Stacking multiple layers in a 3D chip promises density and performance enhancement.However, it requires extensive thermal analysis as the power density and temperature of these architectures can be quite high.For the above reasons, full chip thermal analysis is a vital, but extremely difficult problem due to the size of the systems that need to be solved for multiple time points and remains a key issue for future microprocessors and ICs [3,4].Due to this fact, IC thermal analysis problems have drawn considerable attention over the past two decades.To deal with these challenges, prior approaches have focused on the formulation of the problem and the fast steady-state and transient thermal simulation in order to compute the temperature across the whole chip.
Direct methods (based on matrix factorization) have been widely used in the past for solving the resulting linear systems, mainly because of their robustness in most types of problems.Unfortunately, these methods do not scale well with the dimension of the linear system and are prohibitively expensive, while the thermal problems are becoming larger, in both execution time and memory requirements.On the other hand, iterative Krylov-subspace methods such as Conjugate Gradients (CG) involve only inner products and matrix-vector products and constitute a better alternative for large sparse linear systems in many respects, being more computationally-and memory-efficient.
Moving beyond conventional direct solvers, our early work in [5] introduced an approach for full chip thermal analysis that is based on the Finite Difference Method (FDM) or the formulation of an RC equivalent electrical network in conjunction with a highly parallel iterative Krylov-subspace Preconditioned Conjugate Gradient (PCG) method, which overcomes the computational demands for the very large systems arising from the thermal modeling.In particular, the contributions of this paper to the problem of thermal analysis are:

•
Accelerated solution of thermal grids: The proposed thermal simulator uses FDM with preconditioned CG, which is well-suited, offers faster solution times, and uses less memory than sparse direct solvers.

•
Highly parallel preconditioned mechanism: The specialized structures of thermal grids allow the usage of fast transform solvers as a preconditioned mechanism in the CG method, which are highly parallel and can be easily ported to GPUs.

•
Fast convergence to the solution: Fast transform solvers can handle different matrix blocks, which offers a good preconditioner approximation.This results in considerably more accurate preconditioners that can approximate thermal grids and make CG converge to the final solution in a few iterations.
Experimental results demonstrate that our method achieves speedups of around 20× on GPU and around 2× on CPU for a 10 M node thermal grid over a state-of-the-art iterative method, like Incomplete Cholesky Preconditioned Conjugate Gradient (ICCG) on a CPU.
The rest of the paper, is organized as follows.Section 2 describes the related work on the thermal simulation problem.Section 3 introduces the thermal model that was used in the present work.Section 4 provides a brief description of the 3D fast transform solver.Section 5 describes the proposed approach, combining the methods presented in the two previous sections.Finally, Section 6 presents the results and a discussion about the advantages of the method, followed by the conclusions in Section 7.

Related Work
The growing need to simulate large-scale thermal models in technology nodes below 45 nm has led to some important research in the fast thermal estimation of the IC chips.In this section, we briefly review some of these methods.Most transient thermal analysis methodologies have so far relied on solving the entire system, using different modeling techniques, based mainly on the Finite Element Method (FEM), the Finite Difference Method (FDM), and Green's functions.The research work in [6,7] adopted the FDM method, with a multigrid approach in order to speed up the simulation process, and the FDM method with temporal and spatial adaptation to further accelerate thermal analysis was proposed in [8,9].Similarly, in [10], the full-chip thermal transient equations were solved in a similar manner using an Alternating Direction Implicit (ADI) method for enhanced computational efficiency.Furthermore, in [11,12], the FDM approach and the RCequivalent were used along with modeling of the fluids for micro-cooling 3D structures.In [13], FEM was adopted for 2D and 3D geometries along with a multigrid preconditioning method and automatic mesh generation for chip geometries.Finally, Green's functions were used in [14] with discrete cosine transform and its inversion in order to accelerate the numerical computation of the homogeneous and inhomogeneous solution.However, these method are efficient for a limited range of problems, since they have limited potential for parallelism.
Besides the previous conventional approaches, different methods like a Neural Net (NN) approach were used in [15], but since it was based in predictions, it did not always provide an accurate solution to the crucial problem of thermal analysis.Moreover, a Look Up Table (LUT) method based on the power thermal relation, which develops a double-mesh scheme to capture thermal characteristics and store the results in library files, was presented in [16].However, currently, chips can lead to huge library files due to the highly complex combined heat maps.Furthermore, the reduction of the problem size, through a Model Order Reduction (MOR) process, was proposed in [17,18], which can be useful in addressing the performance of individual devices, but in some cases, it is not enough to address all the reliability issues.
Finally, the authors in [19] provided a parallel iterative Generalized Minimal RESidual (GMRES) method for FDM and micro-cooling problems, but without any special preconditioning approach.Clearly, the concept of a dedicated fully-parallel preconditioned technique has not yet been introduced in the context of transient thermal analysis.

On-Chip Thermal Modeling and Analysis
There are three modes of heat transfer: conduction, convection, and radiation.The primary mechanism of heat transfer in solids is by conduction, and the others can be neglected.The starting point for thermal analysis is Fourier's law of heat conduction [20]: which states that the vector of heat flux density q (heat flow per unit area and unit time) is proportional to the negative gradient of temperature T at every spacial point r = [x, y, z] T and time t, where k t is the thermal conductivity of the material.The conservation of energy also states that the divergence of the heat flux q equals the difference between the power generated by external heat sources and the rate of change of temperature, i.e., where g(r, t) is the power density of the heat sources, c p is the specific heat capacity of the material, and ρ is the density of the material.By combining ( 1) and ( 2), we have: which may be rewritten as the following parabolic Partial Differential Equation (PDE): (normally accompanied by appropriate boundary conditions [21]).
A common procedure for the numerical solution of ( 4) is by discretization along the three spatial coordinates with steps ∆x, ∆y, and ∆z and substitution of the spatial second-order derivatives by finite difference approximations, leading to the following expression for temperature T i,j,k at each discrete point (i, j, k) in relation to its neighboring points: or by multiplying by ∆x∆y∆z: There is a well-known analogy between thermal and electrical conduction, where temperature corresponds to voltage and heat flow corresponds to current (see Table 1).In light of this analogy, Equation ( 6) has a direct correspondence to an electrical circuit where there is a node at every discrete point or cell in the thermal grid (see Figure 1).Every circuit node is connected to spatially-neighboring nodes via conductances in the directions x, y, z with values: and there is a capacitance to ground at every node or thermal cell with value: The heat sources constitute input excitations and are modeled in the equivalent circuit as the current sources with values: The above current sources are connected at the specific points (i, j, k) or circuit nodes where there is heat flow (i.e., power dissipation from the underlying chip logic blocks).The resulting electrical equivalent circuit is described in the time domain, using the Modified Nodal Analysis (MNA) framework, by a system of Ordinary Differential Equations (ODE): where G ∈ R n×n is a symmetric and positive definite matrix of the conductances (7), C ∈ R n×n is a diagonal matrix of cell capacitances (8), x ∈ R n is the vector of unknown temperatures T i,j,k at all discretization points (constituting internal states of the system), and u ∈ R p is the vector of input excitations from the current sources I i,j,k of (9).
For transient simulation, we can discretize the time interval into time instants t k , k = 1, 2, . . .and use the backward-Euler numerical integration method for the calculation of temperature in each discrete time instant t k : where . . is the time step of time t k (which may in general vary during transient analysis).The above equation involves the solution of a very large sparse linear system in each time instant t k .Direct methods (based on matrix factorization) have been widely used in the past for solving the resulting linear systems, mainly because of their robustness in most types of problems.Unfortunately, these methods do not scale well with the dimension of the linear system and become prohibitively expensive for large-scale networks.Iterative methods involve only inner products and matrix-vector products and constitute a better alternative for large sparse linear systems in many respects, being more computationally and memory efficient.In this work, we employ iterative methods for the solution of large-scale networks arising from the 3D discretization of the chip for thermal analysis.The system matrices arising from the modeling of the thermal grid can also be shown to be Symmetric and Positive Definite (SPD), which allows the use of the efficient method of the Conjugate Gradient (CG) for the solution of the corresponding linear systems.The CG method is well known [22], and its implementation is shown in Algorithm 1.
Regarding the convergence rate of CG, it can be shown [23] that the required number of iterations (for a given initial guess and convergence tolerance) is bounded in terms of the spectral condition number k λ min (A) where λ max (A), λ min (A) are the maximum and minimum eigenvalues of A, respectively.This means that convergence of CG is fast when k 2 (A) ≈ 1 and slow when k 2 (A) >> 1.
To improve the convergence speed, it is necessary to apply a preconditioning mechanism, which transforms the initial linear system into an equivalent one with a more favorable spectral condition number.The so-called preconditioner is a matrix M that approximates A in some way, such that the transformed system M −1 Ax = M −1 b (which obviously has the same solution as the initial Ax = b) exhibits condition number k 2 (M −1 A) = k 2 (I) = 1.In practice, it is not necessary to invert the preconditioner M and apply it directly at the system Ax = b.It can be shown that the same thing can be accomplished by introducing an extra computational step within the iterative method, which entails solving a system Mz = r with known Right-Hand Side (RHS) vector r and unknown vector z in every iteration [23].
From the above, it follows that a good preconditioner M must satisfy two key properties:

•
The fast convergence rate of the preconditioned.• A linear system involving M is solved much more efficiently than the original system that involves A.
where "more efficiently" can mean with less asymptotic complexity-ideally, an optimal or near-optimal complexity of O(N) or O(NlogN)-and/or significantly more parallelism in the solution procedure.If the preconditioner is faithful enough to reduce the iterations substantially, then the whole burden of the algorithm is transferred to the preconditioner-solve step Mz = r.The next section will describe the proposed form of the preconditioner matrices for 3D thermal networks, as well as the solution of the corresponding linear systems via two series of fast transforms and inverse fast transforms.

Fast Transform Preconditioners for 3D Thermal Networks
Recent implementations of fast transform solvers have shown great potential for the solution of block-tridiagonal systems with a special structure [24,25].This section describes such an algorithm for the solution of an appropriate preconditioner system Mz = r by the use of a fast transform solver in a near-optimal number of operations.
Let M be an N × N block-tridiagonal matrix with l diagonal blocks of size mn × mn each (overall N = lmn), where l is very small (typically 5-8 depending on the material layers (metal and insulator) of the chip), with the following form: where I mn is the mn × mn identity matrix and M i , i = 1, . . ., l, are block tridiagonal mn × mn matrices of the form: where I n is the n × n identity matrix and T i , i = 1, . . ., m are n × n tridiagonal matrices with the following form: This class of tridiagonal matrices has a beforehand known eigen-decomposition.Specifically, it can be shown [26] that each T i has n distinct eigenvalues λ i,j , j = 1, . . ., n, which are given by: and a set of n orthonormal eigenvectors q j , j = 1, . . ., n, with elements: Note that, the eigenvectors do not depend on the values α i and β i and are the same for every matrix T i .If Q n = [q 1 , . . ., q n ] denotes the matrix whose columns are the eigenvectors q j , then due to the eigen-decomposition of T i , we have By exploiting the diagonalization of the matrix T i and considering that Q T n Q n = I, the system Mz = r is equivalent to the following system: where: i,n ) are diagonal matrices with the eigenvalues of T i + γ i I n , T i + 2γ i I n , which are the following: If the N × 1 vectors r, z, r, z are also partitioned into m blocks of size n × 1 each, i.e., . ., m.However, it can be shown [27] that each product Q T n r i = ri corresponds to a Discrete Cosine Transform of Type-II (DCT-II) on r i , and each product Q n zi = z i corresponds to an Inverse Discrete Cosine Transform of Type-II (IDCT-II) on zi .This means that the computation of the whole vector r from r amounts to m independent DCT-II transforms of size n, and the computation of the whole vector z from z amounts to m independent IDCT-II transforms of size n.A modification of the Fast Fourier Transform (FFT) can be employed for each of the lm independent DCT-II/IDCT-II transforms [27], giving a total near-optimal operation count of O(mn log n) = O(N log n).
Algorithm 2 Preconditioner solution for the thermal grid.
1: Partition vector r into lm sub-vectors r i of size n, and perform DCT-II transform (Q T n r i ) on each sub-vector to obtain transformed vector r.2: Partition vector r into l sub-vectors ri of size mn, and permute each subvector by permutation P, which orders elements as 1, n + 1, . . ., (m − 1)n + 1, 2, n + 2, . . ., (m − 1)n + 2, . . ., n, n + n, . . ., (m − 1)n + n, in order to obtain vector rP 1 .3: Partition vector rP 1 into ln sub-vectors ri P 1 of size m, and perform DCT-II transform (Q T m ri P 1 ) on each sub-vector to obtain transformed vector r.4: Permute vector r by applying permutation P 2 , which orders elements as 1, mn + 1, 2mn + 1, . . ., (l − 1)mn + 1, 2, mn + 2, 2mn + 2, . . ., (l − 1)mn + 2, . . ., mn, mn + mn, 2mn + mn, . . ., (l − 1)mn +mn, in order to obtain vector rP 2 .5: Calculate elements of matrices Tn,m , and solve the mn tridiagonal systems, in order to obtain vector zP 2 .6: Apply inverse permutation P T 2 on vector zP 2 so as to obtain vector z.7: Partition vector z into ln sub-vectors zi of size m, and perform IDCT-II transform (Q m zi ) on each sub-vector to obtain vector zP 1 .8: Partition vector zP 1 into l sub-vectors zi P 1 of size mn, and apply inverse permutation P T on each sub-vector to obtain vector z.9: Partition vector z into lm sub-vectors zi of size n, and perform IDCT-II transform (Q n z i ) on each sub-vector to obtain final solution vector z.

Methodology for Full Chip Thermal Analysis
This section applies the theoretical background that was analyzed before for the computation of the temperatures across the chip.The complete methodology consists of the following steps: • 3D discretization of the chip: The spatial steps ∆x, ∆y in the xand y-direction are user defined, but the step ∆z along the z-direction is typically chosen to coincide with the interface between successive layers (metal and insulator).The discretization procedure naturally covers multiple layers in the z-direction and can be easily extended to model heterogeneous structures that can be found in modern chips (e.g., heat sinks).

•
Construction of equivalent electrical circuit: The RC elements of the electrical equivalent are calculated by ( 7) and ( 8).

•
Estimation of the power consumption profile of chip logic blocks.This determines the location and the time behavior of heat sources and the value of current sources (9) that constitute the vector u(t) in ( 10).

•
Formulation of equivalent circuit description: Using modified nodal analysis, the equivalent circuit is described by the ODE system (10).
Construction of the preconditioner matrix: Based on the algorithm described in the previous section, the preconditioner matrix is constructed based on [24], and the preconditioner-solve step is performed with the fast transform solver.More specifically, the thermal grid is equivalent to a highly regular resistive network, as depicted in Figure 2, with resistive branches connecting nodes in the x, y, and z axis.To create a preconditioner that will approximate the grid matrix, we substitute each horizontal and vertical thermal conductance with its average value in the corresponding layer.Moreover, we substitute each thermal conductance connecting nodes in adjacent layers (z axis) with their average value between the two layers.

•
Compute either the DC or transient solution: The solution is obtained with the iterative PCG method in both cases.Note that in the case of the transient solution, the backward-Euler numerical integration method as in ( 11) is employed for the calculation of temperature in each discrete time instant.The convergence of the method is accelerated with the highly parallel fast transform solver that is used in preconditioner-solve step in Algorithm 1.The proposed methodology offers significant advantages over the established thermal simulation methods.Firstly, iterative methods can handle large-scale problems in contrast to direct methods that do not scale well with the matrix dimension and can only be applied to a narrow range of problems.Furthermore, the fast preconditioned solution step exhibits near-optimal computational complexity, low memory requirements, and great potential for parallelism, which can harness the computational power of parallel architectures, such as multicore processors or GPUs, thus further reducing the amount of time required for simulation.

Experimental Results
Due to the lack of availability of benchmarks for full chip thermal analysis and in order to evaluate the efficiency of the proposed methodology for thermal simulation, we have created a set of artificial benchmark circuits that represent simplified microprocessor designs (like MIPS and LEON) with a random control logic and datapath, based on the theory described in Section 3. The technology node that was used for this work was 32 nm.Similarly, one can form the linear set of equations for different technology-specific parameters of Equations ( 7) and (8).Despite the fact that the benchmarks were artificially created, the problems that would arise from a real design would be represented by a similar set of linear equations.
In Table 2 discretization points are the number of discretizations in each axis, while layers are the number of layers that each benchmark has.Moreover, the matrix dimensions can be calculated by multiplying the square of the discretization points by the number of layers.All experiments were executed on a Linux workstation, comprised of an Intel Core i7 processor running at 2.4 GHz (six cores and 24 GB main memory) and an NVIDIA Tesla C2075 GPU with 6 GB of main memory.We have used the CUDA library [28] (Version 5.5, along with the CUBLAS, CUSPARSE, and CUFFT libraries) for mapping the proposed Fast-Transform PCG (FT-PCG) algorithm on the GPU.The ICCG was executed on the CPU since it would not be beneficial to port it on the GPU due many irregular memory transfers.The convergence tolerance (tol) for iterative solvers was set to 10 −6 , which is typically sufficient to yield perfect accuracy, and convergence was achieved in all cases.Table 2 presents the results from the evaluation of the aforementioned methods on the set of benchmark circuits.The number Iter. is the number of iterations that the algorithm needed to converge, while Time (s) refers to the time in seconds that were needed to compute the final solution.Both the number of iterations and time represent either the DC solution of the system or the average iterations and time of a time instant in transient analysis.
Table 2. Runtime results for the three solvers.Bench. is the name of the benchmark, Discr.Point. is the number of points that correspond to the x and y axis, Layers is the number of layers of each chip and corresponds to the z axis, Time (s) denotes the average time required for the solution at each iteration, Iter. is the average number of iterations required for the convergence of each iterative method, while Speedup denotes the speedup of our method over the ICCG.Comparing the iterative methods, it can be observed that the proposed method was able to reduce the number of iterations required for convergence greatly, as shown in Figure 3a.Compared with general purpose preconditioning methods such as ICCG, the proposed preconditioners take into account the topology characteristics of the thermal grid.As a result, they are able to approximate it faithfully enough and reduce the required number of iterations.Moreover, owing to their inherent parallelism, the proposed preconditioners can utilize the vast amount of computational resources found in massively-parallel architectures, such as GPUs.Thus, their efficiency is increased with the increasing circuit size, by greatly reducing the runtime for each time-step, as depicted in Figure 3b.FT-PCG was able to achieve a speed-up ranging between 1.5× and 2.2× in CPU execution and 16× and 26.93× in GPU execution over ICCG.

Conclusions
In this paper, we presented a fast thermal simulation method based on the RC equivalent, which uses fast transform solvers and preconditioned Krylov-subspace iterative solvers.The preconditioned iterative solvers offer linear scaling in simulation time as the thermal grid is increased.Experimental evaluation of the proposed method on a set of thermal benchmarks with the size ranging from 0.15 M-10 M nodes showed that the proposed methodology achieved a speedup ranging between 15.72× and 26.93× over a preconditioned iterative method with an incomplete Cholesky factorization preconditioner when GPUs are utilized.

Figure 1 .
Figure 1.Spatial discretization of a chip for thermal analysis and the formulation of the electrical equivalent problem.

Figure 2 .
Figure 2. Example of a 3D thermal grid that is used for preconditioning.

Figure 3 .
(a) Average number of iterations, (b) Average runtimes for each time-step in each benchmark.
Discrete Cosine Transform of Type-II IDCT-II Inverse Discrete Cosine Transform of Type-II FFT Fast Fourier Transform FT-PCG Fast Transform Preconditioned Conjugate Gradient

Table 1 .
Analogy between electrical and thermal circuits.