Open Access
Technologies 2019, 7(1), 1; https://doi.org/10.3390/technologies7010001
Article
A Preconditioned Iterative Approach for Efficient Full Chip Thermal Analysis on Massively Parallel Platforms †
^{1} Department of Electrical & Computer Engineering, University of Thessaly, 38221 Volos, Greece
^{2} Helic Inc., 2350 Mission College Boulevard, Suite 495, Santa Clara, CA 95054, USA
* Correspondence: gefloros@ece.uth.gr; Tel.: +302421074979
^{†} This paper is an extended version of our paper published in the Proceedings of the 7th International Conference on Modern Circuit and System Technologies on Electronics and Communications (MOCAST 2018), Thessaloniki, Greece, 7–9 May 2018.
Received: 1 November 2018 / Accepted: 17 December 2018 / Published: 20 December 2018
Abstract
Efficient full-chip thermal simulation is among the most challenging problems facing the EDA industry today, especially for modern 3D integrated circuits, because the huge linear systems resulting from thermal modeling approaches require unreasonably long computational times. While the formulation of the problem via a thermal equivalent circuit is prevalent and the model can be easily constructed, the numerical simulation of the corresponding 3D equation network is undesirably time-consuming. Direct linear solvers are not capable of handling such huge problems, and iterative methods are the only feasible approach. In this paper, we propose a computationally-efficient iterative method with a parallel preconditioning technique that exploits the resources of massively-parallel architectures such as Graphics Processing Units (GPUs). Experimental results demonstrate that the proposed method achieves a speedup of 2.2× in CPU execution and a speedup of 26.93× in GPU execution over a state-of-the-art iterative method.
Keywords:
thermal analysis; integrated circuits; electronic design automation

1. Introduction
The evolution of the manufacturing technology of Integrated Circuits (ICs) has continued unabated over the past fifty years, following the predictions of Moore’s law. It has led to extremely complex circuits (modern processors contain several billion transistors and are easily the most complex human constructions), but also to an analogous escalation of the problems related to the analysis and simulation of such circuits. Among these, thermal analysis is one of the most critical challenges arising from the technological evolution. The continuous push for smaller sizes in the sub-45-nm era and greater performance, as well as the new 3D structures, has begun to outpace the ability of heat sinks to dissipate the on-chip power.
In particular, aggravation of thermal effects is an inevitable consequence of the continuous scaling trend. High temperature has a significant impact on chip performance and functionality, leading to slower transistor speed, more leakage power, higher interconnect resistance, and reduced reliability [1]. The problem becomes more pronounced in modern technologies due to multi-layer 3D stacking and the use of new device technologies, like FinFETs and Silicon on Insulator (SOI), which are more sensitive to the self-heating effect [2]. Furthermore, as heat generation is non-uniform, local hotspots and spatial gradients appear. Stacking multiple layers in a 3D chip promises density and performance enhancement. However, it requires extensive thermal analysis, as the power density and temperature of these architectures can be quite high. For the above reasons, full-chip thermal analysis is a vital but extremely difficult problem, due to the size of the systems that need to be solved for multiple time points, and remains a key issue for future microprocessors and ICs [3,4]. Due to this fact, IC thermal analysis problems have drawn considerable attention over the past two decades. To deal with these challenges, prior approaches have focused on the formulation of the problem and on fast steady-state and transient thermal simulation in order to compute the temperature across the whole chip.
Direct methods (based on matrix factorization) have been widely used in the past for solving the resulting linear systems, mainly because of their robustness in most types of problems. Unfortunately, these methods do not scale well with the dimension of the linear system and, as thermal problems grow larger, become prohibitively expensive in both execution time and memory requirements. On the other hand, iterative Krylov-subspace methods such as Conjugate Gradients (CG) involve only inner products and matrix-vector products and constitute a better alternative for large sparse linear systems in many respects, being more computationally- and memory-efficient.
Moving beyond conventional direct solvers, our early work in [5] introduced an approach for full-chip thermal analysis that is based on the Finite Difference Method (FDM) for the formulation of an RC equivalent electrical network, in conjunction with a highly parallel iterative Krylov-subspace Preconditioned Conjugate Gradient (PCG) method, which overcomes the computational demands of the very large systems arising from thermal modeling. In particular, the contributions of this paper to the problem of thermal analysis are:
 Accelerated solution of thermal grids: The proposed thermal simulator uses FDM with preconditioned CG, which is well-suited to the problem, offers faster solution times, and uses less memory than sparse direct solvers.
 Highly parallel preconditioning mechanism: The specialized structures of thermal grids allow the usage of fast transform solvers as a preconditioning mechanism in the CG method, which is highly parallel and can be easily ported to GPUs.
 Fast convergence to the solution: Fast transform solvers can handle matrix blocks with different coefficients, which yields a good approximation of the grid matrix. This results in considerably more accurate preconditioners that make CG converge to the final solution in a few iterations.
Experimental results demonstrate that our method achieves speedups of around 20× on GPU and around 2× on CPU for a 10 M node thermal grid over a state-of-the-art iterative method, like Incomplete Cholesky Preconditioned Conjugate Gradient (ICCG), on a CPU.
The rest of the paper is organized as follows. Section 2 describes the related work on the thermal simulation problem. Section 3 introduces the thermal model that was used in the present work. Section 4 provides a brief description of the 3D fast transform solver. Section 5 describes the proposed approach, combining the methods presented in the two previous sections. Finally, Section 6 presents the results and a discussion about the advantages of the method, followed by the conclusions in Section 7.
2. Related Work
The growing need to simulate large-scale thermal models in technology nodes below 45 nm has led to some important research in the fast thermal estimation of IC chips. In this section, we briefly review some of these methods. Most transient thermal analysis methodologies have so far relied on solving the entire system, using different modeling techniques, based mainly on the Finite Element Method (FEM), the Finite Difference Method (FDM), and Green’s functions. The research work in [6,7] adopted the FDM, with a multigrid approach in order to speed up the simulation process, and the FDM with temporal and spatial adaptation to further accelerate thermal analysis was proposed in [8,9]. Similarly, in [10], the full-chip thermal transient equations were solved in a similar manner using an Alternating Direction Implicit (ADI) method for enhanced computational efficiency. Furthermore, in [11,12], the FDM approach and the RC equivalent were used along with modeling of the fluids for micro-cooling 3D structures. In [13], FEM was adopted for 2D and 3D geometries along with a multigrid preconditioning method and automatic mesh generation for chip geometries. Finally, Green’s functions were used in [14] with the discrete cosine transform and its inversion in order to accelerate the numerical computation of the homogeneous and inhomogeneous solution. However, these methods are efficient for a limited range of problems, since they have limited potential for parallelism.
Besides the previous conventional approaches, different methods like a Neural Net (NN) approach were used in [15], but since it was based on predictions, it did not always provide an accurate solution to the crucial problem of thermal analysis. Moreover, a Look-Up Table (LUT) method based on the power–thermal relation, which develops a double-mesh scheme to capture thermal characteristics and store the results in library files, was presented in [16]. However, modern chips can lead to huge library files due to their highly complex combined heat maps. Furthermore, the reduction of the problem size through a Model Order Reduction (MOR) process was proposed in [17,18], which can be useful in addressing the performance of individual devices, but in some cases, it is not enough to address all the reliability issues.
Finally, the authors in [19] provided a parallel iterative Generalized Minimal RESidual (GMRES) method for FDM and micro-cooling problems, but without any special preconditioning approach. Clearly, the concept of a dedicated fully-parallel preconditioning technique has not yet been introduced in the context of transient thermal analysis.
3. OnChip Thermal Modeling and Analysis
There are three modes of heat transfer: conduction, convection, and radiation. The primary mechanism of heat transfer in solids is by conduction, and the others can be neglected. The starting point for thermal analysis is Fourier’s law of heat conduction [20]:
$$\mathbf{q}(\mathbf{r},t)=-{k}_{t}\nabla T(\mathbf{r},t)$$
which states that the vector of heat flux density $\mathbf{q}$ (heat flow per unit area and unit time) is proportional to the negative gradient of temperature T at every spatial point $\mathbf{r}={[x,y,z]}^{T}$ and time t, where ${k}_{t}$ is the thermal conductivity of the material.
The conservation of energy also states that the divergence of the heat flux $\mathbf{q}$ equals the difference between the power generated by external heat sources and the rate of change of temperature, i.e.,
$$\nabla \cdot \mathbf{q}(\mathbf{r},t)=g(\mathbf{r},t)-\rho {c}_{p}\frac{\partial T(\mathbf{r},t)}{\partial t}$$
where $g(\mathbf{r},t)$ is the power density of the heat sources, ${c}_{p}$ is the specific heat capacity of the material, and $\rho$ is the density of the material. By combining (1) and (2), we have:
$$-{k}_{t}{\nabla}^{2}T(\mathbf{r},t)=g(\mathbf{r},t)-\rho {c}_{p}\frac{\partial T(\mathbf{r},t)}{\partial t}$$
which may be rewritten as the following parabolic Partial Differential Equation (PDE):
$$\rho {c}_{p}\frac{\partial T(\mathbf{r},t)}{\partial t}={k}_{t}{\nabla}^{2}T(\mathbf{r},t)+g(\mathbf{r},t)={k}_{t}\left(\frac{{\partial}^{2}T(\mathbf{r},t)}{\partial {x}^{2}}+\frac{{\partial}^{2}T(\mathbf{r},t)}{\partial {y}^{2}}+\frac{{\partial}^{2}T(\mathbf{r},t)}{\partial {z}^{2}}\right)+g(\mathbf{r},t)$$
(normally accompanied by appropriate boundary conditions [21]).
A common procedure for the numerical solution of (4) is by discretization along the three spatial coordinates with steps $\Delta x,\Delta y$, and $\Delta z$ and substitution of the spatial second-order derivatives by finite difference approximations, leading to the following expression for temperature ${T}_{i,j,k}$ at each discrete point $(i,j,k)$ in relation to its neighboring points:
$$\rho {c}_{p}\frac{d{T}_{i,j,k}}{dt}={k}_{t}\frac{{T}_{i+1,j,k}-2{T}_{i,j,k}+{T}_{i-1,j,k}}{\Delta {x}^{2}}+{k}_{t}\frac{{T}_{i,j+1,k}-2{T}_{i,j,k}+{T}_{i,j-1,k}}{\Delta {y}^{2}}+{k}_{t}\frac{{T}_{i,j,k+1}-2{T}_{i,j,k}+{T}_{i,j,k-1}}{\Delta {z}^{2}}+{g}_{i,j,k}$$
or by multiplying by $\Delta x\Delta y\Delta z$:
$$\rho {c}_{p}(\Delta x\Delta y\Delta z)\frac{d{T}_{i,j,k}}{dt}-{k}_{t}\frac{\Delta y\Delta z}{\Delta x}({T}_{i+1,j,k}-2{T}_{i,j,k}+{T}_{i-1,j,k})-{k}_{t}\frac{\Delta x\Delta z}{\Delta y}({T}_{i,j+1,k}-2{T}_{i,j,k}+{T}_{i,j-1,k})-{k}_{t}\frac{\Delta x\Delta y}{\Delta z}({T}_{i,j,k+1}-2{T}_{i,j,k}+{T}_{i,j,k-1})={g}_{i,j,k}(\Delta x\Delta y\Delta z)$$
There is a wellknown analogy between thermal and electrical conduction, where temperature corresponds to voltage and heat flow corresponds to current (see Table 1).
In light of this analogy, Equation (6) has a direct correspondence to an electrical circuit where there is a node at every discrete point or cell in the thermal grid (see Figure 1). Every circuit node is connected to spatiallyneighboring nodes via conductances in the directions x, y, z with values:
$${G}_{x}\equiv \frac{{k}_{t}\Delta y\Delta z}{\Delta x},\phantom{\rule{1.em}{0ex}}{G}_{y}\equiv \frac{{k}_{t}\Delta x\Delta z}{\Delta y},\phantom{\rule{1.em}{0ex}}{G}_{z}\equiv \frac{{k}_{t}\Delta x\Delta y}{\Delta z}$$
and there is a capacitance to ground at every node or thermal cell with value:
$$C\equiv \rho {c}_{p}(\Delta x\Delta y\Delta z)$$
The heat sources constitute input excitations and are modeled in the equivalent circuit as the current sources with values:
$${I}_{i,j,k}\equiv {g}_{i,j,k}(\Delta x\Delta y\Delta z)$$
The above current sources are connected at the specific points $(i,j,k)$ or circuit nodes where there is heat flow (i.e., power dissipation from the underlying chip logic blocks).
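As a concrete illustration, the per-cell element values of Equations (7)–(9) follow directly from the material properties and the grid steps. A minimal sketch is shown below; the silicon parameter values in the usage example are typical textbook numbers assumed for illustration, not values taken from this paper:

```python
def cell_rc(kt, rho, cp, dx, dy, dz):
    """Per-cell element values of the thermal equivalent circuit.
    kt: thermal conductivity [W/(m K)], rho: density [kg/m^3],
    cp: specific heat capacity [J/(kg K)], dx/dy/dz: grid steps [m]."""
    Gx = kt * dy * dz / dx           # conductance in the x-direction (Eq. 7)
    Gy = kt * dx * dz / dy           # conductance in the y-direction (Eq. 7)
    Gz = kt * dx * dy / dz           # conductance in the z-direction (Eq. 7)
    C = rho * cp * dx * dy * dz      # capacitance to ground (Eq. 8)
    return Gx, Gy, Gz, C

# Assumed typical silicon values on a uniform 10 um grid:
Gx, Gy, Gz, C = cell_rc(kt=150.0, rho=2330.0, cp=700.0,
                        dx=1e-5, dy=1e-5, dz=1e-5)
```

A heat source dissipating power $P$ inside a cell simply becomes a current source of value $P$ attached to that cell's node, per Equation (9).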
The resulting electrical equivalent circuit is described in the time domain, using the Modified Nodal Analysis (MNA) framework, by a system of Ordinary Differential Equations (ODE):
$$\mathbf{G}\mathbf{x}\left(t\right)+\mathbf{C}\frac{d\mathbf{x}\left(t\right)}{dt}=\mathbf{u}\left(t\right)$$
where $\mathbf{G}\in {\mathcal{R}}^{n\times n}$ is a symmetric and positive definite matrix of the conductances (7), $\mathbf{C}\in {\mathcal{R}}^{n\times n}$ is a diagonal matrix of cell capacitances (8), $\mathbf{x}\in {\mathcal{R}}^{n}$ is the vector of unknown temperatures ${T}_{i,j,k}$ at all discretization points (constituting internal states of the system), and $\mathbf{u}\in {\mathcal{R}}^{p}$ is the vector of input excitations from the current sources ${I}_{i,j,k}$ of (9).
For transient simulation, we can discretize the time interval into time instants ${t}_{k}$, $k=1,2,\cdots$ and use the backward-Euler numerical integration method for the calculation of temperature at each discrete time instant ${t}_{k}$:
$$(\mathbf{G}+\frac{\mathbf{C}}{{h}_{k}})\mathbf{x}\left({t}_{k}\right)=\frac{\mathbf{C}}{{h}_{k}}\mathbf{x}\left({t}_{k-1}\right)+\mathbf{u}\left({t}_{k}\right)$$
where ${h}_{k}={t}_{k}-{t}_{k-1}$, $k=1,2,\cdots$ is the time step of time instant ${t}_{k}$ (which may in general vary during transient analysis). The above equation involves the solution of a very large sparse linear system at each time instant ${t}_{k}$.
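The time-stepping scheme of (11) can be sketched as follows. For brevity, this illustrative snippet holds $h_k$ constant and solves each step with a sparse direct factorization reused across steps, whereas the method proposed in this paper performs each solve with the preconditioned CG described next:

```python
import scipy.sparse.linalg as spla

def transient_thermal(G, C, u, h, x0, steps):
    """Backward-Euler integration of G x + C dx/dt = u(t), Eq. (11):
    (G + C/h) x(t_k) = (C/h) x(t_{k-1}) + u(t_k), with a fixed step h.
    G, C: scipy sparse matrices; u: callable returning the source vector."""
    A = (G + C / h).tocsc()
    solve = spla.factorized(A)        # factor once, reuse at every step
    x = x0.copy()
    history = []
    for k in range(1, steps + 1):
        rhs = C.dot(x) / h + u(k * h)  # right-hand side of Eq. (11)
        x = solve(rhs)
        history.append(x.copy())
    return history
```

With a constant input, the iterates approach the steady state $\mathbf{x}=\mathbf{G}^{-1}\mathbf{u}$, as expected from (10) with $d\mathbf{x}/dt=0$.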
Direct methods (based on matrix factorization) have been widely used in the past for solving the resulting linear systems, mainly because of their robustness in most types of problems. Unfortunately, these methods do not scale well with the dimension of the linear system and become prohibitively expensive for largescale networks. Iterative methods involve only inner products and matrixvector products and constitute a better alternative for large sparse linear systems in many respects, being more computationally and memory efficient. In this work, we employ iterative methods for the solution of largescale networks arising from the 3D discretization of the chip for thermal analysis. The system matrices arising from the modeling of the thermal grid can also be shown to be Symmetric and Positive Definite (SPD), which allows the use of the efficient method of the Conjugate Gradient (CG) for the solution of the corresponding linear systems. The CG method is well known [22], and its implementation is shown in Algorithm 1.
Algorithm 1 Preconditioned conjugate gradient. 

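Since the listing did not survive conversion, the sketch below restates the standard PCG iteration [22,23] that Algorithm 1 refers to. Here `A` and `M_solve` are user-supplied callables (an assumption of this sketch), the latter implementing the preconditioner-solve step $\mathbf{Mz}=\mathbf{r}$ discussed below:

```python
import numpy as np

def pcg(A, b, M_solve, tol=1e-8, max_iter=1000):
    """Preconditioned Conjugate Gradient for an SPD system A x = b.
    A(v) returns the matrix-vector product; M_solve(r) solves M z = r."""
    x = np.zeros_like(b)
    r = b - A(x)                      # initial residual
    z = M_solve(r)                    # preconditioner-solve step
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        Ap = A(p)
        alpha = rz / (p @ Ap)         # step length
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) <= tol * b_norm:
            break
        z = M_solve(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p     # new A-conjugate search direction
        rz = rz_new
    return x
```

Passing `M_solve = lambda r: r` recovers plain (unpreconditioned) CG.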
Regarding the convergence rate of CG, it can be shown [23] that the required number of iterations (for a given initial guess and convergence tolerance) is bounded in terms of the spectral condition number ${k}_{2}(\mathbf{A})={\parallel \mathbf{A}\parallel}_{2}{\parallel {\mathbf{A}}^{-1}\parallel}_{2}\ge 1$; specifically, it is $O(\sqrt{{k}_{2}(\mathbf{A})})$. For SPD matrices, ${k}_{2}(\mathbf{A})=\frac{{\lambda}_{max}(\mathbf{A})}{{\lambda}_{min}(\mathbf{A})}$, where ${\lambda}_{max}(\mathbf{A})$, ${\lambda}_{min}(\mathbf{A})$ are the maximum and minimum eigenvalues of $\mathbf{A}$, respectively. This means that convergence of CG is fast when ${k}_{2}(\mathbf{A})\approx 1$ and slow when ${k}_{2}(\mathbf{A})\gg 1$.
To improve the convergence speed, it is necessary to apply a preconditioning mechanism, which transforms the initial linear system into an equivalent one with a more favorable spectral condition number. The so-called preconditioner is a matrix $\mathbf{M}$ that approximates $\mathbf{A}$ in some way, such that the transformed system ${\mathbf{M}}^{-1}\mathbf{A}\mathbf{x}={\mathbf{M}}^{-1}\mathbf{b}$ (which obviously has the same solution as the initial $\mathbf{A}\mathbf{x}=\mathbf{b}$) exhibits a condition number close to ${k}_{2}(\mathbf{I})=1$. In practice, it is not necessary to invert the preconditioner $\mathbf{M}$ and apply it directly to the system $\mathbf{A}\mathbf{x}=\mathbf{b}$. It can be shown that the same effect is accomplished by introducing an extra computational step within the iterative method, which entails solving a system $\mathbf{Mz}=\mathbf{r}$ with known Right-Hand Side (RHS) vector $\mathbf{r}$ and unknown vector $\mathbf{z}$ in every iteration [23].
From the above, it follows that a good preconditioner $\mathbf{M}$ must satisfy two key properties:
 Fast convergence of the preconditioned iterative method, i.e., $\mathbf{M}$ approximates $\mathbf{A}$ well.
 A linear system involving $\mathbf{M}$ is solved much more efficiently than the original system that involves $\mathbf{A}$.
4. Fast Transform Preconditioners for 3D Thermal Networks
Recent implementations of fast transform solvers have shown great potential for the solution of block-tridiagonal systems with a special structure [24,25]. This section describes such an algorithm for the solution of an appropriate preconditioner system $\mathbf{M}\mathbf{z}=\mathbf{r}$ by the use of a fast transform solver in a near-optimal number of operations.
Let $\mathbf{M}$ be an $N\times N$ block-tridiagonal matrix with l diagonal blocks of size $mn\times mn$ each (overall $N=lmn$), where l is very small (typically 5–8, depending on the material layers (metal and insulator) of the chip), with the following form:
$$\mathbf{M}=\left[\begin{array}{ccccc}{\mathbf{M}}_{1}& -{\delta}_{1}{\mathbf{I}}_{mn}& & & \\ -{\delta}_{1}{\mathbf{I}}_{mn}& {\mathbf{M}}_{2}& -{\delta}_{2}{\mathbf{I}}_{mn}& & \\ & \ddots & \ddots & \ddots & \\ & & -{\delta}_{l-2}{\mathbf{I}}_{mn}& {\mathbf{M}}_{l-1}& -{\delta}_{l-1}{\mathbf{I}}_{mn}\\ & & & -{\delta}_{l-1}{\mathbf{I}}_{mn}& {\mathbf{M}}_{l}\end{array}\right]$$
where ${\mathbf{I}}_{mn}$ is the $mn\times mn$ identity matrix and ${\mathbf{M}}_{i}$, $i=1,\cdots ,l$, are block tridiagonal $mn\times mn$ matrices of the form:
$${\mathbf{M}}_{i}=\left[\begin{array}{ccccc}{\mathbf{T}}_{i}+{\gamma}_{i}{\mathbf{I}}_{n}& -{\gamma}_{i}{\mathbf{I}}_{n}& & & \\ -{\gamma}_{i}{\mathbf{I}}_{n}& {\mathbf{T}}_{i}+2{\gamma}_{i}{\mathbf{I}}_{n}& -{\gamma}_{i}{\mathbf{I}}_{n}& & \\ & \ddots & \ddots & \ddots & \\ & & -{\gamma}_{i}{\mathbf{I}}_{n}& {\mathbf{T}}_{i}+2{\gamma}_{i}{\mathbf{I}}_{n}& -{\gamma}_{i}{\mathbf{I}}_{n}\\ & & & -{\gamma}_{i}{\mathbf{I}}_{n}& {\mathbf{T}}_{i}+{\gamma}_{i}{\mathbf{I}}_{n}\end{array}\right]$$
where ${\mathbf{I}}_{n}$ is the $n\times n$ identity matrix and ${\mathbf{T}}_{i}$, $i=1,\cdots ,l$, are $n\times n$ tridiagonal matrices with the following form:
$${\mathbf{T}}_{i}=\left[\begin{array}{ccccc}{\alpha}_{i}+{\beta}_{i}& -{\alpha}_{i}& & & \\ -{\alpha}_{i}& 2{\alpha}_{i}+{\beta}_{i}& -{\alpha}_{i}& & \\ & \ddots & \ddots & \ddots & \\ & & -{\alpha}_{i}& 2{\alpha}_{i}+{\beta}_{i}& -{\alpha}_{i}\\ & & & -{\alpha}_{i}& {\alpha}_{i}+{\beta}_{i}\end{array}\right]={\alpha}_{i}\left[\begin{array}{ccccc}1& -1& & & \\ -1& 2& -1& & \\ & \ddots & \ddots & \ddots & \\ & & -1& 2& -1\\ & & & -1& 1\end{array}\right]+{\beta}_{i}\mathbf{I}$$
This class of tridiagonal matrices has an a priori known eigendecomposition. Specifically, it can be shown [26] that each ${\mathbf{T}}_{i}$ has n distinct eigenvalues ${\lambda}_{i,j},j=1,\cdots ,n$, which are given by:
and a set of n orthonormal eigenvectors ${\mathbf{q}}_{j},j=1,\cdots ,n$, with elements:
$${\lambda}_{i,j}={\beta}_{i}+4{\alpha}_{i}{sin}^{2}\left(\frac{(j-1)\pi}{2n}\right)={\beta}_{i}+{\alpha}_{i}\left(2-2cos\left(\frac{(j-1)\pi}{n}\right)\right)$$
$${q}_{j,k}=\left\{\begin{array}{cc}\sqrt{\frac{1}{n}}cos\frac{(2k-1)(j-1)\pi}{2n},\hfill & \phantom{\rule{1.em}{0ex}}j=1,\phantom{\rule{1.em}{0ex}}k=1,\cdots ,n\hfill \\ \sqrt{\frac{2}{n}}cos\frac{(2k-1)(j-1)\pi}{2n},\hfill & \phantom{\rule{1.em}{0ex}}j=2,\cdots ,n,\phantom{\rule{1.em}{0ex}}k=1,\cdots ,n\hfill \end{array}\right.$$
Note that the eigenvectors do not depend on the values ${\alpha}_{i}$ and ${\beta}_{i}$ and are the same for every matrix ${\mathbf{T}}_{i}$. If ${\mathbf{Q}}_{n}=[{\mathbf{q}}_{1},\cdots ,{\mathbf{q}}_{n}]$ denotes the matrix whose columns are the eigenvectors ${\mathbf{q}}_{j}$, then due to the eigendecomposition of ${\mathbf{T}}_{i}$, we have ${\mathbf{Q}}_{n}^{T}{\mathbf{T}}_{i}{\mathbf{Q}}_{n}={\mathsf{\Lambda}}_{i}=diag({\lambda}_{i,1},\cdots ,{\lambda}_{i,n})$. By exploiting the diagonalization of the matrix ${\mathbf{T}}_{i}$ and considering that ${\mathbf{Q}}_{n}^{T}{\mathbf{Q}}_{n}=\mathbf{I}$, the system $\mathbf{Mz}=\mathbf{r}$ is equivalent to the following system:
$$\left[\begin{array}{ccc}{\mathbf{Q}}_{n}^{T}& & \\ & \ddots & \\ & & {\mathbf{Q}}_{n}^{T}\end{array}\right]\mathbf{M}\left[\begin{array}{ccc}{\mathbf{Q}}_{n}& & \\ & \ddots & \\ & & {\mathbf{Q}}_{n}\end{array}\right]\left[\begin{array}{ccc}{\mathbf{Q}}_{n}^{T}& & \\ & \ddots & \\ & & {\mathbf{Q}}_{n}^{T}\end{array}\right]\mathbf{z}=\left[\begin{array}{ccc}{\mathbf{Q}}_{n}^{T}& & \\ & \ddots & \\ & & {\mathbf{Q}}_{n}^{T}\end{array}\right]\mathbf{r}\iff$$
$$\left[\begin{array}{ccccc}{\tilde{\mathbf{M}}}_{1}& -{\delta}_{1}{\mathbf{I}}_{mn}& & & \\ -{\delta}_{1}{\mathbf{I}}_{mn}& {\tilde{\mathbf{M}}}_{2}& -{\delta}_{2}{\mathbf{I}}_{mn}& & \\ & \ddots & \ddots & \ddots & \\ & & -{\delta}_{l-2}{\mathbf{I}}_{mn}& {\tilde{\mathbf{M}}}_{l-1}& -{\delta}_{l-1}{\mathbf{I}}_{mn}\\ & & & -{\delta}_{l-1}{\mathbf{I}}_{mn}& {\tilde{\mathbf{M}}}_{l}\end{array}\right]\tilde{\mathbf{z}}=\tilde{\mathbf{r}}$$
where:
$${\tilde{\mathbf{M}}}_{i}=\left[\begin{array}{ccccc}{\mathsf{\Lambda}}_{i}^{\left(1\right)}& -{\gamma}_{i}\mathbf{I}& & & \\ -{\gamma}_{i}\mathbf{I}& {\mathsf{\Lambda}}_{i}^{\left(2\right)}& -{\gamma}_{i}\mathbf{I}& & \\ & \ddots & \ddots & \ddots & \\ & & -{\gamma}_{i}\mathbf{I}& {\mathsf{\Lambda}}_{i}^{\left(2\right)}& -{\gamma}_{i}\mathbf{I}\\ & & & -{\gamma}_{i}\mathbf{I}& {\mathsf{\Lambda}}_{i}^{\left(1\right)}\end{array}\right]$$
$$\tilde{\mathbf{z}}=\left[\begin{array}{ccc}{\mathbf{Q}}_{n}^{T}& & \\ & \ddots & \\ & & {\mathbf{Q}}_{n}^{T}\end{array}\right]\mathbf{z},\phantom{\rule{1.em}{0ex}}\tilde{\mathbf{r}}=\left[\begin{array}{ccc}{\mathbf{Q}}_{n}^{T}& & \\ & \ddots & \\ & & {\mathbf{Q}}_{n}^{T}\end{array}\right]\mathbf{r}$$
and ${\mathsf{\Lambda}}_{i}^{\left(1\right)}=diag({\lambda}_{i,1}^{\left(1\right)},\cdots ,{\lambda}_{i,n}^{\left(1\right)})$ and ${\mathsf{\Lambda}}_{i}^{\left(2\right)}=diag({\lambda}_{i,1}^{\left(2\right)},\cdots ,{\lambda}_{i,n}^{\left(2\right)})$ are diagonal matrices with the eigenvalues of ${\mathbf{T}}_{i}+{\gamma}_{i}{\mathbf{I}}_{n}$ and ${\mathbf{T}}_{i}+2{\gamma}_{i}{\mathbf{I}}_{n}$, respectively, which are the following:
$${\lambda}_{i,j}^{\left(1\right)}={\gamma}_{i}+{\beta}_{i}+{\alpha}_{i}\left(2-2cos\left(\frac{(j-1)\pi}{n}\right)\right),\phantom{\rule{1.em}{0ex}}j=1,\cdots ,n,$$
$${\lambda}_{i,j}^{\left(2\right)}=2{\gamma}_{i}+{\beta}_{i}+{\alpha}_{i}\left(2-2cos\left(\frac{(j-1)\pi}{n}\right)\right),\phantom{\rule{1.em}{0ex}}j=1,\cdots ,n$$
If the $N\times 1$ vectors $\mathbf{r}$, $\mathbf{z}$, $\tilde{\mathbf{r}}$, $\tilde{\mathbf{z}}$ are also partitioned into $lm$ blocks of size $n\times 1$ each, i.e.,
$$\mathbf{r}=\left[\begin{array}{c}{\mathbf{r}}_{1}\\ \vdots \\ {\mathbf{r}}_{lm}\end{array}\right],\phantom{\rule{1.em}{0ex}}\mathbf{z}=\left[\begin{array}{c}{\mathbf{z}}_{1}\\ \vdots \\ {\mathbf{z}}_{lm}\end{array}\right],\phantom{\rule{1.em}{0ex}}\tilde{\mathbf{r}}=\left[\begin{array}{c}{\tilde{\mathbf{r}}}_{1}\\ \vdots \\ {\tilde{\mathbf{r}}}_{lm}\end{array}\right],\phantom{\rule{1.em}{0ex}}\tilde{\mathbf{z}}=\left[\begin{array}{c}{\tilde{\mathbf{z}}}_{1}\\ \vdots \\ {\tilde{\mathbf{z}}}_{lm}\end{array}\right]$$
then we have ${\tilde{\mathbf{r}}}_{i}={\mathbf{Q}}_{n}^{T}{\mathbf{r}}_{i}$ and ${\tilde{\mathbf{z}}}_{i}={\mathbf{Q}}_{n}^{T}{\mathbf{z}}_{i}\iff {\mathbf{z}}_{i}={\mathbf{Q}}_{n}{\tilde{\mathbf{z}}}_{i}$, $i=1,\cdots ,lm$.
However, it can be shown [27] that each product ${\mathbf{Q}}_{n}^{T}{\mathbf{r}}_{i}={\tilde{\mathbf{r}}}_{i}$ corresponds to a Discrete Cosine Transform of Type-II (DCT-II) on ${\mathbf{r}}_{i}$, and each product ${\mathbf{Q}}_{n}{\tilde{\mathbf{z}}}_{i}={\mathbf{z}}_{i}$ corresponds to an Inverse Discrete Cosine Transform of Type-II (IDCT-II) on ${\tilde{\mathbf{z}}}_{i}$. This means that the computation of the whole vector $\tilde{\mathbf{r}}$ from $\mathbf{r}$ amounts to $lm$ independent DCT-II transforms of size n, and the computation of the whole vector $\mathbf{z}$ from $\tilde{\mathbf{z}}$ amounts to $lm$ independent IDCT-II transforms of size n. A modification of the Fast Fourier Transform (FFT) can be employed for each of the $lm$ independent DCT-II/IDCT-II transforms [27], giving a total near-optimal operation count of $\mathcal{O}(lmnlogn)=\mathcal{O}(Nlogn)$.
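The diagonalization property can be checked numerically: solving ${\mathbf{T}}_{i}\mathbf{x}=\mathbf{b}$ reduces to one DCT-II, a diagonal scaling by the eigenvalues (15), and one IDCT-II. The sketch below (assuming ${\beta}_{i}>0$ so that ${\mathbf{T}}_{i}$ is nonsingular) uses `scipy.fft`, whose orthonormal DCT-II coincides with multiplication by ${\mathbf{Q}}_{n}^{T}$:

```python
import numpy as np
from scipy.fft import dct, idct

def solve_Ti(alpha, beta, b):
    """Solve T_i x = b, where T_i = alpha*L + beta*I and L is the
    tridiagonal matrix of (14): 2 on the diagonal (1 at the two corners),
    -1 on the off-diagonals."""
    n = len(b)
    j = np.arange(n)
    # Eigenvalues of T_i, Eq. (15), in 0-based indexing.
    lam = beta + 4.0 * alpha * np.sin(j * np.pi / (2 * n)) ** 2
    b_hat = dct(b, type=2, norm='ortho')            # Q_n^T b  (DCT-II)
    return idct(b_hat / lam, type=2, norm='ortho')  # Q_n (diagonal solve)
```

The same three-step pattern (transform, diagonal solve, inverse transform) is what the full preconditioner solve of this section applies along each grid direction.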
If now, $\mathbf{P}$ is a permutation matrix of size $mn\times mn$ that reorders the elements of a vector or the rows of a matrix as $1,n+1,\cdots ,(m1)n+1,2,n+2,\cdots ,(m1)n+2,\cdots ,n,n+n,\cdots ,(m1)n+n$ and ${\mathbf{P}}_{1}$, ${\mathbf{P}}_{1}^{T}$ are the blockdiagonal $lmn\times lmn$ permutation matrices ${\mathbf{P}}_{1}=diag(\mathbf{P},\cdots ,\mathbf{P})$, ${\mathbf{P}}_{1}^{T}=diag({\mathbf{P}}^{T},\cdots ,{\mathbf{P}}^{T})$, then the system at (18) is transformed into:
$${\mathbf{P}}_{1}\left[\begin{array}{ccccc}{\tilde{\mathbf{M}}}_{1}& -{\delta}_{1}{\mathbf{I}}_{mn}& & & \\ -{\delta}_{1}{\mathbf{I}}_{mn}& {\tilde{\mathbf{M}}}_{2}& -{\delta}_{2}{\mathbf{I}}_{mn}& & \\ & \ddots & \ddots & \ddots & \\ & & -{\delta}_{l-2}{\mathbf{I}}_{mn}& {\tilde{\mathbf{M}}}_{l-1}& -{\delta}_{l-1}{\mathbf{I}}_{mn}\\ & & & -{\delta}_{l-1}{\mathbf{I}}_{mn}& {\tilde{\mathbf{M}}}_{l}\end{array}\right]{\mathbf{P}}_{1}^{T}{\mathbf{P}}_{1}\tilde{\mathbf{z}}={\mathbf{P}}_{1}\tilde{\mathbf{r}}\iff$$
$$\left[\begin{array}{ccccc}{\mathbf{D}}_{1}& -{\delta}_{1}{\mathbf{I}}_{mn}& & & \\ -{\delta}_{1}{\mathbf{I}}_{mn}& {\mathbf{D}}_{2}& -{\delta}_{2}{\mathbf{I}}_{mn}& & \\ & \ddots & \ddots & \ddots & \\ & & -{\delta}_{l-2}{\mathbf{I}}_{mn}& {\mathbf{D}}_{l-1}& -{\delta}_{l-1}{\mathbf{I}}_{mn}\\ & & & -{\delta}_{l-1}{\mathbf{I}}_{mn}& {\mathbf{D}}_{l}\end{array}\right]{\tilde{\mathbf{z}}}^{{\mathbf{P}}_{1}}={\tilde{\mathbf{r}}}^{{\mathbf{P}}_{1}}$$
where ${\mathbf{D}}_{i}=diag({\tilde{\mathbf{T}}}_{i,1},\cdots ,{\tilde{\mathbf{T}}}_{i,n})$, $i=1,\cdots ,l$, with ${\tilde{\mathbf{T}}}_{i,j}$, $j=1,\cdots ,n$ being $m\times m$ tridiagonal matrices of the form:
$${\tilde{\mathbf{T}}}_{i,j}=\left[\begin{array}{ccccc}{\lambda}_{i,j}^{\left(1\right)}& -{\gamma}_{i}& & & \\ -{\gamma}_{i}& {\lambda}_{i,j}^{\left(2\right)}& -{\gamma}_{i}& & \\ & \ddots & \ddots & \ddots & \\ & & -{\gamma}_{i}& {\lambda}_{i,j}^{\left(2\right)}& -{\gamma}_{i}\\ & & & -{\gamma}_{i}& {\lambda}_{i,j}^{\left(1\right)}\end{array}\right]={\gamma}_{i}\left[\begin{array}{ccccc}1& -1& & & \\ -1& 2& -1& & \\ & \ddots & \ddots & \ddots & \\ & & -1& 2& -1\\ & & & -1& 1\end{array}\right]+\left({\beta}_{i}+{\alpha}_{i}\left(2-2cos\left(\frac{(j-1)\pi}{n}\right)\right)\right){\mathbf{I}}_{m}$$
and ${\tilde{\mathbf{z}}}^{{\mathbf{P}}_{1}}={\mathbf{P}}_{1}\tilde{\mathbf{z}}$, ${\tilde{\mathbf{r}}}^{{\mathbf{P}}_{1}}={\mathbf{P}}_{1}\tilde{\mathbf{r}}$. If ${\tilde{\mathsf{\Lambda}}}_{i,j}=diag({\tilde{\lambda}}_{i,j,1},\cdots ,{\tilde{\lambda}}_{i,j,m})$ is the diagonal matrix with the eigenvalues of ${\tilde{\mathbf{T}}}_{i,j}$, which are:
$${\tilde{\lambda}}_{i,j,k}={\gamma}_{i}\left(2-2cos\left(\frac{(k-1)\pi}{m}\right)\right)+{\beta}_{i}+{\alpha}_{i}\left(2-2cos\left(\frac{(j-1)\pi}{n}\right)\right),\phantom{\rule{1.em}{0ex}}k=1,\cdots ,m$$
and ${\mathbf{Q}}_{m}$ is the common matrix of eigenvectors for all ${\tilde{\mathbf{T}}}_{i,j}$, then by similar reasoning as in (17), the system (21) is equivalent to:
$$\left[\begin{array}{ccccc}{\tilde{\mathbf{D}}}_{1}& -{\delta}_{1}{\mathbf{I}}_{mn}& & & \\ -{\delta}_{1}{\mathbf{I}}_{mn}& {\tilde{\mathbf{D}}}_{2}& -{\delta}_{2}{\mathbf{I}}_{mn}& & \\ & \ddots & \ddots & \ddots & \\ & & -{\delta}_{l-2}{\mathbf{I}}_{mn}& {\tilde{\mathbf{D}}}_{l-1}& -{\delta}_{l-1}{\mathbf{I}}_{mn}\\ & & & -{\delta}_{l-1}{\mathbf{I}}_{mn}& {\tilde{\mathbf{D}}}_{l}\end{array}\right]\tilde{\tilde{\mathbf{z}}}=\tilde{\tilde{\mathbf{r}}}$$
where ${\tilde{\mathbf{D}}}_{i}=diag({\tilde{\mathsf{\Lambda}}}_{i,1},\cdots ,{\tilde{\mathsf{\Lambda}}}_{i,n})$ and:
$$\tilde{\tilde{\mathbf{z}}}=\left[\begin{array}{ccc}{\mathbf{Q}}_{m}^{T}& & \\ & \ddots & \\ & & {\mathbf{Q}}_{m}^{T}\end{array}\right]{\tilde{\mathbf{z}}}^{{\mathbf{P}}_{1}},\phantom{\rule{1.em}{0ex}}\tilde{\tilde{\mathbf{r}}}=\left[\begin{array}{ccc}{\mathbf{Q}}_{m}^{T}& & \\ & \ddots & \\ & & {\mathbf{Q}}_{m}^{T}\end{array}\right]{\tilde{\mathbf{r}}}^{{\mathbf{P}}_{1}}$$
As previously, the $N\times 1$ vectors ${\tilde{\mathbf{z}}}^{{\mathbf{P}}_{1}}$, ${\tilde{\mathbf{r}}}^{{\mathbf{P}}_{1}}$, $\tilde{\tilde{\mathbf{z}}}$, and $\tilde{\tilde{\mathbf{r}}}$ can be partitioned into $ln$ subvectors of size $m\times 1$ each, and the DCT-II and IDCT-II are performed accordingly, giving a total near-optimal operation count of $\mathcal{O}(lnmlogm)=\mathcal{O}(Nlogm)$.
If now, ${\mathbf{P}}_{2}$ is a permutation matrix of size $N\times N$ that reorders the elements of a vector or the rows of a matrix as $1,mn+1,2mn+1,\cdots ,(l1)mn+1,2,mn+2,2mn+2,\cdots ,(l1)mn+2,\cdots ,mn,mn+mn,2mn+mn,\cdots ,(l1)mn+mn$, and ${\mathbf{P}}_{2}^{T}$ is the inverse permutation matrix, then System (24) is equivalent to:
$$\tilde{\tilde{\mathbf{M}}}{\tilde{\tilde{\mathbf{z}}}}^{{\mathbf{P}}_{2}}={\tilde{\tilde{\mathbf{r}}}}^{{\mathbf{P}}_{2}}$$
where $\tilde{\tilde{\mathbf{M}}}=diag({\tilde{\tilde{\mathbf{T}}}}_{1,1},{\tilde{\tilde{\mathbf{T}}}}_{1,2},\cdots ,{\tilde{\tilde{\mathbf{T}}}}_{1,m},{\tilde{\tilde{\mathbf{T}}}}_{2,1},{\tilde{\tilde{\mathbf{T}}}}_{2,2},\cdots ,{\tilde{\tilde{\mathbf{T}}}}_{2,m},\cdots ,{\tilde{\tilde{\mathbf{T}}}}_{n,m})$, with ${\tilde{\tilde{\mathbf{T}}}}_{j,k}$, $j=1,\cdots ,n$, $k=1,\cdots ,m$ being $l\times l$ tridiagonal matrices of the form:
$${\tilde{\tilde{\mathbf{T}}}}_{j,k}=\left[\begin{array}{ccccc}{\tilde{\lambda}}_{1,j,k}& -{\delta}_{1}& & & \\ -{\delta}_{1}& {\tilde{\lambda}}_{2,j,k}& -{\delta}_{2}& & \\ & \ddots & \ddots & \ddots & \\ & & -{\delta}_{l-2}& {\tilde{\lambda}}_{l-1,j,k}& -{\delta}_{l-1}\\ & & & -{\delta}_{l-1}& {\tilde{\lambda}}_{l,j,k}\end{array}\right]$$
and ${\tilde{\tilde{\mathbf{z}}}}^{{\mathbf{P}}_{2}}={\mathbf{P}}_{2}\tilde{\tilde{\mathbf{z}}}$, ${\tilde{\tilde{\mathbf{r}}}}^{{\mathbf{P}}_{2}}={\mathbf{P}}_{2}\tilde{\tilde{\mathbf{r}}}$.
Taking the above equations into account and applying permutation matrices to reorder the elements of the system $\mathbf{Mz}=\mathbf{r}$, a fast solution of the preconditioner-solve step can be obtained, as shown in Algorithm 2.
Algorithm 2 Preconditioner solution for the thermal grid. 
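The overall preconditioner-solve step can be sketched in Python as follows. This is a hypothetical illustration, not the paper's implementation: the function name `precond_solve`, the grid-shaped array layout (which plays the role of the permutations ${\mathbf{P}}_{1}$ and ${\mathbf{P}}_{2}$), and the use of SciPy's DCT in place of the ${\mathbf{Q}}^{T}$ blocks are all assumptions.

```python
# Illustrative sketch of a fast-transform preconditioner solve Mz = r.
import numpy as np
from scipy.fft import dct, idct
from scipy.linalg import solveh_banded

def precond_solve(r, lam, delta):
    """Hypothetical fast-transform preconditioner solve.

    r     : residual arranged on an n x m x l grid (x, y, layers);
            the reshaped layout plays the role of the permutations P1, P2.
    lam   : (n, m, l) array of eigenvalue diagonals lambda_{i,j,k}.
    delta : (l-1,) off-diagonal couplings between adjacent layers.
    """
    n, m, l = r.shape
    # Forward DCT-II along the x and y directions (the Q^T blocks).
    rt = dct(dct(r, type=2, norm='ortho', axis=0), type=2, norm='ortho', axis=1)
    # n*m independent l x l symmetric tridiagonal solves along z.
    zt = np.empty_like(rt)
    ab = np.zeros((2, l))
    ab[0, 1:] = delta           # superdiagonal in banded (upper) storage
    for j in range(n):
        for k in range(m):
            ab[1, :] = lam[j, k, :]
            zt[j, k, :] = solveh_banded(ab, rt[j, k, :])
    # Inverse DCT-II along y and x (the Q blocks).
    return idct(idct(zt, type=2, norm='ortho', axis=1), type=2, norm='ortho', axis=0)
```

As a quick sanity check of the structure: with zero inter-layer coupling and a constant diagonal $c$, the forward and inverse transforms cancel and the solve reduces to elementwise division by $c$.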

5. Methodology for Full Chip Thermal Analysis
This section applies the theoretical background developed above to the computation of the temperatures across the chip. The complete methodology consists of the following steps:
 3D discretization of the chip: The spatial steps $\Delta x$, $\Delta y$ in the x- and y-directions are user-defined, but the step $\Delta z$ along the z-direction is typically chosen to coincide with the interfaces between successive layers (metal and insulator). The discretization procedure naturally covers multiple layers in the z-direction and can easily be extended to model heterogeneous structures found in modern chips (e.g., heat sinks).
 Formulation of equivalent circuit description: Using modified nodal analysis, the equivalent circuit is described by the ODE system (10).
 Construction of the preconditioner matrix: Following the algorithm described in the previous section, the preconditioner matrix is constructed based on [24], and the preconditioner-solve step is performed with the fast transform solver. More specifically, the thermal grid is equivalent to a highly regular resistive network, as depicted in Figure 2, with resistive branches connecting nodes along the x, y, and z axes. To create a preconditioner that approximates the grid matrix, we substitute each horizontal and vertical thermal conductance with its average value in the corresponding layer, and each thermal conductance connecting nodes in adjacent layers (z axis) with the average value between the two layers.
 Compute either the DC or the transient solution: In both cases, the solution is obtained with the iterative PCG method. In the transient case, the backward-Euler numerical integration method of (11) is employed to calculate the temperature at each discrete time instant. The convergence of the method is accelerated by the highly parallel fast transform solver used in the preconditioner-solve step of Algorithm 1.
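The solve step above can be illustrated with a textbook PCG loop that takes a pluggable preconditioner-solve hook. The `pcg` function and the Jacobi (diagonal) preconditioner used in the demonstration are illustrative placeholders; in the proposed method, the `precond` callback would be the fast-transform solver, and the system matrix would come from the MNA formulation of the thermal grid.

```python
# Textbook preconditioned conjugate gradient with a preconditioner hook.
import numpy as np

def pcg(matvec, b, precond, tol=1e-6, maxiter=1000):
    """PCG for an SPD system: matvec(v) applies the system matrix,
    precond(r) applies M^{-1} r (the preconditioner-solve step)."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):  # relative residual
            break
        z = precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Demonstration on a small SPD tridiagonal system (a toy stand-in for the
# thermal-grid matrix), with a Jacobi preconditioner as placeholder.
n = 200
A = (np.diag(np.full(n, 2.5))
     + np.diag(np.full(n - 1, -1.0), 1)
     + np.diag(np.full(n - 1, -1.0), -1))
b = np.ones(n)
x = pcg(lambda v: A @ v, b, lambda r: r / 2.5)
assert np.allclose(A @ x, b, atol=1e-4)
```

The same loop serves both analyses: the DC case solves a single system, while each backward-Euler time-step in the transient case solves one such system with an updated right-hand side.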
The proposed methodology offers significant advantages over established thermal simulation methods. First, iterative methods can handle large-scale problems, in contrast to direct methods, which do not scale well with the matrix dimension and can only be applied to a narrow range of problems. Furthermore, the fast preconditioned solution step exhibits near-optimal computational complexity, low memory requirements, and great potential for parallelism, which can harness the computational power of parallel architectures, such as multi-core processors or GPUs, further reducing the time required for simulation.
6. Experimental Results
Due to the lack of publicly available benchmarks for full-chip thermal analysis, and in order to evaluate the efficiency of the proposed methodology for thermal simulation, we created a set of artificial benchmark circuits that represent simplified microprocessor designs (such as MIPS and LEON) with random control logic and datapaths, based on the theory described in Section 3. The technology node used for this work was 32 nm; the linear set of equations can be formed analogously for different technology-specific parameters in Equations (7) and (8). Although the benchmarks were artificially created, a real design would give rise to a similar set of linear equations.
In Table 2, the discretization points are the number of discretization points along each axis, while the layers are the number of layers in each benchmark; the matrix dimension can be calculated by multiplying the square of the discretization points by the number of layers. All experiments were executed on a Linux workstation with an Intel Core i7 processor running at 2.4 GHz (six cores and 24 GB main memory) and an NVIDIA Tesla C2075 GPU with 6 GB of memory. We used the CUDA library [28] (Version 5.5, along with the CUBLAS, CUSPARSE, and CUFFT libraries) to map the proposed Fast-Transform PCG (FTPCG) algorithm onto the GPU. ICCG was executed on the CPU, since porting it to the GPU would not be beneficial due to its many irregular memory transfers. The convergence tolerance ($tol$) for the iterative solvers was set to $10^{-6}$, which typically suffices for highly accurate results, and convergence was achieved in all cases. Table 2 presents the results of evaluating the aforementioned methods on the set of benchmark circuits. Iter. is the number of iterations that each algorithm needed to converge, while Time (s) is the time in seconds needed to compute the final solution. Both the number of iterations and the time refer either to the DC solution of the system or to the average iterations and time per time instant in transient analysis.
Comparing the iterative methods, it can be observed that the proposed method greatly reduces the number of iterations required for convergence, as shown in Figure 3a. Unlike general-purpose preconditioning methods such as ICCG, the proposed preconditioners take into account the topological characteristics of the thermal grid. As a result, they approximate it faithfully enough to reduce the required number of iterations. Moreover, owing to their inherent parallelism, the proposed preconditioners can utilize the vast computational resources of massively-parallel architectures, such as GPUs. Their efficiency thus increases with circuit size, greatly reducing the runtime of each timestep, as depicted in Figure 3b. FTPCG achieved a speedup between 1.49× and 2.22× in CPU execution and between 15.72× and 26.93× in GPU execution over ICCG.
7. Conclusions
In this paper, we presented a fast thermal simulation method based on the thermal RC equivalent circuit, which combines fast transform solvers with preconditioned Krylov-subspace iterative methods. The preconditioned iterative solvers offer near-linear scaling of simulation time as the thermal grid grows. Experimental evaluation of the proposed method on a set of thermal benchmarks ranging in size from 0.15 M to 10 M nodes showed that, when GPUs are utilized, the proposed methodology achieves a speedup between 15.72× and 26.93× over a preconditioned iterative method with an incomplete Cholesky factorization preconditioner.
Author Contributions
Conceptualization, G.F. and N.E.; data curation, G.F.; funding acquisition, N.E. and G.S.; investigation, G.F.; methodology, G.F., K.D. and N.E.; project administration, N.E. and G.S.; software, K.D.; supervision, N.E. and G.S.; validation, G.F. and K.D.; writing, original draft, G.F.; writing, review and editing, G.F., K.D., N.E., and G.S.
Funding
This research received no external funding.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
EDA  Electronic Design Automation 
GPU  Graphic Processor Unit 
IC  Integrated Circuit 
SOI  Silicon on Insulator 
FDM  Finite Difference Method 
PCG  Preconditioned Conjugate Gradient 
ICCG  Incomplete Cholesky Conjugate Gradient 
FEM  Finite Element Method 
ADI  Alternating Direction Implicit 
NN  Neural Net 
LUT  Look Up Table 
MOR  Model Order Reduction 
GMRES  Generalized Minimal RESidual 
PDE  Partial Differential Equation 
LHS  Left-Hand Side 
MNA  Modified Nodal Analysis 
ODE  Ordinary Differential Equations 
SPD  Symmetric and Positive Definite 
CG  Conjugate Gradient 
RHS  Right-Hand Side 
DCT-II  Discrete Cosine Transform of Type-II 
IDCT-II  Inverse Discrete Cosine Transform of Type-II 
FFT  Fast Fourier Transform 
FTPCG  Fast Transform Preconditioned Conjugate Gradient 
References
 Waldrop, M.M. The Chips Are Down for Moore’s Law. Nat. News 2016, 530, 144–147. [Google Scholar] [CrossRef] [PubMed]
 Xu, C.; Kolluri, S.K.; Endo, K.; Banerjee, K. Analytical Thermal Model for Self-Heating in Advanced FinFET Devices with Implications for Design and Reliability. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2013, 32, 1045–1058. [Google Scholar]
 SIA. International Technology Roadmap for Semiconductors (ITRS) 2015 EditionERD; SIA: Washington, DC, USA, 2015. [Google Scholar]
 Pedram, M.; Nazarian, S. Thermal modeling, analysis, and management in VLSI circuits: Principles and methods. Proc. IEEE 2006, 94, 1487–1501. [Google Scholar] [CrossRef]
 Floros, G.; Daloukas, K.; Evmorfopoulos, N.; Stamoulis, G. A parallel iterative approach for efficient full chip thermal analysis. In Proceedings of the 7th International Conference on Modern Circuits and Systems Technologies (MOCAST), Thessaloniki, Greece, 7–9 May 2018; pp. 1–4. [Google Scholar]
 Li, P.; Pileggi, L.T.; Asheghi, M.; Chandra, R. Efficient full-chip thermal modeling and analysis. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design, San Jose, CA, USA, 7–11 November 2004; pp. 319–326. [Google Scholar]
 Li, P.; Pileggi, L.T.; Asheghi, M.; Chandra, R. IC thermal simulation and modeling via efficient multigrid-based approaches. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2006, 25, 1763–1776. [Google Scholar]
 Yang, Y.; Zhu, C.; Gu, Z.; Shang, L.; Dick, R.P. Adaptive multi-domain thermal modeling and analysis for integrated circuit synthesis and design. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design, San Jose, CA, USA, 5–9 November 2006; pp. 575–582. [Google Scholar]
 Yang, Y.; Gu, Z.; Zhu, C.; Dick, R.P.; Shang, L. ISAC: Integrated Space-and-Time-Adaptive Chip-Package Thermal Analysis. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2007, 26, 86–99. [Google Scholar] [CrossRef][Green Version]
 Wang, T.Y.; Chen, C.C.P. 3D Thermal-ADI: A linear-time chip level transient thermal simulator. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2002, 21, 1434–1445. [Google Scholar] [CrossRef]
 Sridhar, A.; Vincenzi, A.; Ruggiero, M.; Brunschwiler, T.; Atienza, D. 3D-ICE: Fast compact transient thermal modeling for 3D ICs with inter-tier liquid cooling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, USA, 7–11 November 2010; pp. 463–470. [Google Scholar]
 Sridhar, A.; Vincenzi, A.; Atienza, D.; Brunschwiler, T. 3D-ICE: A Compact Thermal Model for Early-Stage Design of Liquid-Cooled ICs. IEEE Trans. Comput. 2014, 63, 2576–2589. [Google Scholar] [CrossRef][Green Version]
 Ladenheim, S.; Chen, Y.C.; Mihajlovic, M.; Pavlidis, V. IC thermal analyzer for versatile 3D structures using multigrid preconditioned Krylov methods. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, Austin, TX, USA, 7–10 November 2016; pp. 1–8. [Google Scholar]
 Zhan, Y.; Sapatnekar, S.S. High-Efficiency Green Function-Based Thermal Simulation Algorithms. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2007, 26, 1661–1675. [Google Scholar] [CrossRef]
 Vincenzi, A.; Sridhar, A.; Ruggiero, M.; Atienza, D. Fast thermal simulation of 2D/3D integrated circuits exploiting neural networks and GPUs. In Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, Fukuoka, Japan, 1–3 August 2011; pp. 151–156. [Google Scholar]
 Lee, Y.M.; Pan, C.W.; Huang, P.Y.; Yang, C.P. LUTSim: A Look-Up Table-Based Thermal Simulator for 3D ICs. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2015, 34, 1250–1263. [Google Scholar]
 Wang, T.Y.; Chen, C.C.P. SPICE-compatible thermal simulation with lumped circuit modeling for thermal reliability analysis based on model order reduction. In Proceedings of the International Symposium on Signals, Circuits and Systems, San Jose, CA, USA, 22–24 March 2004; pp. 357–362. [Google Scholar]
 Floros, G.; Evmorfopoulos, N.; Stamoulis, G. Efficient Hotspot Thermal Simulation Via Low-Rank Model Order Reduction. In Proceedings of the 15th International Conference on Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD), Prague, Czech Republic, 2–5 July 2018; pp. 205–208. [Google Scholar]
 Liu, X.X.; Zhai, K.; Liu, Z.; He, K.; Tan, S.X.D.; Yu, W. Parallel Thermal Analysis of 3D Integrated Circuits With Liquid Cooling on CPU-GPU Platforms. IEEE Trans. Very Large Scale Integr. Syst. 2015, 23, 575–579. [Google Scholar]
 Özışık, M.N. Heat Transfer—A Basic Approach; McGraw-Hill: New York, NY, USA, 1985. [Google Scholar]
 Bergman, T.L.; Lavine, A.S.; Incropera, F.P.; DeWitt, D.P. Fundamentals of Heat and Mass Transfer; Wiley: New York, NY, USA, 2017. [Google Scholar]
 Barrett, R.; Berry, M.; Chan, T.; Demmel, J.; Donato, J.; Dongarra, J.; Eijkhout, V.; Pozo, R.; Romine, C.; van der Vorst, H. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd ed.; SIAM: Philadelphia, PA, USA, 1992. [Google Scholar]
 Axelsson, O.; Barker, V.A. Finite Element Solution of Boundary Value Problems: Theory and Computation; Academic Press: Cambridge, MA, USA, 1984. [Google Scholar]
 Daloukas, K.; Marnari, A.; Evmorfopoulos, N.; Tsompanopoulou, P.; Stamoulis, G.I. A parallel fast transform-based preconditioning approach for electrical-thermal co-simulation of power delivery networks. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 18–22 March 2013; pp. 1689–1694. [Google Scholar]
 Daloukas, K.; Evmorfopoulos, N.; Tsompanopoulou, P.; Stamoulis, G. Parallel Fast Transform-Based Preconditioners for Large-Scale Power Grid Analysis on Graphics Processing Units (GPUs). IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2016, 35, 1653–1666. [Google Scholar] [CrossRef]
 Christara, C.C. Quadratic Spline Collocation Methods for Elliptic Partial Differential Equations. BIT Numer. Math. 1994, 34, 33–61. [Google Scholar] [CrossRef]
 Van Loan, C. Computational Frameworks for the Fast Fourier Transform; SIAM: Philadelphia, PA, USA, 1992. [Google Scholar]
 NVIDIA CUDA Programming Guide, CUSPARSE, CUBLAS, and CUFFT Library User Guides. Available online: http://developer.nvidia.com/nvidia-gpu-computing-documentation (accessed on 19 December 2018).
Figure 1.
Spatial discretization of a chip for thermal analysis and the formulation of the electrical equivalent problem.
Figure 3.
(a) Average number of iterations, (b) Average runtimes for each timestep in each benchmark.
Electrical Circuit  Thermal Circuit 

Voltage  Temperature 
Current  Heat Flow 
Electrical Conductance  Thermal Conductance 
Electrical Resistance  Thermal Resistance 
Electrical Capacitance  Thermal Capacitance 
Current Source  Heat Source 
Table 2.
Runtime results for the three solver configurations. Bench. is the name of the benchmark, Discr. Points is the number of discretization points along the x and y axes, Layers is the number of layers of each chip (corresponding to the z axis), Time (s) denotes the average time required to compute the solution, Iter. is the average number of iterations required for convergence of each iterative method, and Speedup denotes the speedup of our method over ICCG.
Bench.  Discr. Points  Layers  ICCG  FTPCG (CPU)  FTPCG (GPU)  

Iter.  Time (s)  Iter.  Time (s)  Speedup  Iter.  Time (s)  Speedup  
ckt1  175  5  48  0.48  12  0.31  1.54×  11  0.03  16× 
ckt2  320  5  57  1.98  15  1.23  1.6×  12  0.08  24.75× 
ckt3  410  6  58  4.20  16  2.64  1.59×  12  0.23  18.26× 
ckt4  500  7  67  8.35  17  4.51  1.85×  12  0.31  26.93× 
ckt5  845  7  58  21.07  12  9.48  2.22×  11  1.34  15.72× 
ckt6  946  7  59  28.27  16  17.94  1.57×  11  1.60  17.65× 
ckt7  1118  8  68  49.35  17  33.07  1.49×  12  2.39  20.64× 
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).