Next Article in Journal
RabbitRun: An Immersive Virtual Reality Game for Promoting Physical Activities Among People with Low Back Pain
Next Article in Special Issue
Improved Parallel Legalization Schemes for Standard Cell Placement with Obstacles
Previous Article in Journal / Special Issue
Development of a Transmission Line Model for the Thickness Prediction of Thin Films via the Infrared Interference Method
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

A Preconditioned Iterative Approach for Efficient Full Chip Thermal Analysis on Massively Parallel Platforms †

by
George Floros
1,*,
Konstantis Daloukas
1,2,
Nestor Evmorfopoulos
1 and
George Stamoulis
1
1
Department of Electrical & Computer Engineering, University of Thessaly, 38221 Volos, Greece
2
Helic Inc., 2350 Mission College Boulevard, Suite 495, Santa Clara, CA 95054, USA
*
Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in the Proceedings of the 7th International Conference on Modern Circuit and System Technologies on Electronics and Communications (MOCAST 2018), Thessaloniki, Greece, 7–9 May 2018.
Technologies 2019, 7(1), 1; https://doi.org/10.3390/technologies7010001
Submission received: 1 November 2018 / Revised: 12 December 2018 / Accepted: 17 December 2018 / Published: 20 December 2018
(This article belongs to the Special Issue Modern Circuits and Systems Technologies on Electronics)

Abstract

:
Efficient full-chip thermal simulation is among the most challenging problems facing the EDA industry today, especially for modern 3D integrated circuits, due to the huge linear systems resulting from thermal modeling approaches that require unreasonably long computational times. While the formulation problem, by applying a thermal equivalent circuit, is prevalent and can be easily constructed, the corresponding 3D equations network has an undesirable time-consuming numerical simulation. Direct linear solvers are not capable of handling such huge problems, and iterative methods are the only feasible approach. In this paper, we propose a computationally-efficient iterative method with a parallel preconditioned technique that exploits the resources of massively-parallel architectures such as Graphic Processor Units (GPUs). Experimental results demonstrate that the proposed method achieves a speedup of 2.2× in CPU execution and a 26.93× speedup in GPU execution over the state-of-the-art iterative method.

1. Introduction

The evolution of the manufacturing technology of Integrated Circuits (ICs) has continued unabated over the past fifty years, according to the predictions of Moore’s law and has led to extremely complex circuits (modern processors contain several billion transistors and are easily the most complex human construction), but also to analogous escalation of the problems related to the analysis and simulation of such circuits. Therefore, thermal analysis is one of the most critical challenges arising from the technological evolution. The continuous effort for smaller sizes, in the sub-45-nm era, and greater performance, as well as the new 3D structures have begun to outpace the ability of heat sinks to dissipate the on-chip power.
In particular, aggravation of thermal effects is an inevitable consequence of the continuous scaling trend. High temperature has a significant impact on chip performance and functionality, leading to slower transistor speed, more leakage power, higher interconnect resistance, and reduced reliability [1]. The problem becomes more pronounced in modern technologies due to the multilayer 3D stacking, and the use of new device technologies, like FinFETsand Silicon on Insulator (SOI), which are more sensitive to the self-heating effect [2]. Furthermore, as heat generation is nonuniform, local hotspots and spatial gradients appear. Stacking multiple layers in a 3D chip promises density and performance enhancement. However, it requires extensive thermal analysis as the power density and temperature of these architectures can be quite high. For the above reasons, full chip thermal analysis is a vital, but extremely difficult problem due to the size of the systems that need to be solved for multiple time points and remains a key issue for future microprocessors and ICs [3,4]. Due to this fact, IC thermal analysis problems have drawn considerable attention over the past two decades. To deal with these challenges, prior approaches have focused on the formulation of the problem and the fast steady-state and transient thermal simulation in order to compute the temperature across the whole chip.
Direct methods (based on matrix factorization) have been widely used in the past for solving the resulting linear systems, mainly because of their robustness in most types of problems. Unfortunately, these methods do not scale well with the dimension of the linear system and are prohibitively expensive, while the thermal problems are becoming larger, in both execution time and memory requirements. On the other hand, iterative Krylov-subspace methods such as Conjugate Gradients (CG) involve only inner products and matrix-vector products and constitute a better alternative for large sparse linear systems in many respects, being more computationally- and memory-efficient.
Moving beyond conventional direct solvers, our early work in [5] introduced an approach for full chip thermal analysis that is based on the Finite Difference Method (FDM) or the formulation of an RC equivalent electrical network in conjunction with a highly parallel iterative Krylov-subspace Preconditioned Conjugate Gradient (PCG) method, which overcomes the computational demands for the very large systems arising from the thermal modeling. In particular, the contributions of this paper to the problem of thermal analysis are:
  • Accelerated solution of thermal grids: The proposed thermal simulator uses FDM with preconditioned CG, which is well-suited, offers faster solution times, and uses less memory than sparse direct solvers.
  • Highly parallel preconditioned mechanism: The specialized structures of thermal grids allow the usage of fast transform solvers as a preconditioned mechanism in the CG method, which are highly parallel and can be easily ported to GPUs.
  • Fast convergence to the solution: Fast transform solvers can handle different matrix blocks, which offers a good preconditioner approximation. This results in considerably more accurate preconditioners that can approximate thermal grids and make CG converge to the final solution in a few iterations.
Experimental results demonstrate that our method achieves speedups of around 20× on GPU and around 2× on CPU for a 10 M node thermal grid over a state-of-the-art iterative method, like Incomplete Cholesky Preconditioned Conjugate Gradient (ICCG) on a CPU.
The rest of the paper, is organized as follows. Section 2 describes the related work on the thermal simulation problem. Section 3 introduces the thermal model that was used in the present work. Section 4 provides a brief description of the 3D fast transform solver. Section 5 describes the proposed approach, combining the methods presented in the two previous sections. Finally, Section 6 presents the results and a discussion about the advantages of the method, followed by the conclusions in Section 7.

2. Related Work

The growing need to simulate large-scale thermal models in technology nodes below 45 nm has led to some important research in the fast thermal estimation of the IC chips. In this section, we briefly review some of these methods. Most transient thermal analysis methodologies have so far relied on solving the entire system, using different modeling techniques, based mainly on the Finite Element Method (FEM), the Finite Difference Method (FDM), and Green’s functions. The research work in [6,7] adopted the FDM method, with a multigrid approach in order to speed up the simulation process, and the FDM method with temporal and spatial adaptation to further accelerate thermal analysis was proposed in [8,9]. Similarly, in [10], the full-chip thermal transient equations were solved in a similar manner using an Alternating Direction Implicit (ADI) method for enhanced computational efficiency. Furthermore, in [11,12], the FDM approach and the RCequivalent were used along with modeling of the fluids for micro-cooling 3D structures. In [13], FEM was adopted for 2D and 3D geometries along with a multigrid preconditioning method and automatic mesh generation for chip geometries. Finally, Green’s functions were used in [14] with discrete cosine transform and its inversion in order to accelerate the numerical computation of the homogeneous and inhomogeneous solution. However, these method are efficient for a limited range of problems, since they have limited potential for parallelism.
Besides the previous conventional approaches, different methods like a Neural Net (NN) approach were used in [15], but since it was based in predictions, it did not always provide an accurate solution to the crucial problem of thermal analysis. Moreover, a Look Up Table (LUT) method based on the power thermal relation, which develops a double-mesh scheme to capture thermal characteristics and store the results in library files, was presented in [16]. However, currently, chips can lead to huge library files due to the highly complex combined heat maps. Furthermore, the reduction of the problem size, through a Model Order Reduction (MOR) process, was proposed in [17,18], which can be useful in addressing the performance of individual devices, but in some cases, it is not enough to address all the reliability issues.
Finally, the authors in [19] provided a parallel iterative Generalized Minimal RESidual (GMRES) method for FDM and micro-cooling problems, but without any special preconditioning approach. Clearly, the concept of a dedicated fully-parallel preconditioned technique has not yet been introduced in the context of transient thermal analysis.

3. On-Chip Thermal Modeling and Analysis

There are three modes of heat transfer: conduction, convection, and radiation. The primary mechanism of heat transfer in solids is by conduction, and the others can be neglected. The starting point for thermal analysis is Fourier’s law of heat conduction [20]:
q ( r , t ) = k t T ( r , t )
which states that the vector of heat flux density q (heat flow per unit area and unit time) is proportional to the negative gradient of temperature T at every spacial point r = [ x , y , z ] T and time t, where k t is the thermal conductivity of the material.
The conservation of energy also states that the divergence of the heat flux q equals the difference between the power generated by external heat sources and the rate of change of temperature, i.e.,
· q ( r , t ) = g ( r , t ) ρ c p T ( r , t ) t
where g ( r , t ) is the power density of the heat sources, c p is the specific heat capacity of the material, and ρ is the density of the material. By combining (1) and (2), we have:
k t 2 T ( r , t ) = g ( r , t ) ρ c p T ( r , t ) t
which may be rewritten as the following parabolic Partial Differential Equation (PDE):
ρ c p T ( r , t ) t = k t 2 T ( r , t ) + g ( r , t ) = k t ( 2 T ( r , t ) x 2 + 2 T ( r , t ) y 2 + 2 T ( r , t ) z 2 ) + g ( r , t )
(normally accompanied by appropriate boundary conditions [21]).
A common procedure for the numerical solution of (4) is by discretization along the three spatial coordinates with steps Δ x , Δ y , and Δ z and substitution of the spatial second-order derivatives by finite difference approximations, leading to the following expression for temperature T i , j , k at each discrete point ( i , j , k ) in relation to its neighboring points:
ρ c p d T i , j , k d t = k t T i + 1 , j , k 2 T i , j , k + T i 1 , j , k Δ x 2 + k t T i , j + 1 , k 2 T i , j , k + T i , j 1 , k Δ y 2 + k t T i , j , k + 1 2 T i , j , k + T i , j , k 1 Δ z 2 + g i , j , k
or by multiplying by Δ x Δ y Δ z :
ρ c p ( Δ x Δ y Δ z ) d T i , j , k d t k t Δ y Δ z Δ x ( T i + 1 , j , k 2 T i , j , k + T i 1 , j , k ) k t Δ x Δ z Δ y ( T i , j + 1 , k 2 T i , j , k + T i , j 1 , k ) k t Δ x Δ y Δ z ( T i , j , k + 1 2 T i , j , k + T i , j , k 1 ) = g i , j , k ( Δ x Δ y Δ z )
There is a well-known analogy between thermal and electrical conduction, where temperature corresponds to voltage and heat flow corresponds to current (see Table 1).
In light of this analogy, Equation (6) has a direct correspondence to an electrical circuit where there is a node at every discrete point or cell in the thermal grid (see Figure 1). Every circuit node is connected to spatially-neighboring nodes via conductances in the directions x, y, z with values:
G x k t Δ y Δ z Δ x , G y k t Δ x Δ z Δ y , G z k t Δ x Δ y Δ z
and there is a capacitance to ground at every node or thermal cell with value:
C ρ c p ( Δ x Δ y Δ z )
The heat sources constitute input excitations and are modeled in the equivalent circuit as the current sources with values:
I i , j , k g i , j , k ( Δ x Δ y Δ z )
The above current sources are connected at the specific points ( i , j , k ) or circuit nodes where there is heat flow (i.e., power dissipation from the underlying chip logic blocks).
The resulting electrical equivalent circuit is described in the time domain, using the Modified Nodal Analysis (MNA) framework, by a system of Ordinary Differential Equations (ODE):
G x ( t ) + C d x ( t ) d t = u ( t )
where G R n × n is a symmetric and positive definite matrix of the conductances (7), C R n × n is a diagonal matrix of cell capacitances (8), x R n is the vector of unknown temperatures T i , j , k at all discretization points (constituting internal states of the system), and u R p is the vector of input excitations from the current sources I i , j , k of (9).
For transient simulation, we can discretize the time interval into time instants t k , k = 1 , 2 , and use the backward-Euler numerical integration method for the calculation of temperature in each discrete time instant t k :
( G + C h k ) x ( t k ) = C h k x ( t k 1 ) + u ( t k )
where h k = t k t k 1 , k = 1 , 2 , is the time step of time t k (which may in general vary during transient analysis). The above equation involves the solution of a very large sparse linear system in each time instant t k .
Direct methods (based on matrix factorization) have been widely used in the past for solving the resulting linear systems, mainly because of their robustness in most types of problems. Unfortunately, these methods do not scale well with the dimension of the linear system and become prohibitively expensive for large-scale networks. Iterative methods involve only inner products and matrix-vector products and constitute a better alternative for large sparse linear systems in many respects, being more computationally and memory efficient. In this work, we employ iterative methods for the solution of large-scale networks arising from the 3D discretization of the chip for thermal analysis. The system matrices arising from the modeling of the thermal grid can also be shown to be Symmetric and Positive Definite (SPD), which allows the use of the efficient method of the Conjugate Gradient (CG) for the solution of the corresponding linear systems. The CG method is well known [22], and its implementation is shown in Algorithm 1.
Algorithm 1 Preconditioned conjugate gradient.
1:
x = initial guess x ( 0 )
2:
r = b A x
3:
i t e r = 0
4:
repeat
5:
     i t e r = i t e r + 1
6:
    Solve M z = r    (Preconditioner-Solve Step)
7:
     ρ = r · z
8:
    if i t e r = = 1 then
9:
         p = z
10:
    else
11:
         β = ρ / ρ 1
12:
         p = z + β p
13:
    end if
14:
     ρ 1 = ρ
15:
     q = A p
16:
     α = ρ / ( p · q )
17:
     x = x + α p
18:
     r = r α q
19:
until | | b Ax | | | | b | | < t o l
Regarding the convergence rate of CG, it can be shown [23] that the required number of iterations (for a given initial guess and convergence tolerance) is bounded in terms of the spectral condition number k 2 ( A ) = | | A | | 2 | | A 1 | | 2 1 ; specifically, it is O ( k 2 ( A ) ) , which for SPD matrices becomes k 2 ( A ) = λ m a x ( A ) λ m i n ( A ) where λ m a x ( A ) , λ m i n ( A ) are the maximum and minimum eigenvalues of A , respectively. This means that convergence of CG is fast when k 2 ( A ) 1 and slow when k 2 ( A ) > > 1 .
To improve the convergence speed, it is necessary to apply a preconditioning mechanism, which transforms the initial linear system into an equivalent one with a more favorable spectral condition number. The so-called preconditioner is a matrix M that approximates A in some way, such that the transformed system M 1 Ax = M 1 b (which obviously has the same solution as the initial Ax = b ) exhibits condition number k 2 ( M 1 A ) = k 2 ( I ) = 1 . In practice, it is not necessary to invert the preconditioner M and apply it directly at the system Ax = b . It can be shown that the same thing can be accomplished by introducing an extra computational step within the iterative method, which entails solving a system Mz = r with known Right-Hand Side (RHS) vector r and unknown vector z in every iteration [23].
From the above, it follows that a good preconditioner M must satisfy two key properties:
  • The fast convergence rate of the preconditioned.
  • A linear system involving M is solved much more efficiently than the original system that involves  A .
where “more efficiently” can mean with less asymptotic complexity—ideally, an optimal or near-optimal complexity of O ( N ) or O ( N l o g N ) —and/or significantly more parallelism in the solution procedure. If the preconditioner is faithful enough to reduce the iterations substantially, then the whole burden of the algorithm is transferred to the preconditioner-solve step Mz = r . The next section will describe the proposed form of the preconditioner matrices for 3D thermal networks, as well as the solution of the corresponding linear systems via two series of fast transforms and inverse fast transforms.

4. Fast Transform Preconditioners for 3D Thermal Networks

Recent implementations of fast transform solvers have shown great potential for the solution of block-tridiagonal systems with a special structure [24,25]. This section describes such an algorithm for the solution of an appropriate preconditioner system M z = r by the use of a fast transform solver in a near-optimal number of operations.
Let M be an N × N block-tridiagonal matrix with l diagonal blocks of size m n × m n each (overall N = l m n ), where l is very small (typically 5–8 depending on the material layers (metal and insulator) of the chip), with the following form:
M = M 1 δ 1 I m n δ 1 I m n M 2 δ 2 I m n · · · δ l 2 I m n M l 1 δ l 1 I m n δ l 1 I m n M l
where I mn is the m n × m n identity matrix and M i , i = 1 , , l , are block tridiagonal m n × m n matrices of the form:
M i = T i + γ i I n γ i I n γ i I n T i + 2 γ i I n γ i I n · · · γ i I n T i + γ i I n γ i I n γ i I n T i + γ i I n
where I n is the n × n identity matrix and T i , i = 1 , , m are n × n tridiagonal matrices with the following form:
T i = α i + β i α i α i 2 α i + β i α i · · · α i 2 α i + β i α i α i α i + β i =
α i 1 1 1 2 1 · · · 1 2 1 1 1 + β i I
This class of tridiagonal matrices has a beforehand known eigen-decomposition. Specifically, it can be shown [26] that each T i has n distinct eigenvalues λ i , j , j = 1 , , n , which are given by:
λ i , j = β i + 4 α i s i n 2 ( ( j 1 ) π 2 n ) = β i + α i ( 2 c o s ( ( j 1 ) π n ) 2 )
and a set of n orthonormal eigenvectors q j , j = 1 , , n , with elements:
q j , k = 1 n cos ( 2 k 1 ) ( j 1 ) π 2 n , j = 1 , k = 1 , , n 2 n cos ( 2 k 1 ) ( j 1 ) π 2 n , j = 2 , , n , k = 1 , , n
Note that, the eigenvectors do not depend on the values α i and β i and are the same for every matrix T i . If Q n = [ q 1 , , q n ] denotes the matrix whose columns are the eigenvectors q j , then due to the eigen-decomposition of T i , we have Q n T T i Q n = Λ i = d i a g ( λ i , 1 , , λ i , n ) . By exploiting the diagonalization of the matrix T i and considering that Q n T Q n = I , the system Mz = r is equivalent to the following system:
Q n T Q n T M Q n Q n Q n T Q n T z = Q n T Q n T r
M ˜ 1 δ 1 I m n δ 1 I m n M ˜ 2 δ 2 I m n · · · δ l 2 I m n M ˜ l 1 δ l 1 I m n δ l 1 I m n M ˜ l z ˜ = r ˜
where:
M ˜ i = Λ i ( 1 ) γ i I γ i I Λ i ( 2 ) γ i I · · · γ i I Λ i ( 2 ) γ i I γ i I Λ i ( 1 )
z ˜ = Q n T Q n T z , r ˜ = Q n T Q n T r
and Λ i ( 1 ) = d i a g ( λ i , 1 ( 1 ) , , λ i , n ( 1 ) ) and Λ i ( 2 ) = d i a g ( λ i , 1 ( 2 ) , , λ i , n ( 2 ) ) are diagonal matrices with the eigenvalues of T i + γ i I n , T i + 2 γ i I n , which are the following:
λ i , j ( 1 ) = γ i + β i + α i ( 2 c o s ( ( j 1 ) π n ) 2 ) j = 1 , , n , λ i , j ( 2 ) = 2 γ i + β i + α i ( 2 c o s ( ( j 1 ) π n ) 2 ) j = 1 , , n ,
If the N × 1 vectors r , z , r ˜ , z ˜ are also partitioned into m blocks of size n × 1 each, i.e.,
r = r 1 r m , z = z 1 z m , r ˜ = r ˜ 1 r ˜ m , z ˜ = z ˜ 1 z ˜ m
then we have: r ˜ i = Q n T r i and z ˜ i = Q n T z i z i = Q n z ˜ i , i = 1 , , m .
However, it can be shown [27] that each product Q n T r i = r ˜ i corresponds to a Discrete Cosine Transform of Type-II (DCT-II) on r i , and each product Q n z ˜ i = z i corresponds to an Inverse Discrete Cosine Transform of Type-II (IDCT-II) on z ˜ i . This means that the computation of the whole vector r ˜ from r amounts to m independent DCT-II transforms of size n, and the computation of the whole vector z from z ˜ amounts to m independent IDCT-II transforms of size n. A modification of the Fast Fourier Transform (FFT) can be employed for each of the l m independent DCT-II/IDCT-II transforms [27], giving a total near-optimal operation count of O ( m n log n ) = O ( N log n ) .
If now, P is a permutation matrix of size m n × m n that reorders the elements of a vector or the rows of a matrix as 1 , n + 1 , , ( m 1 ) n + 1 , 2 , n + 2 , , ( m 1 ) n + 2 , , n , n + n , , ( m 1 ) n + n and P 1 , P 1 T are the block-diagonal l m n × l m n permutation matrices P 1 = d i a g ( P , , P ) , P 1 T = d i a g ( P T , , P T ) , then the system at (18) is transformed into:
P 1 M ˜ 1 δ 1 I m n δ 1 I m n M ˜ 2 δ 2 I m n · · · δ l 2 I m n M ˜ l 1 δ l 1 I m n δ l 1 I m n M ˜ l P 1 T P 1 z ˜ = P 1 r ˜
D 1 δ 1 I m n δ 1 I m n D 2 δ 2 I m n · · · δ l 2 I m n D l 1 δ l 1 I m n δ l 1 I m n D l z ˜ P 1 = r ˜ P 1
where D 1 = d i a g ( T ˜ i , 1 , , T ˜ i , n ) , i = 1 , , l , with T ˜ i , j , j = 1 , , n being m × m tridiagonal matrices of the form:
T ˜ i , j = λ i , j ( 1 ) γ i γ i λ i , j ( 2 ) γ i · · · γ i λ i , j ( 2 ) γ i γ i λ i , j ( 1 ) =
γ i 1 1 1 2 1 · · · 1 2 1 1 1 + ( β i + α i ( 2 c o s ( ( j 1 ) π n ) 2 ) ) I m
and z ˜ P 1 = P 1 z ˜ , r ˜ P 1 = P 1 r ˜ . If Λ ˜ i , j = d i a g ( λ ˜ i , j , 1 , , λ ˜ i , j , m ) is the diagonal matrix with the eigenvalues of T ˜ i , j , which are:
λ ˜ i , j , k = γ i ( 2 c o s ( ( k 1 ) π n ) 2 ) + β i + α i ( 2 c o s ( ( j 1 ) π n ) 2 ) , k = 1 , , m
and Q m is the common matrix of eigenvectors for all T ˜ i , j , then by similar reasoning as in (17), the system (21) is equivalent to:
D ˜ 1 δ 1 I m n δ 1 I m n D ˜ 2 δ 2 I m n · · · δ l 2 I m n D ˜ l 1 δ l 1 I m n δ l 1 I m n D ˜ l z ˜ ˜ = r ˜ ˜
where D ˜ i = d i a g ( Λ ˜ i , 1 , , Λ ˜ i , n ) and:
z ˜ ˜ = Q m T Q m T z ˜ 1 P , r ˜ ˜ = Q m T Q m T r ˜ 1 P
In a similar way as previously, the N × 1 vectors z ˜ P 1 , r ˜ P 1 , z ˜ ˜ , and r ˜ ˜ can be partitioned into l n sub-vectors of size m × 1 each, and the DCT-II and IDCT-II are performed accordingly, giving a total near-optimal operation count of O ( l m n log n ) = O ( N log n ) .
If now, P 2 is a permutation matrix of size N × N that reorders the elements of a vector or the rows of a matrix as 1 , m n + 1 , 2 m n + 1 , , ( l 1 ) m n + 1 , 2 , m n + 2 , 2 m n + 2 , , ( l 1 ) m n + 2 , , m n , m n + m n , 2 m n + m n , , ( l 1 ) m n + m n , and P 2 T is the inverse permutation matrix, then System (24) is equivalent to:
M ˜ ˜ z ˜ ˜ P 2 = r ˜ ˜ P 2
where M ˜ ˜ = d i a g ( T ˜ ˜ 1 , 1 , T ˜ ˜ 1 , 2 , , T ˜ ˜ 1 , m , T ˜ ˜ 2 , 1 , T ˜ ˜ 2 , 2 , , T ˜ ˜ 2 , m , T ˜ ˜ n , m ) , with T ˜ ˜ j , k , j = 1 , , n , k = 1 , , m being l × l tridiagonal matrices of the form:
T ˜ ˜ j , k = λ ˜ 1 , i , k δ 1 δ 1 λ ˜ 2 , i , k δ 2 · · · δ l 1 λ ˜ l 1 , i , k δ l δ l λ ˜ l , i , k
and z ˜ ˜ P 2 = P 2 z ˜ ˜ , r ˜ ˜ P 2 = P 2 r ˜ ˜ .
Taking into account the above equations and by applying permutation matrices in order to reorder the elements of the Mz = r system, a fast solution for the preconditioner solve step can be obtained as shown in Algorithm 2.
Algorithm 2 Preconditioner solution for the thermal grid.
1:
Partition vector r into l m sub-vectors r i of size n, and perform DCT-II transform ( Q n T r i ) on each sub-vector to obtain transformed vector r ˜ .
2:
Partition vector r ˜ into l sub-vectors r i ˜ of size m n , and permute each subvector by permutation P , which orders elements as 1 , n + 1 , , ( m 1 ) n + 1 , 2 , n + 2 , , ( m 1 ) n + 2 , , n , n + n , , ( m 1 ) n + n , in order to obtain vector r ˜ P 1 .
3:
Partition vector r ˜ P 1 into ln sub-vectors r i ˜ P 1 of size m, and perform DCT-II transform ( Q m T r i ˜ P 1 ) on each sub-vector to obtain transformed vector r ˜ ˜ .
4:
Permute vector r ˜ ˜ by applying permutation P 2 , which orders elements as 1 , m n + 1 , 2 m n + 1 , , ( l 1 ) m n + 1 , 2 , m n + 2 , 2 m n + 2 , , ( l 1 ) m n + 2 , , m n , m n + m n , 2 m n + m n , , ( l 1 ) m n + m n , in order to obtain vector r ˜ ˜ P 2 .
5:
Calculate elements of matrices T ˜ n , m , and solve the m n tridiagonal systems, in order to obtain vector  z ˜ ˜ P 2 .
6:
Apply inverse permutation P 2 T on vector z ˜ ˜ P 2 so as to obtain vector z ˜ ˜ .
7:
Partition vector z ˜ ˜ into l n sub-vectors z i ˜ ˜ of size m, and perform IDCT-II transform ( Q m z i ˜ ˜ ) on each sub-vector to obtain vector z ˜ P 1 .
8:
Partition vector z ˜ P 1 into l sub-vectors z i ˜ P 1 of size m n , and apply inverse permutation P T on each sub-vector to obtain vector z ˜ .
9:
Partition vector z ˜ into l m sub-vectors z i ˜ of size n, and perform IDCT-II transform ( Q n z i ) on each sub-vector to obtain final solution vector z .

5. Methodology for Full Chip Thermal Analysis

This section applies the theoretical background that was analyzed before for the computation of the temperatures across the chip. The complete methodology consists of the following steps:
  • 3D discretization of the chip: The spatial steps Δ x , Δ y in the x- and y-direction are user defined, but the step Δ z along the z-direction is typically chosen to coincide with the interface between successive layers (metal and insulator). The discretization procedure naturally covers multiple layers in the z-direction and can be easily extended to model heterogeneous structures that can be found in modern chips (e.g., heat sinks).
  • Construction of equivalent electrical circuit: The RC elements of the electrical equivalent are calculated by (7) and (8).
  • Estimation of the power consumption profile of chip logic blocks. This determines the location and the time behavior of heat sources and the value of current sources (9) that constitute the vector u ( t ) in (10).
  • Formulation of equivalent circuit description: Using modified nodal analysis, the equivalent circuit is described by the ODE system (10).
  • Construction of the preconditioner matrix: Based on the algorithm described in the previous section, the preconditioner matrix is constructed based on [24], and the preconditioner-solve step is performed with the fast transform solver. More specifically, the thermal grid is equivalent to a highly regular resistive network, as depicted in Figure 2, with resistive branches connecting nodes in the x, y, and z axis. To create a preconditioner that will approximate the grid matrix, we substitute each horizontal and vertical thermal conductance with its average value in the corresponding layer. Moreover, we substitute each thermal conductance connecting nodes in adjacent layers (z axis) with their average value between the two layers.
  • Compute either the DC or transient solution: The solution is obtained with the iterative PCG method in both cases. Note that in the case of the transient solution, the backward-Euler numerical integration method as in (11) is employed for the calculation of temperature in each discrete time instant. The convergence of the method is accelerated with the highly parallel fast transform solver that is used in preconditioner-solve step in Algorithm 1.
The proposed methodology offers significant advantages over the established thermal simulation methods. Firstly, iterative methods can handle large-scale problems in contrast to direct methods that do not scale well with the matrix dimension and can only be applied to a narrow range of problems. Furthermore, the fast preconditioned solution step exhibits near-optimal computational complexity, low memory requirements, and great potential for parallelism, which can harness the computational power of parallel architectures, such as multicore processors or GPUs, thus further reducing the amount of time required for simulation.

6. Experimental Results

Due to the lack of availability of benchmarks for full chip thermal analysis and in order to evaluate the efficiency of the proposed methodology for thermal simulation, we have created a set of artificial benchmark circuits that represent simplified microprocessor designs (like MIPS and LEON) with a random control logic and datapath, based on the theory described in Section 3. The technology node that was used for this work was 32 nm. Similarly, one can form the linear set of equations for different technology-specific parameters of Equations (7) and (8). Despite the fact that the benchmarks were artificially created, the problems that would arise from a real design would be represented by a similar set of linear equations.
In Table 2 discretization points are the number of discretizations in each axis, while layers are the number of layers that each benchmark has. Moreover, the matrix dimensions can be calculated by multiplying the square of the discretization points by the number of layers. All experiments were executed on a Linux workstation, comprised of an Intel Core i7 processor running at 2.4 GHz (six cores and 24 GB main memory) and an NVIDIA Tesla C2075 GPU with 6 GB of main memory. We have used the CUDA library [28] (Version 5.5, along with the CUBLAS, CUSPARSE, and CUFFT libraries) for mapping the proposed Fast-Transform PCG (FT-PCG) algorithm on the GPU. The ICCG was executed on the CPU since it would not be beneficial to port it on the GPU due many irregular memory transfers. The convergence tolerance ( t o l ) for iterative solvers was set to 10 6 , which is typically sufficient to yield perfect accuracy, and convergence was achieved in all cases. Table 2 presents the results from the evaluation of the aforementioned methods on the set of benchmark circuits. The number Iter. is the number of iterations that the algorithm needed to converge, while Time (s) refers to the time in seconds that were needed to compute the final solution. Both the number of iterations and time represent either the DC solution of the system or the average iterations and time of a time instant in transient analysis.
Comparing the iterative methods, it can be observed that the proposed method was able to reduce the number of iterations required for convergence greatly, as shown in Figure 3a. Compared with general purpose preconditioning methods such as ICCG, the proposed preconditioners take into account the topology characteristics of the thermal grid. As a result, they are able to approximate it faithfully enough and reduce the required number of iterations. Moreover, owing to their inherent parallelism, the proposed preconditioners can utilize the vast amount of computational resources found in massively-parallel architectures, such as GPUs. Thus, their efficiency is increased with the increasing circuit size, by greatly reducing the runtime for each time-step, as depicted in Figure 3b. FT-PCG was able to achieve a speed-up ranging between 1.5× and 2.2× in CPU execution and 16× and 26.93× in GPU execution over ICCG.

7. Conclusions

In this paper, we presented a fast thermal simulation method based on the RC equivalent, which uses fast transform solvers and preconditioned Krylov-subspace iterative solvers. The preconditioned iterative solvers offer linear scaling in simulation time as the thermal grid is increased. Experimental evaluation of the proposed method on a set of thermal benchmarks with the size ranging from 0.15 M–10 M nodes showed that the proposed methodology achieved a speedup ranging between 15.72× and 26.93× over a preconditioned iterative method with an incomplete Cholesky factorization preconditioner when GPUs are utilized.

Author Contributions

Conceptualization, G.F. and N.E.; data curation, G.F.; funding acquisition, N.E. and G.S.; investigation, G.F.; methodology, G.F., K.D. and N.E.; project administration, N.E. and G.S.; software, K.D.; supervision, N.E. and G.S.; validation, G.F. and K.D.; writing, original draft, G.F.; writing, review and editing, G.F., K.D., N.E., and G.S.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
EDAElectronic Design Automation
GPUGraphic Processor Unit
ICIntegrated Circuit
SOISilicon on Insulator
FDMFinite Difference Method
PCGPreconditioned Conjugate Gradient
ICCGIncomplete Cholesky Conjugate Gradient
FEMFinite Element Method
ADIAlternating Direction Implicit
NNNeural Net
LUTLook Up Table
MORModel Order Reduction
GMRESGeneralized Minimal RESidual
PDEPartial Differential Equation
LHSLeft-Hand Side
MNAModified Nodal Analysis
ODEOrdinary Differential Equations
SPDSymmetric and Positive Definite
CGConjugate Gradient
RHSRight-Hand Side
DCT-IIDiscrete Cosine Transform of Type-II
IDCT-IIInverse Discrete Cosine Transform of Type-II
FFTFast Fourier Transform
FT-PCGFast Transform Preconditioned Conjugate Gradient

References

  1. Waldrop, M.M. The Chips Are Down for Moore’s Law. Nat. News 2016, 530, 144–147. [Google Scholar] [CrossRef] [PubMed]
  2. Xu, C.; Kolluri, S.K.; Endo, K.; Banerjee, K. Analytical Thermal Model for Self-Heating in Advanced FinFET Devices with Implications for Design and Reliability. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2013, 32, 1045–1058. [Google Scholar]
  3. SIA. International Technology Roadmap for Semiconductors (ITRS) 2015 Edition-ERD; SIA: Washington, DC, USA, 2015. [Google Scholar]
  4. Pedram, M.; Nazarian, S. Thermal modeling, analysis, and management in VLSI circuits: Principles and methods. Proc. IEEE 2006, 94, 1487–1501. [Google Scholar] [CrossRef]
  5. Floros, G.; Daloukas, K.; Evmorfopoulos, N.; Stamoulis, G. A parallel iterative approach for efficient full chip thermal analysis. In Proceedings of the 7th International Conference on Modern Circuits and Systems Technologies (MOCAST), Thessaloniki, Greece, 7–9 May 2018; pp. 1–4. [Google Scholar]
  6. Li, P.; Pileggi, L.T.; Asheghi, M.; Chandra, R. Efficient full-chip thermal modeling and analysis. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design, San Jose, CA, USA, 7–11 November 2004; pp. 319–326. [Google Scholar]
  7. Li, P.; Pileggi, L.T.; Asheghi, M.; Chandra, R. IC thermal simulation and modeling via efficient multigrid-based approaches. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2006, 25, 1763–1776. [Google Scholar]
  8. Yang, Y.; Zhu, C.; Gu, Z.; Shang, L.; Dick, R.P. Adaptive multi-domain thermal modeling and analysis for integrated circuit synthesis and design. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design, San Jose, CA, USA, 5–9 November 2006; pp. 575–582. [Google Scholar]
  9. Yang, Y.; Gu, Z.; Zhu, C.; Dick, R.P.; Shang, L. ISAC: Integrated Space-and-Time-Adaptive Chip-Package Thermal Analysis. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2007, 26, 86–99. [Google Scholar] [CrossRef] [Green Version]
  10. Wang, T.Y.; Chen, C.C.P. 3-D Thermal-ADI: A linear-time chip level transient thermal simulator. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2002, 21, 1434–1445. [Google Scholar] [CrossRef]
  11. Sridhar, A.; Vincenzi, A.; Ruggiero, M.; Brunschwiler, T.; Atienza, D. 3D-ICE: Fast compact transient thermal modeling for 3D ICs with inter-tier liquid cooling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, USA, 7–11 November 2010; pp. 463–470. [Google Scholar]
  12. Sridhar, A.; Vincenzi, A.; Atienza, D.; Brunschwiler, T. 3D-ICE: A Compact Thermal Model for Early-Stage Design of Liquid-Cooled ICs. IEEE Trans. Comput. 2014, 63, 2576–2589. [Google Scholar] [CrossRef] [Green Version]
  13. Ladenheim, S.; Chen, Y.C.; Mihajlovic, M.; Pavlidis, V. IC thermal analyzer for versatile 3-D structures using multigrid preconditioned Krylov methods. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, Austin, TX, USA, 7–10 November 2016; pp. 1–8. [Google Scholar]
  14. Zhan, Y.; Sapatnekar, S.S. High-Efficiency Green Function-Based Thermal Simulation Algorithms. IIEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2007, 26, 1661–1675. [Google Scholar] [CrossRef]
  15. Vincenzi, A.; Sridhar, A.; Ruggiero, M.; Atienza, D. Fast thermal simulation of 2D/3D integrated circuits exploiting neural networks and GPUs. In Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, Fukuoka, Japan, 1–3 August 2011; pp. 151–156. [Google Scholar]
  16. Lee, Y.M.; Pan, C.W.; Huang, P.Y.; Yang, C.P. LUTSim: A Look-Up Table-Based Thermal Simulator for 3-D ICs. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2015, 34, 1250–1263. [Google Scholar]
  17. Wang, T.Y.; Chen, C.C.P. SPICE-compatible thermal simulation with lumped circuit modeling for thermal reliability analysis based on modeling order reduction. In Proceedings of the International Symposium on Signals, Circuits and Systems, San Jose, CA, USA, 22–24 March 2004; pp. 357–362. [Google Scholar]
  18. Floros, G.; Evmorfopoulos, N.; Stamoulis, G. Efficient Hotspot Thermal Simulation Via Low-Rank Model Order Reduction. In Proceedings of the 15th International Conference on Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD), Prague, Czech Republic, 2–5 July 2018; pp. 205–208. [Google Scholar]
  19. Liu, X.X.; Zhai, K.; Liu, Z.; He, K.; Tan, S.X.D.; Yu, W. Parallel Thermal Analysis of 3-D Integrated Circuits With Liquid Cooling on CPU-GPU Platforms. IEEE Trans. Very Large Scale Integr. Syst. 2015, 23, 575–579. [Google Scholar]
  20. Ouzisik, N. Heat Transfer—A Basic Approach; Mcgraw-Hill College Book Company: New York, NY, USA, 1985. [Google Scholar]
  21. Bergman, T.; Lavine, B.; Incropera, P.; DeWitt, P. Fundamentals of Heat and Mass Transfer; Wiley: New York, NY, USA, 2017. [Google Scholar]
  22. Barrett, R.; Berry, M.; Chan, T.; Demmel, J.; Donato, J.; Dongarra, J.; Eijkhout, V.; Pozo, R.; Romine, C.; van der Vorst, H. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd ed.; SIAM: Philadelphia, PA, USA, 1992. [Google Scholar]
  23. Axelsson, O.; Barker, V.A. Quadratic Spline Collocation Methods for Elliptic Partial Differential Equations; Academic Press: Cambridge, MA, USA, 1984. [Google Scholar]
  24. Daloukas, K.; Marnari, A.; Evmorfopoulos, N.; Tsompanopoulou, P.; Stamoulis, G.I. A parallel fast transform-based preconditioning approach for electrical-thermal co-simulation of power delivery networks. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 18–22 March 2013; pp. 1689–1694. [Google Scholar]
  25. Daloukas, K.; Evmorfopoulos, N.; Tsompanopoulou, P.; Stamoulis, G. Parallel Fast Transform-Based Preconditioners for Large-Scale Power Grid Analysis on Graphics Processing Units (GPUs). IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2016, 35, 1653–1666. [Google Scholar] [CrossRef]
  26. Christara, C.C. Quadratic Spline Collocation Methods for Elliptic Partial Differential Equations. BIT Numer. Math. 1994, 34, 33–61. [Google Scholar] [CrossRef]
  27. Van Loan, C. Computational Frameworks for the Fast Fourier Transform; SIAM: Philadelphia, PA, USA, 1992. [Google Scholar]
  28. NVIDIA CUDA Programming Guide, CUSPARSE, CUBLAS, and CUFFT Library User Guides. Available online: http://developer.nvidia.com/nvidia-gpu-computing-documentation (accessed on 19 December 2018).
Figure 1. Spatial discretization of a chip for thermal analysis and the formulation of the electrical equivalent problem.
Figure 1. Spatial discretization of a chip for thermal analysis and the formulation of the electrical equivalent problem.
Technologies 07 00001 g001
Figure 2. Example of a 3D thermal grid that is used for preconditioning.
Figure 2. Example of a 3D thermal grid that is used for preconditioning.
Technologies 07 00001 g002
Figure 3. (a) Average number of iterations, (b) Average runtimes for each time-step in each benchmark.
Figure 3. (a) Average number of iterations, (b) Average runtimes for each time-step in each benchmark.
Technologies 07 00001 g003
Table 1. Analogy between electrical and thermal circuits.
Table 1. Analogy between electrical and thermal circuits.
Electrical CircuitThermal Circuit
VoltageTemperature
CurrentHeat Flow
Electrical ConductanceThermal Conductance
Electrical ResistanceThermal Resistance
Electrical CapacitanceThermal Capacitance
Current SourceHeat Source
Table 2. Runtime results for the three solvers. Bench. is the name of the benchmark, Discr. Point. is the number of points that correspond to the x and y axis, Layers is the number of layers of each chip and corresponds to the z axis, Time (s) denotes the average time required for the solution at each iteration, Iter. is the average number of iterations required for the convergence of each iterative method, while Speedup denotes the speedup of our method over the ICCG.
Table 2. Runtime results for the three solvers. Bench. is the name of the benchmark, Discr. Point. is the number of points that correspond to the x and y axis, Layers is the number of layers of each chip and corresponds to the z axis, Time (s) denotes the average time required for the solution at each iteration, Iter. is the average number of iterations required for the convergence of each iterative method, while Speedup denotes the speedup of our method over the ICCG.
Bench.Discr. PointsLayersICCGFT-PCG (CPU)FT-PCG (GPU)
Iter.Time (s)Iter.Time (s)SpeedupIter.Time (s)Speedup
ckt11755480.48120.311.54×110.0316×
ckt23205571.98151.231.6×120.0824.75×
ckt34106584.20162.641.59×120.2318.26×
ckt45007678.35174.511.85×120.3126.93×
ckt584575821.07129.482.22×111.3415.72×
ckt694675928.271617.941.57×111.6017.65×
ckt7111886849.351733.071.49×122.3920.64×

Share and Cite

MDPI and ACS Style

Floros, G.; Daloukas, K.; Evmorfopoulos, N.; Stamoulis, G. A Preconditioned Iterative Approach for Efficient Full Chip Thermal Analysis on Massively Parallel Platforms. Technologies 2019, 7, 1. https://doi.org/10.3390/technologies7010001

AMA Style

Floros G, Daloukas K, Evmorfopoulos N, Stamoulis G. A Preconditioned Iterative Approach for Efficient Full Chip Thermal Analysis on Massively Parallel Platforms. Technologies. 2019; 7(1):1. https://doi.org/10.3390/technologies7010001

Chicago/Turabian Style

Floros, George, Konstantis Daloukas, Nestor Evmorfopoulos, and George Stamoulis. 2019. "A Preconditioned Iterative Approach for Efficient Full Chip Thermal Analysis on Massively Parallel Platforms" Technologies 7, no. 1: 1. https://doi.org/10.3390/technologies7010001

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop