Article

New Strategies Based on Hierarchical Matrices for Matrix Polynomial Evaluation in Exascale Computing Era

1 The Institute of Polymers, Composites, and Biomaterials (IPCB), National Research Council (CNR), 80078 Pozzuoli, Italy
2 Department of Mathematics, University of Naples Federico II, 80138 Napoli, Italy
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(9), 1378; https://doi.org/10.3390/math13091378
Submission received: 18 February 2025 / Revised: 4 April 2025 / Accepted: 19 April 2025 / Published: 23 April 2025

Abstract: Advancements in the deployment of computing platforms have acted as both push and pull factors for the advancement of engineering design and scientific knowledge. Historically, improvements in computing platforms depended mostly on simultaneous developments in hardware, software, architecture, and algorithms (a process known as co-design), which raised the performance of computational models. However, there are many obstacles to the effective use of the sophisticated computing platforms of the Exascale Computing Era. These include, but are not limited to, the effective exploitation of massive parallelism and the high complexity of programming heterogeneous computing facilities. Now is therefore the time to create new algorithms that are more resilient and energy-aware, able to address the demand for increasing data locality, and able to achieve much higher concurrency through high levels of scalability and granularity. In this context, some methods, such as those based on hierarchical matrices (HMs), have been identified as among the most promising for exploiting the new computing resources, precisely because of their strongly hierarchical nature. This work starts to assess the advantages, and the limits, of using HMs in operations such as the evaluation of matrix polynomials, which is crucial, for example, in the context of Graph Convolutional Deep Neural Networks (GC-DNNs). A case study from the GCNN context provides some insights into the effectiveness, in terms of accuracy, of employing HMs.

1. Introduction

The role of matrix polynomials can be considered relevant in many application areas. For example, many methods for computing matrix functions $f(A)$, where $f$ is a scalar function, $A \in \mathbb{R}^{n \times n}$, and $f(A)$ is a matrix of the same dimensions as $A$, require the evaluation of a matrix polynomial. Indeed, under some assumptions about the eigenvalues of the matrix $A$, the matrix function $f(A)$ can be expressed by the following Taylor series (see Theorem 4.7 in [1]):
$$ f(A) = \sum_{k=0}^{\infty} a_k (A - \alpha I)^k, $$ (1)
and hence approximated by a matrix polynomial of the form (see Theorem 4.8 in [1])
$$ f(A) \approx \sum_{k=0}^{K} a_k (A - \alpha I)^k, $$ (2)
where $I$ is the identity matrix and $a_k = \frac{f^{(k)}(\alpha)}{k!}$. Many applications can be re-formulated using matrix functions of the type $f(A)$ and then approximated by a matrix polynomial such as (2). As examples, we can list the following ones:
  • Differential equations offer a wealth of problems based on $f(A)$. Indeed, many semi-discretized Partial Differential Equations (PDEs) (for example, see the application related to the computational simulation of Power Systems [2]) can be (re-)formulated based on the following expression:
$$ y'(t) = A y + f(t, y), \quad t \in [0, T], \quad y(0) = c, \quad y \in \mathbb{R}^n, \ A \in \mathbb{R}^{n \times n}, $$ (3)
    where $f(t, y)$ contains the nonlinear terms and $A$ is a spatially discretized linear operator [1]. A large class of techniques known as exponential integrators use an explicit methodology to numerically integrate the remaining portion of the solution of (3) while treating the linear term exactly. The characteristic of exponential integrators [3] is that they employ the exponential function of the differential equation's Jacobian or an approximation of it. Since the late 1990s, they have garnered increased attention, primarily because of developments in numerical linear algebra that enabled the effective use of the methods [1].
    The Exponential Time Differencing (ETD) Euler method is a basic illustration of an exponential integrator. ETD enables computing an approximation $y_{n+1}$ of the solution $y$ of (3) at each of the $N$ instants $t_{n+1}$, $n = 0, \dots, N-1$, into which the interval $[0, T]$ is subdivided, by means of the following iterative formula:
$$ y_{n+1} = \psi_0(hA)\, y_n + h\, \psi_1(hA)\, f(t_n, y_n), $$ (4)
    where $h = t_{n+1} - t_n$ and where the functions $\psi_0(z)$ and $\psi_1(z)$ are defined as follows:
$$ \psi_0(z) = e^z, $$ (5)
$$ \psi_1(z) = \frac{e^z - 1}{z}. $$ (6)
    The evaluation of the matrix functions $\psi_0(hA)$ and $\psi_1(hA)$, required by the iterative formula (4), can then be approximated by the evaluation of a matrix polynomial of kind (2) (a minimal numerical sketch of iteration (4) is given after this list).
  • In Markov models, which are utilized in many different fields (from sociology to statistics and finance), a matrix function related to the matrix exponential, $f(A) = e^A$, is crucial. Indeed, consider a time-homogeneous continuous-time Markov process in which individuals move among $n$ states. The entry $(i, j)$ of the process's transition probability matrix $P(t) \in \mathbb{R}^{n \times n}$ represents the likelihood that an individual who begins in state $i$ at time $0$ will be in state $j$ at time $t$ (where $i, j = 1, \dots, n$). The transition intensity matrix $Q \in \mathbb{R}^{n \times n}$ is associated with the process and is connected to $P(t)$ by
$$ P(t) = e^{Qt}. $$ (7)
    Suppose a given transition matrix $P(1)$ has a generator $Q = \log P(1)$; then, $Q$ can be used to construct $P(t)$ at other times through Equation (7). The evaluation of the matrix exponential, required by Equation (7), can then be approximated by the evaluation of a matrix polynomial of kind (2).
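As a complement to the examples above, the following minimal Python/NumPy sketch illustrates the ETD Euler iteration (4). For clarity it forms $\psi_0(hA)$ and $\psi_1(hA)$ with dense matrix functions (SciPy's expm and a linear solve); in the setting of this paper these matrix functions would instead be approximated by matrix polynomials of kind (2). The problem data (the small matrix and the forcing term) are purely illustrative.

```python
import numpy as np
from scipy.linalg import expm, solve

def etd_euler(A, f, y0, T, N):
    """Minimal ETD Euler sketch for y'(t) = A y + f(t, y), iteration (4)."""
    h = T / N
    hA = h * A
    psi0 = expm(hA)                                  # psi_0(hA) = e^{hA}, Equation (5)
    # psi_1(hA) = (e^{hA} - I)(hA)^{-1}, Equation (6), computed via a linear solve
    psi1 = solve(hA, psi0 - np.eye(A.shape[0]))
    y, t = y0.copy(), 0.0
    for _ in range(N):
        y = psi0 @ y + h * (psi1 @ f(t, y))          # iteration (4)
        t += h
    return y

# usage: a small linear test problem with a constant forcing term (illustrative only)
A = np.array([[-2.0, 1.0], [1.0, -3.0]])
f = lambda t, y: np.array([1.0, 0.0])
y_final = etd_euler(A, f, np.zeros(2), T=1.0, N=50)
print(y_final)
```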
A more exhaustive list of examples of matrix functions that could benefit from polynomial approximations can be found in Higham [1].
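A minimal sketch of approximation (2), assuming $f(A) = e^A$ and $\alpha = 0$ (so that $a_k = 1/k!$), is given below in Python/NumPy; it compares the truncated Taylor polynomial against SciPy's expm. The test matrix is random and only illustrative.

```python
import numpy as np
from scipy.linalg import expm

def taylor_exp(A, K):
    """Approximate f(A) = e^A by the matrix polynomial (2) with alpha = 0,
    i.e. sum_{k=0}^{K} A^k / k!  (coefficients a_k = f^{(k)}(0)/k! = 1/k!)."""
    n = A.shape[0]
    term = np.eye(n)              # A^0 / 0!
    S = term.copy()
    for k in range(1, K + 1):
        term = term @ A / k       # A^k / k! built incrementally
        S += term
    return S

rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((50, 50))   # small norm -> fast Taylor convergence
for K in (2, 4, 8, 16):
    err = np.linalg.norm(taylor_exp(A, K) - expm(A), 2)
    print(f"K = {K:2d}   ||p_K(A) - e^A||_2 = {err:.2e}")
```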
In addition to the examples of matrix polynomial use mentioned above, we call attention to another example of an application using matrix polynomial evaluation that is related to the polynomials of the Graph Laplacian defined in the context of Graph Convolutional Deep Neural Networks [4,5] (see Section 4 for details).
Progress in the deployment of computing platforms has constituted both pull and push factors for the advancement of scientific knowledge and engineering design [6,7,8,9,10,11,12,13].
Historically, the high-performance computing (HPC) systems era started in the 1980s, when vector supercomputing dominated high-performance computing. Today, HPC systems are clusters composed of hundreds of nodes and millions of processors/cores, where each node is enriched by computational accelerators in the form of coprocessors, such as General Purpose Graphical Processing Units (GP-GPUs), and where the nodes are connected by high-speed, low-latency interconnects (such as InfiniBand) (see Figure 1 for a representation of modern HPC systems) [14].
The results of the popular high-performance LINPACK (HPL) benchmark [15], which is used to create the top 500 ranking of the fastest computers in the world [16], outline the exponential growth in advanced computing capabilities. Such a list demonstrates that the exascale ($10^{18}$ operations per second) Era is the current state of affairs in a performance improvement journey that has lasted over fifty years. In fact, the US Oak Ridge National Laboratory's Frontier supercomputer broke the glass ceiling of the exascale barrier in the second half of 2022 [17]. Historically, improvements in computing platforms were mostly dependent on simultaneous developments in hardware, software, architecture, and algorithms (a process known as co-design), which raised the performance of computational models. But there are many obstacles to the effective use of the sophisticated computing platforms of the Exascale Computing Era. These include, but are not limited to, the effective exploitation of massive parallelism and the high complexity of programming such heterogeneous computing facilities.
Research challenges in creating advanced computing systems for exascale objectives were identified by a number of studies [14,18,19].
From the perspective of a computational scientist, special relevance is assigned to the difficulties concerning new mathematical models and algorithms that can guarantee high degrees of scalability [20] and granularity [21], providing answers to the demand of increasing data locality and achieving much higher levels of concurrency. Thus, the time has come to create new algorithms that are more resilient, energy-conscious, and have fewer requirements for synchronization and communication.
To the best of our knowledge, the most recent analysis of the algorithms for polynomial evaluation for modern high-performance computing architectures can be found in [22]. The aim of that article was to evaluate the existing methods of polynomial evaluation on superscalar architectures, applying those methods to the computation of elementary functions such as $e^x$. The long story of parallel algorithms for polynomial evaluation starts with the work of Munro et al. [23], who first investigated parallel algorithms able to overcome the sequential nature of Horner's algorithm [1]. Parallel polynomial evaluation has been the subject of much research over the past 50 years, as also described by the survey contained in Ewart et al. [22], demonstrating the high level of interest in these algorithms.
In consideration that
… in pursuit of synchronization reduction, “additive operator” versions of sequentially “multiplicative operator” algorithms are often available … [24],
this work intends to explore new paths for matrix polynomial evaluation that are more suitable for exascale computing systems, where synchronization points can be very demanding and where the heterogeneity and hierarchical architecture of such systems call for more appropriate algorithms for operations involving matrices.
Some methods for polynomial evaluation, such as the well-known Horner method, suffer from a sequential nature and a high number of synchronization points, or from a low degree of parallelism. Furthermore, when matrix polynomials are based on sparse matrices, all the operations involving such matrices require constantly moving the sparse matrix entries from the main memory into the CPU, without register or cache reuse. Therefore, methods that demand less memory bandwidth need to be researched. To this end, there are promising techniques that avoid explicit sparse matrices, such as coefficient-matrix-free representations or hierarchical matrices [25].
In the context just described, this paper aims to
  • introduce some definitions and algorithms related to the concept of a hierarchical matrix;
  • introduce an algorithm for matrix polynomial evaluation that has a good degree of parallelism;
  • describe how such an algorithm can be combined with a hierarchical representation of a sparse matrix;
  • highlight the benefits and limitations of using hierarchical matrices in the evaluation of matrix polynomials based on some case studies [4,5] (see Section 4 for details). The considered case studies are related to the matrix polynomials used in the context of Graph Convolutional Deep Neural Networks that in recent years have gained great importance in many areas, such as medicine, image analysis, and speech recognition, to generate data-driven models;
  • provide some preliminary indications about the parallelization of the proposed algorithms.
The paper is structured as follows: in Section 2, the hierarchical matrices (HMs) are introduced and some algorithms of interest from basic matrix algebra that use HM formulations are described; Section 3 describes a strategy for the matrix polynomial evaluation based on the use of HMs that should be more suitable for exascale computing systems due to a reduced number of synchronization points and the hierarchical structure of HM-based operations; in Section 4, some results related to the use of the algorithms introduced in Section 2 and Section 3 are described to assess some advantages, and limits, of the use of HMs in a case study from the GCNN context. The study is summarized in Section 5, which also provides some information about our future work.

2. Fast Multipole Methods and Hierarchical Matrices

Fast Multipole Methods (FMMs), created by Rokhlin Jr. and Greengard [26], were named among the top 10 algorithms of the 20th century. They were first described in the context of particle simulations, where they reduce the computational cost of evaluating all pairwise interactions in a system of $N$ particles from $O(N^2)$ to $O(N)$ or $O(N \log N)$ operations. In such a context, if $\{x_i\}_{i=1,\dots,N}$ denote the locations of a set of $N$ electrical charges and $\{q_i\}_{i=1,\dots,N}$ denote their source strengths, the aim of a particle simulation is to evaluate the potentials
$$ u_i = \sum_{j=1}^{N} g(x_i, x_j)\, q_j, \quad i = 1, \dots, N, $$ (8)
where $g(x_i, x_j)$ is the interaction potential of electrostatics. The computation of all the values of the vector $u = [u_i]_{i=1,\dots,N}^T$ can then be expressed as the following matrix–vector operation
$$ u = A q, $$ (9)
where the vector $q$ and the matrix $A$ are, respectively, defined as [27]
$$ q = [q_i]_{i=1,\dots,N}^T \quad \text{and} \quad A = [g(x_i, x_j)]_{i,j=1,\dots,N}. $$
Considering that the kernel $g(x, y)$ is smooth when $x$ and $y$ are not close, then, if $I$ and $J$ are two index subsets of $\{1, \dots, N\}$ that represent two subsets $\{x_i\}_{i \in I}$ and $\{x_j\}_{j \in J}$ of distant points, the sub-matrix $A_{IJ} = [g(x_i, x_j)]_{i \in I, j \in J}$ admits an approximate rank-$P$ factorization of the kind
$$ A_{IJ} \approx B C^T, $$ (10)
where $B$ and $C$ have, respectively, $|I| \times P$ and $|J| \times P$ dimensions [27]. Then, every matrix–vector product involving the sub-matrix $A_{IJ}$ can be executed by just $O(P(|I| + |J|))$ operations. In problem (9), no single relation like (10) can hold for all combinations of target and source points. The domain can then be considered cut into pieces, and approximations such as (10) can be used to evaluate interactions between distant pieces, with direct evaluation used only for points that are close. Equivalently, one could say that the matrix–vector product (9) can be computed by exploiting rank deficiencies in the off-diagonal blocks of $A$ [27].
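The following Python/NumPy sketch illustrates the rank deficiency exploited in (10) on a toy one-dimensional example, assuming the kernel $g(x, y) = 1/|x - y|$ and two well-separated clusters of points (both assumptions are illustrative): a truncated SVD supplies the factors $B$ and $C$, and the matrix–vector product with the block is then performed in $O(P(|I| + |J|))$ operations.

```python
import numpy as np

# Two well-separated clusters of points on the line and the kernel g(x, y) = 1/|x - y|.
rng = np.random.default_rng(1)
xI = rng.uniform(0.0, 1.0, 300)        # targets, indices I
xJ = rng.uniform(5.0, 6.0, 400)        # sources, indices J (distant from I)
A_IJ = 1.0 / np.abs(xI[:, None] - xJ[None, :])

# Rank-P factorization A_IJ ~= B C^T obtained here from a truncated SVD (eq. (10)).
U, s, Vt = np.linalg.svd(A_IJ, full_matrices=False)
P = 6
B = U[:, :P] * s[:P]                   # |I| x P
C = Vt[:P, :].T                        # |J| x P

q = rng.standard_normal(xJ.size)
u_full = A_IJ @ q                      # O(|I| |J|) operations
u_lowrank = B @ (C.T @ q)              # O(P (|I| + |J|)) operations
print("relative matvec error:", np.linalg.norm(u_full - u_lowrank) / np.linalg.norm(u_full))
```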
Then, throughout its history, FMM laid the groundwork for more practical uses, like the fast multiplication of vectors by fully populated special matrices [27,28,29]. In linear algebra, FMM is known as fast multiplication by $\mathcal{H}$-matrices, which is a combination of the Panel Clustering Method and the mosaic skeleton matrix approaches [13,30,31,32,33].
Paraphrasing Yokota et al. [34], an $\mathcal{H}$-matrix-based algorithm "appears to be an ideal algorithm for hybrid hierarchical distributed-shared memory architectures targeting billion-thread concurrency, which is the dominant design for contemporary extreme computing resources".
The use of $\mathcal{H}$-matrices reduces the $O(m^2)$ asymptotic computational complexity $CComp(m)$ of a matrix–vector product $Ab$, where $A \in \mathbb{R}^{m \times m}$ and $b \in \mathbb{R}^m$, to an $O(m \log m)$ or $O(m)$ one through a hierarchy of tree-based operations, by distinguishing between near interactions, which must be treated directly, and far interactions, which are recursively coarsened.
The principal feature of algorithms based on $\mathcal{H}$-matrices is the recursive generation of a self-similar collection of problems. The $\mathcal{H}$-matrix format is suitable for a variety of operations, including matrix–vector multiplication, matrix inversion, matrix–matrix multiplication, matrix–matrix addition, etc.

2.1. Reduced-Rank Matrix Definitions and Usage

Initially, it is essential to present the concept of an $R_k$-matrix. The concept of a reduced-rank matrix finds wide application in many contexts. In recent years, it has been used in machine learning processes such as supervised classification. Consider as an example the applications in which the Least Absolute Shrinkage and Selection Operator (LASSO) [35] method is used for feature selection [36], which benefits from the reduced-rank approximation of the matrices that represent the weight that the single features have in the classification process [37].
Definition 1 ($R_k$-matrix). Let $R \in \mathbb{R}^{M \times N}$ be a matrix of the form
$$ R = A B^T, \quad A \in \mathbb{R}^{M \times k}, \ B^T \in \mathbb{R}^{k \times N}. $$
Then, $R$ is called an $R_k$-matrix. Any matrix of rank at most $k$ can be represented as an $R_k$-matrix, and each $R_k$-matrix has rank at most $k$.
Different approaches can be used to compute an $R_k$-matrix approximation $\widetilde{M} = \widetilde{M}_1 \widetilde{M}_2$ of an arbitrary matrix $M$ [30]. As an example, we cite the one based on the truncated singular value decomposition (TSVD). The TSVD provides the best approximation of an arbitrary matrix in the spectral and Frobenius norms, and hence Proposition 1 is valid (see [31,32] for details).
Proposition 1 (Best approximation by a low-rank matrix). Suppose that the matrix $M \in \mathbb{R}^{N \times N}$ has the singular value decomposition (SVD) [38] $M = U \Sigma V^T$. Then, $U$ and $V$ are orthogonal, and $\Sigma$ is a diagonal matrix whose diagonal elements $\Sigma_{i,i}$, $i = 1, \dots, N$, are the singular values $\sigma_i$ of $M$, where $\sigma_{i-1} \geq \sigma_i$, $i = 2, \dots, N$.
Then, both the following minimization problems
$$ \operatorname*{arg\,min}_{R :\ \mathrm{rank}(R) \leq r} \| M - R \|_2, $$
$$ \operatorname*{arg\,min}_{R :\ \mathrm{rank}(R) \leq r} \| M - R \|_F, $$
where
$$ R = U \Sigma_r V^T, \quad \text{with} \quad (\Sigma_r)_{i,j} = \begin{cases} \sigma_i & \text{if } i = j \leq r \\ 0 & \text{otherwise,} \end{cases} $$
are solved by the same matrix $R$, which is such that
$$ \| M - R \|_2 = \sigma_{r+1}, $$
$$ \| M - R \|_F = \sqrt{\sum_{i=r+1}^{N} \sigma_i^2}, $$
where $\| \cdot \|_2$ and $\| \cdot \|_F$ denote, respectively, the L2 and Frobenius norms.
The matrix $R$ can also be expressed through its $r$-reduced SVD:
$$ R = \bar{U} \bar{\Sigma} \bar{V}^T, $$
where $\bar{U}, \bar{V} \in \mathbb{R}^{N \times r}$ are the matrices built on the first $r$ columns of $U$ and $V$, respectively, and $\bar{\Sigma} \in \mathbb{R}^{r \times r}$ is the diagonal matrix built on the top-left $r \times r$ block of the matrix $\Sigma$. Or, equivalently,
$$ R = A B^T, $$
where $A = \bar{U}$ and $B = \bar{V} \bar{\Sigma}$, or $A = \bar{U} \bar{\Sigma}$ and $B = \bar{V}$.
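A short Python/NumPy check of Proposition 1 is sketched below under the assumption of a random test matrix: the r-reduced SVD gives the best rank-r approximation, with spectral error equal to $\sigma_{r+1}$ and Frobenius error equal to the root of the sum of the squares of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((200, 200))
U, s, Vt = np.linalg.svd(M)

r = 10
R = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]        # r-reduced SVD of Proposition 1
A, B = U[:, :r] * s[:r], Vt[:r, :].T             # equivalently R = A B^T, an R_r-matrix

print(np.linalg.norm(M - R, 2), "==", s[r])                           # spectral error = sigma_{r+1}
print(np.linalg.norm(M - R, "fro"), "==", np.sqrt(np.sum(s[r:]**2)))  # Frobenius error
```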
Proposition 2 (Best approximation by a rank-r matrix $M_r$ of a rank-s matrix $M_s$, $r < s$). Let $M_s$ be a rank-s matrix; then,
$$ M_s = A_s B_s^T, $$
where $A_s, B_s \in \mathbb{R}^{N \times s}$. The following rank-r matrix $M_r$, expressed as
$$ M_r = U_r \Sigma_r V_r^T, \quad \text{where} \quad U_r, V_r \in \mathbb{R}^{N \times r}, \ \Sigma_r \in \mathbb{R}^{r \times r}, $$
and hence defined by its r-reduced SVD, is the best rank-r approximation of the rank-s matrix $M_s$. The matrices $U_r$, $\Sigma_r$, and $V_r^T$ are defined through the QR decomposition of the matrices $A_s$ and $B_s$ (let $M \in \mathbb{R}^{n \times m}$ be a matrix such that $n > m$; then, a matrix $Q \in \mathbb{R}^{n \times m}$ with orthonormal columns and a square upper triangular matrix $R \in \mathbb{R}^{m \times m}$ exist with $M = QR$; such a factorization of $M$ is called a reduced QR decomposition of the matrix $M$ [32,38]); that is,
$$ A_s = Q_A R_A, \quad \text{(QR decomposition of } A_s\text{)} $$
$$ B_s = Q_B R_B, \quad \text{(QR decomposition of } B_s\text{)} $$
and through the SVD of $R_A R_B^T$,
$$ R_A R_B^T = \bar{U} \Sigma_r \bar{V}^T, \quad \text{(SVD of } R_A R_B^T\text{)} $$
since
$$ U_r = Q_A \bar{U}, $$
$$ V_r = Q_B \bar{V}. $$
Definition 2 (Addition and Formatted Addition of two reduced-rank matrices). Let $R_1 = A B^T$ and $R_2 = C D^T$ be two reduced-rank matrices, respectively, of ranks $k_1$ and $k_2$. The sum
$$ R = R_1 + R_2 = [A\ C][B\ D]^T $$
is a reduced-rank matrix with rank $k = k_1 + k_2$. We define the formatted addition $R_1 \oplus_K R_2$ of two reduced-rank matrices as the best approximation of the sum in the $R_K$-matrix set. A rank-$K$ approximation of the sum $R_1 + R_2$ can be computed through the steps described in Proposition 2.
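The recompression of Proposition 2 and the formatted addition of Definition 2 can be sketched in a few lines of Python/NumPy, as below. The helper names (recompress, formatted_add) and the random test factors are only illustrative; the point is that the exact sum is never formed as a full matrix, since only its concatenated factors are recompressed.

```python
import numpy as np

def recompress(A, B, r):
    """Best rank-r approximation of M = A @ B.T (Proposition 2):
    QR of both factors, small SVD of R_A R_B^T, then truncation to rank r."""
    QA, RA = np.linalg.qr(A)                   # reduced QR decompositions
    QB, RB = np.linalg.qr(B)
    Ub, sb, Vbt = np.linalg.svd(RA @ RB.T)     # SVD of the small core
    Ur = QA @ Ub[:, :r]
    Vr = QB @ Vbt[:r, :].T
    return Ur * sb[:r], Vr                     # factors of M_r = (U_r Sigma_r) V_r^T

def formatted_add(A1, B1, A2, B2, K):
    """Formatted addition R1 (+)_K R2 of R1 = A1 B1^T and R2 = A2 B2^T (Definition 2):
    concatenate the factors of the exact sum and recompress to rank K."""
    return recompress(np.hstack([A1, A2]), np.hstack([B1, B2]), K)

# usage sketch on two random rank-3 matrices, truncated back to rank K = 3
rng = np.random.default_rng(3)
A1, B1 = rng.standard_normal((100, 3)), rng.standard_normal((100, 3))
A2, B2 = rng.standard_normal((100, 3)), rng.standard_normal((100, 3))
AK, BK = formatted_add(A1, B1, A2, B2, K=3)
exact = A1 @ B1.T + A2 @ B2.T
print("rank-3 truncation error:", np.linalg.norm(exact - AK @ BK.T, 2))
```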
Proposition 3 (CKD Representation of an $R_k$-matrix $M$). Any $R_k$-matrix $M$ factorizable as
$$ M = A B^T, \quad \text{where} \ A, B \in \mathbb{R}^{N \times r}, $$
can also be expressed as
$$ M = C K D^T, \quad \text{where} \ C, D \in \mathbb{R}^{N \times r}, \ K \in \mathbb{R}^{r \times r}, $$
and vice versa [32].
Proposition 4 (Multiplication of two $R_k$-matrices). Let $R_1 = A B^T$ and $R_2 = C D^T$ be two $R_k$-matrices. The multiplication
$$ R = R_1 R_2 = A B^T C D^T $$
is an $R_k$-matrix.
Proposition 5 (Multiplication of an $R_k$-matrix by an arbitrary matrix from the right (or left)). Let $R_1 = A B^T$ be an $R_k$-matrix and $M$ an arbitrary matrix. The multiplication
$$ R = R_1 M = A (M^T B)^T, \quad \text{or} \quad R = M R_1 = (M A) B^T, $$
is an $R_k$-matrix.
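The following Python/NumPy fragment illustrates Propositions 4 and 5 on random factors (illustrative data): products involving an $R_k$-matrix never require forming the full operands, since only the low-rank factors are updated.

```python
import numpy as np

# R1 = A B^T and R2 = C D^T of rank k; their product (Proposition 4) is again an
# R_k-matrix and can be formed without ever building the full N x N operands:
# R1 @ R2 = A (B^T C) D^T  ->  factors A @ (B.T @ C) and D.
rng = np.random.default_rng(4)
N, k = 500, 4
A, B = rng.standard_normal((N, k)), rng.standard_normal((N, k))
C, D = rng.standard_normal((N, k)), rng.standard_normal((N, k))

left = A @ (B.T @ C)          # N x k factor, costs O(N k^2)
# Proposition 5: multiplying by an arbitrary matrix M from the right also keeps rank k,
# R1 @ M = A (M^T B)^T, i.e. only the second factor changes.
M = rng.standard_normal((N, N))
new_B = M.T @ B               # factors of R1 @ M are (A, M^T B)

print(np.allclose((A @ B.T) @ (C @ D.T), left @ D.T))
print(np.allclose((A @ B.T) @ M, A @ new_B.T))
```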

2.2. Hierarchical Matrices’ Definitions and Usage

According to [30,31], let us now introduce hierarchical matrices (also called $\mathcal{H}$-matrices).
Definition 3 (Block cluster quad-tree of a matrix $A$). Let $T_I^{(L)}$ be a binary tree [39] with the $L+1$ levels $l = 0, \dots, L$, and denote by $T_I$ the set of its nodes. $T_I$ is called a binary cluster tree corresponding to an index set $I$ if the following conditions hold [13]:
1. each node of $T_I$ is a subset of the index set $I$;
2. $I$ is the root of $T_I^{(L)}$ (i.e., the node at the 0-th level of $T_I^{(L)}$);
3. if $\tau \in T_I$ is a leaf (i.e., a node with no sons), then $C_{min}^{leaf} \leq |\tau| \leq C^{leaf}$;
4. if $\tau \in T_I$ is not a leaf and its set of sons is denoted by $S(\tau) \subset T_I$, then $|S(\tau)| = 2$ and $\tau = \dot{\bigcup}_{\tau' \in S(\tau)} \tau'$.
Let $I$ be an index set and let $AC(\tau \times \sigma) \in \{0, 1\}$ be a logical value representing an admissibility condition on $\tau \times \sigma$. Moreover, let $T_I^{(L)}$ be a binary cluster tree on the index set $I$. The block cluster quad-tree $T_{I \times I}$ corresponding to $T_I^{(L)}$ and to the admissibility condition $AC(\tau \times \sigma)$ can be built by the procedure represented in Algorithm 1 [13].
Definition 4 ($\mathcal{H}$-matrix of blockwise rank k). Let $A \in \mathbb{R}^{N \times N} = [a_{i,j}]_{i,j=1,\dots,N}$ be a matrix, let $I = \{1, \dots, N\}$ be the index set of $A$, and let $k \in \mathbb{N}$. Let us assume that, for a matrix $A$ and subsets $\tau, \sigma \subseteq I$, the notation $A_{\tau \times \sigma}$ represents the block $[A_{ij}]_{i \in \tau, j \in \sigma}$. Moreover, let $T_{I \times I}^{(L)}$ be the block cluster quad-tree on the index set $I$ whose admissibility condition $AC(\tau \times \sigma)$ is defined as
$$ AC(\tau \times \sigma) = \begin{cases} 1 & \text{if } A_{\tau \times \sigma} \text{ can be approximated by an } R_k\text{-matrix in a specified norm } \| \cdot \| \\ 0 & \text{otherwise.} \end{cases} $$
Then, the matrix $A$ is called an $\mathcal{H}$-matrix of blockwise rank $k$ defined on the block cluster quad-tree $T_{I \times I}^{(L)}$.
Let us recall [38] that, given a matrix $A$, the matrix $\tilde{A}$ is said to be an approximation of $A$ in a specified norm $\| \cdot \|$ if there exists $\epsilon$ such that $\| A - \tilde{A} \| < \epsilon$.
Algorithm 1 Procedure for building the block cluster quad-tree $T_{I \times I}^{(L)}$ corresponding to a cluster tree $T_I^{(L)}$ and an admissibility condition $AC$. The index set $\tau \times \sigma$ and the value $l$ to be used in the first call to the recursive BlockClusterQuadTree procedure are such that $\tau = \sigma = I$, $l = 0$ [13].
1: procedure BlockClusterQuadTree($T_{I \times I}^{(L)}$, $L$, $l$, $AC(\tau \times \sigma)$, $\tau \times \sigma$)
2: Input: $T_{I \times I}^{(L)}$, $AC(\tau \times \sigma)$, $L$, $l$, $\tau \times \sigma$
3:     if ($AC(\tau \times \sigma) = 0$ and $l < L$) then
4:         $S(\tau \times \sigma) = \{\tau' \times \sigma' : \tau' \in S(\tau), \sigma' \in S(\sigma)\}$
5:         for $\tau' \times \sigma' \in S(\tau \times \sigma)$ do
6:             BlockClusterQuadTree($T_{I \times I}^{(L)}$, $L$, $l+1$, $AC(\tau' \times \sigma')$, $\tau' \times \sigma'$)
7:         end for
8:     else
9:         $S(\tau \times \sigma) = \emptyset$
10:     end if
11: end procedure
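A possible Python sketch of the recursive construction of Algorithm 1 is reported below. It assumes, for illustration only, a regular halving of the index sets (playing the role of the binary cluster tree $T_I^{(L)}$) and an SVD-based admissibility check in the spirit of Definition 4; practical $\mathcal{H}$-matrix codes usually replace the SVD check with cheaper geometric admissibility conditions.

```python
import numpy as np

def admissible(block, k, eps):
    """Admissibility condition AC(tau x sigma) in the spirit of Definition 4: the block
    can be approximated by an R_k-matrix within tolerance eps in the spectral norm,
    i.e. its (k+1)-th singular value is below eps."""
    s = np.linalg.svd(block, compute_uv=False)
    return len(s) <= k or s[k] < eps

def block_cluster_quadtree(A, rows, cols, level, L, k, eps, leaves):
    """Sketch of Algorithm 1: split an inadmissible block into its 4 sons until
    either the admissibility condition holds or the maximum depth L is reached."""
    block = A[np.ix_(rows, cols)]
    adm = admissible(block, k, eps)
    if adm or level == L:
        leaves.append((rows, cols, adm))
        return
    rmid, cmid = len(rows) // 2, len(cols) // 2
    for rs in (rows[:rmid], rows[rmid:]):          # binary split of tau
        for cs in (cols[:cmid], cols[cmid:]):      # binary split of sigma
            block_cluster_quadtree(A, rs, cs, level + 1, L, k, eps, leaves)

# usage on a kernel-type matrix with low-rank off-diagonal blocks (illustrative data)
x = np.linspace(0.0, 1.0, 256)
A = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))
leaves = []
block_cluster_quadtree(A, np.arange(256), np.arange(256), 0, L=4, k=3, eps=1e-8, leaves=leaves)
print(len(leaves), "leaves,", sum(adm for *_, adm in leaves), "admissible")
```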
As examples of the use of $\mathcal{H}$-matrices to reduce the complexity of effective algorithms, we propose four algorithms that are useful for the aim of this work. The first two algorithms of the list below are not new and are already present in the literature (for example, see Hackbusch et al. [31]), the third one is new, and the last one is a revisitation/simplification of an algorithm presented elsewhere in the literature [31].
  • the computation of the formatted matrix addition $C = A \oplus_{k_C} B$ of the $\mathcal{H}$-matrices $A, B, C \in \mathbb{R}^{I \times I}$, respectively, of blockwise ranks $k_A$, $k_B$, and $k_C$ (see Algorithm 2);
  • the computation of the matrix–vector product (see Algorithm 3);
  • the computation of the scaled matrix $C = \alpha A$ of the $\mathcal{H}$-matrix $A \in \mathbb{R}^{I \times I}$ by a scalar value $\alpha$ (see Algorithm 4);
  • the computation of the matrix–matrix formatted product $C = C \oplus_K (A \odot_K B)$ of the $\mathcal{H}$-matrices $A, B, C \in \mathbb{R}^{I \times I}$, respectively, of blockwise ranks $k_A$, $k_B$, and $k_C$ (see Algorithm 5).
    The algorithm is a simplified version of a more general one used for computing $C = C \oplus_K (A \odot_K B)$, where $A \in \mathbb{R}^{I \times J}$, $B \in \mathbb{R}^{J \times K}$, and $C \in \mathbb{R}^{I \times K}$ are, respectively, $\mathcal{H}$-matrices of blockwise ranks $k_A$, $k_B$, and $k_C$, defined on the block cluster quad-trees $T_{I \times J}^{A,(L)}$, $T_{J \times K}^{B,(L)}$, and $T_{I \times K}^{C,(L)}$. See Hackbusch et al. [31] for details about the general algorithm. The simplification is applicable due to the following assumptions [40]:
    The matrices $A$, $B$, and $C$ are square: $I = J = K$.
    For the block cluster quad-trees $T_{I \times I}^{A,(L)}$, $T_{I \times I}^{B,(L)}$, and $T_{I \times I}^{C,(L)}$, the following equations hold:
$$ T_{I \times I}^{A,(L)} = T_{I \times I}^{B,(L)} = T_{I \times I}^{(L)}, $$ (24)
$$ T_{I \times I}^{C,(L)} = T_{I \times I}^{(L)} \cdot T_{I \times I}^{(L)}, $$ (25)
$$ T_{I \times I}^{(L)} \cdot T_{I \times I}^{(L)} \subseteq T_{I \times I}^{(L)}, $$ (26)
    where $T_{I \times I}^{(L)} \cdot T_{I \times I}^{(L)}$ is the so-called product of block cluster quad-trees, defined as in [40] through its root and through the description of the set of sons of each node. In particular,
    the root of $T_{I \times I}^{(L)} \cdot T_{I \times I}^{(L)}$ is
$$ \mathrm{root}\big(T_{I \times I}^{(L)} \cdot T_{I \times I}^{(L)}\big) = I \times I; $$
    let $\tau \times \sigma$ be a node at the $l$-th level of $T_{I \times I}^{(L)} \cdot T_{I \times I}^{(L)}$; the set $S(\tau \times \sigma)$ of sons of $\tau \times \sigma$ is defined by
$$ S(\tau \times \sigma) = \big\{ \tau' \times \sigma' \ \big| \ \exists \text{ a node } \zeta \text{ at the } l\text{-th level of } T_I^{(L)}, \ \exists \text{ a node } \zeta' \text{ at the } (l+1)\text{-th level of } T_I^{(L)} : \ \tau' \times \zeta' \in S(\tau \times \zeta), \ \zeta' \times \sigma' \in S(\zeta \times \sigma) \big\}. $$
    Equation (26) expresses the condition that the block cluster quad-tree $T_{I \times I}^{(L)}$ is almost idempotent. According to [40], such a condition can be expressed as follows. Let $\tau \times \sigma$ be a node of $T_{I \times I}^{(L)} \cdot T_{I \times I}^{(L)}$, and let us define the quantities $C(\tau \times \sigma)$ and $C_{Id}(T_{I \times I}^{(L)})$:
$$ C(\tau \times \sigma) = \big| \big\{ \tau' \times \sigma' \ \big| \ \tau' \in S(\tau), \ \sigma' \in S(\sigma), \ \text{and} \ \exists \zeta \in T_I^{(L)} : \ \tau' \times \zeta, \ \zeta \times \sigma' \in T_{I \times I}^{(L)} \big\} \big|, $$
$$ C_{Id}\big(T_{I \times I}^{(L)}\big) = \max_{\tau \times \sigma \in \mathcal{L}(T_{I \times I})} C(\tau \times \sigma), $$
    where $\mathcal{L}(T_{I \times I})$ represents the set of all the leaves of $T_{I \times I}^{(L)}$.
    $T_{I \times I}^{(L)}$ is said to be almost idempotent if $C_{Id}(T_{I \times I}^{(L)}) \approx 1$ (respectively, idempotent if $C_{Id}(T_{I \times I}^{(L)}) = 1$).
    According to Lemma 2.19 of [40], for the product of two $\mathcal{H}$-matrices $A$ and $B$ for which conditions (24)–(26) are valid, the following statement holds.
    For each leaf $\tau \times \sigma$ in the set $\mathcal{L}(T_{I \times I}, l)$ of all the leaves of $T_{I \times I}^{(L)}$ at the $l$-th level of $T_{I \times I}^{(L)}$, let $U(\tau \times \sigma, l)$ be the set defined as
$$ U(\tau \times \sigma, l) = \big\{ \zeta \in T_I^{(L)}(l) \ \big| \ F(\tau) \times \zeta, \ \zeta \times F(\sigma) \ \text{are nodes of the } l\text{-th level of } T_{I \times I}^{(L)} \ \text{and} \ F(\tau) \times \zeta \ \text{or} \ \zeta \times F(\sigma) \ \text{are leaves of } T_{I \times I}^{(L)} \big\}, $$
    where $T_I^{(L)}(l)$ and $F(\sigma)$ denote, respectively, the set of nodes of $T_I^{(L)}$ at the $l$-th level and the father of a node $\sigma$.
    Then, for each leaf $\tau \times \sigma \in \mathcal{L}(T_{I \times I}, l)$, where $0 \leq l \leq L$, the following equation is valid:
$$ (A\,B)_{\tau \times \sigma} = \sum_{i=0}^{l} \sum_{\zeta \in U(\tau \times \sigma, i)} A_{\tau \times \zeta} \cdot B_{\zeta \times \sigma}. $$
    According to Theorem 2.24 in [40], for the rank $k_{AB}$ of the matrix $A\,B$, we have that
$$ k_{AB} \leq C_{Id}\big(T_{I \times I}^{(L)}\big)\, C_{Sp}\big(T_{I \times I}^{(L)}\big)\, (L+1)\, \max(k_A, k_B), $$ (30)
    where $C_{Sp}(T_{I \times I}^{(L)})$ is called the sparsity constant and is defined as
$$ C_{Sp}\big(T_{I \times I}^{(L)}\big) = \max_{\tau \in T_I^{(L)}} \big| \big\{ \sigma \in T_I^{(L)} \ \big| \ \tau \times \sigma \ \text{is a node of} \ T_{I \times I}^{(L)} \big\} \big|. $$ (31)
Other algorithms of interest from basic matrix algebra that use $\mathcal{H}$-matrices are described in Hackbusch et al. [31].
Algorithm 2 Formatted matrix addition $C = A \oplus_{k_C} B$ of the $\mathcal{H}$-matrices $A, B, C \in \mathbb{R}^{I \times I}$, respectively, of blockwise ranks $k_A$, $k_B$, and $k_C$ (defined on the block cluster quad-tree $T_{I \times I}^{(L)}$). The index sets $\tau, \sigma$ to be used in the first call to the recursive HMatrix-MSum procedure are such that $\tau = \sigma = I$ [13].
1: procedure HMatrix-MSum($A$, $B$, $C$, $T_{I \times I}^{(L)}$, $\tau$, $\sigma$)
2: Input: $A$, $B$, $T_{I \times I}^{(L)}$, $\tau$, $\sigma$
3: Output: $C$
4:     if ($S(\tau \times \sigma) \neq \emptyset$) then    ▹ $\tau \times \sigma$ is not a leaf of $T_{I \times I}^{(L)}$
5:         for each $\tau' \in S(\tau)$, $\sigma' \in S(\sigma)$ do
6:             HMatrix-MSum($A$, $B$, $C$, $T_{I \times I}^{(L)}$, $\tau'$, $\sigma'$)
7:         end for
8:     else              ▹ $\tau \times \sigma$ is a leaf of $T_{I \times I}^{(L)}$
9:         $C_{\tau \times \sigma} \leftarrow A_{\tau \times \sigma} \oplus_{k_C} B_{\tau \times \sigma}$
10:     end if
11: end procedure
Algorithm 3 Matrix–vector multiplication $y = y + A x$ of the $\mathcal{H}$-matrix $A \in \mathbb{R}^{I \times I}$ of blockwise rank $k$ (defined on the block cluster quad-tree $T_{I \times I}^{(L)}$) with the vector $x \in \mathbb{R}^I$. The index sets $\tau \times \sigma$ to be used in the first call to the recursive HMatrix-MVM procedure are such that $\tau = \sigma = I$ [13].
1: procedure HMatrix-MVM($A$, $y$, $x$, $T_{I \times I}^{(L)}$, $\tau \times \sigma$)
2: Input: $A$, $y$, $x$, $T_{I \times I}^{(L)}$, $\tau \times \sigma$
3: Output: $y$
4:     if $S(\tau \times \sigma) \neq \emptyset$ then     ▹ $\tau \times \sigma$ is not a leaf of $T_{I \times I}^{(L)}$
5:         for each $\tau' \times \sigma' \in S(\tau \times \sigma)$ do
6:             HMatrix-MVM($A$, $y$, $x$, $T_{I \times I}^{(L)}$, $\tau' \times \sigma'$)
7:         end for
8:     else              ▹ $\tau \times \sigma$ is a leaf of $T_{I \times I}^{(L)}$
9:         $y_{\tau} \leftarrow y_{\tau} + A_{\tau \times \sigma}\, x_{\sigma}$
10:     end if
11: end procedure
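The recursion of Algorithm 3 can be sketched in Python/NumPy as follows, assuming (for illustration) a toy two-level $\mathcal{H}$-matrix whose leaves store either a dense block or the factors $U, V$ of an $R_k$ block; the recursive traversal accumulates the leaf contributions into the proper sub-blocks of $y$.

```python
import numpy as np

# A leaf stores either a dense block ("full") or a low-rank factorization ("rk", U, V)
# with block = U @ V.T; an inner node stores its four sons. This mirrors Algorithm 3.

def hmatvec(node, x, y, rows, cols):
    """Recursive matrix-vector product y <- y + A|_(rows x cols) x|_cols."""
    if node[0] == "node":                         # tau x sigma is not a leaf
        rmid, cmid = len(rows) // 2, len(cols) // 2
        sons = node[1]
        for i, rs in enumerate((rows[:rmid], rows[rmid:])):
            for j, cs in enumerate((cols[:cmid], cols[cmid:])):
                hmatvec(sons[i][j], x, y, rs, cs)
    elif node[0] == "rk":                         # admissible leaf: y_tau += U (V^T x_sigma)
        _, U, V = node
        y[rows] += U @ (V.T @ x[cols])
    else:                                         # inadmissible leaf: dense update
        y[rows] += node[1] @ x[cols]

# usage: a 2-level toy H-matrix with low-rank off-diagonal blocks (illustrative data)
rng = np.random.default_rng(5)
n, k = 128, 2
D1, D2 = rng.standard_normal((n // 2, n // 2)), rng.standard_normal((n // 2, n // 2))
U12, V12 = rng.standard_normal((n // 2, k)), rng.standard_normal((n // 2, k))
U21, V21 = rng.standard_normal((n // 2, k)), rng.standard_normal((n // 2, k))
H = ("node", [[("full", D1), ("rk", U12, V12)],
              [("rk", U21, V21), ("full", D2)]])

A = np.block([[D1, U12 @ V12.T], [U21 @ V21.T, D2]])     # dense reference
x, y = rng.standard_normal(n), np.zeros(n)
hmatvec(H, x, y, np.arange(n), np.arange(n))
print(np.allclose(y, A @ x))
```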
Algorithm 4 Computation of the scaled matrix $C = \alpha A$ of the $\mathcal{H}$-matrix $A \in \mathbb{R}^{I \times I}$ of blockwise rank $k_A$ (defined on the block cluster quad-tree $T_{I \times I}^{(L)}$) by a scalar value $\alpha$. The index sets $\tau, \sigma$ to be used in the first call to the recursive HMatrix-MScale procedure are such that $\tau = \sigma = I$.
1: procedure HMatrix-MScale($A$, $\alpha$, $C$, $T_{I \times I}^{(L)}$, $\tau$, $\sigma$)
2: Input: $A$, $\alpha$, $T_{I \times I}^{(L)}$, $\tau$, $\sigma$
3: Output: $C$
4:     if ($S(\tau \times \sigma) \neq \emptyset$) then    ▹ $\tau \times \sigma$ is not a leaf of $T_{I \times I}^{(L)}$
5:         for each $\tau' \in S(\tau)$, $\sigma' \in S(\sigma)$ do
6:             HMatrix-MScale($A$, $\alpha$, $C$, $T_{I \times I}^{(L)}$, $\tau'$, $\sigma'$)
7:         end for
8:     else              ▹ $\tau \times \sigma$ is a leaf of $T_{I \times I}^{(L)}$
9:         $C_{\tau \times \sigma} \leftarrow \alpha A_{\tau \times \sigma}$
10:     end if
11: end procedure
Algorithm 5 Matrix–matrix formatted multiplication $C = C \oplus_K (A \odot_K B)$ of the $\mathcal{H}$-matrices $A, B, C \in \mathbb{R}^{I \times I}$, respectively, of blockwise ranks $k_A$, $k_B$, and $k_C$ (defined on the block cluster quad-tree $T_{I \times I}^{(L)}$). The index sets $\tau, \sigma$ to be used in the first call to the recursive HMatrix-MMMult procedure are such that $\tau = \sigma = I$.
1: procedure HMatrix-MMMult($A$, $B$, $C$, $T_{I \times I}^{(L)}$, $\tau$, $\sigma$)
2: Input: $A$, $B$, $C$, $T_{I \times I}^{(L)}$, $\tau$, $\sigma$
3: Output: $C$
4:     if ($S(\tau \times \sigma) \neq \emptyset$) then    ▹ $\tau \times \sigma$ is not a leaf of $T_{I \times I}^{(L)}$
5:         for each $\tau' \in S(\tau)$, $\sigma' \in S(\sigma)$ do
6:             HMatrix-MMMult($A$, $B$, $C$, $T_{I \times I}^{(L)}$, $\tau'$, $\sigma'$)
7:         end for
8:     else              ▹ $\tau \times \sigma$ is a leaf of $T_{I \times I}^{(L)}$
9:         $l \leftarrow$ Get the level of the node $\tau \times \sigma$ of $T_{I \times I}^{(L)}$
10:         $\tilde{C}_{\tau, \sigma} \leftarrow 0$
11:         for each level $i: 0 \leq i \leq l$ do
12:             for each $\zeta \in U(\tau \times \sigma, i)$ do
13:                 $\tilde{C}_{\tau, \sigma} \leftarrow \tilde{C}_{\tau, \sigma} \oplus_K A_{\tau \times \zeta}\, B_{\zeta \times \sigma}$
14:             end for
15:         end for
16:         $C_{\tau \times \sigma} \leftarrow C_{\tau \times \sigma} \oplus_K \tilde{C}_{\tau, \sigma}$
17:     end if
18: end procedure
To evaluate the effectiveness of Algorithm 5, we applied it to the computation of the $n$-th power $A^n$ of a matrix $A$, which is the key ingredient of a matrix polynomial evaluation. We denote the operation of computing the $n$-th power of the HM representation $HM(A, \epsilon, k)$ of $A$ by the symbol $HM(A, \epsilon, k)^n$, where $k$ and $\epsilon$ define the admissibility condition as described in Definition 4.
All the presented results are obtained by implementing Algorithm 5 in the MATLAB environment (see Table 1 for details about the computing resources used for the tests).
Matrices from three case studies are considered; the case studies are described in the following.
  • Case Study #1 The matrix is obtained by using the Matlab Airfoil Example (see [41] for details). The matrix $A_{CS\#1} \in \mathbb{R}^{n \times n}$, $n = 4253$, is structured and sparse, and its condition number in the L2 norm is $\kappa(A_{CS\#1}) = 1.9 \times 10^{1}$. The sparsity pattern of $A_{CS\#1}$, and of its first six $n$-powers, $n = 1, \dots, 6$, are presented in the first row of images in Figure 2.
  • Case Study #2 The matrix is obtained from the SPARSKIT E20R5000 driven cavity example (see [42] for details). The matrix $A_{CS\#2} \in \mathbb{R}^{n \times n}$, $n = 4241$, is structured and sparse, and its condition number in the L2 norm is $\kappa(A_{CS\#2}) = 1.8 \times 10^{10}$. The sparsity pattern of $A_{CS\#2}$, and of its first six $n$-powers, $n = 1, \dots, 6$, are presented in the first row of images in Figure 3.
  • Case Study #3 The matrix is obtained from the Harwell–Boeing Collection BCSSTK24 (BCS Structural Engineering Matrices) example (see [43] for details). The matrix $A_{CS\#3} \in \mathbb{R}^{n \times n}$, $n = 3562$, is structured and sparse, and its condition number in the L2 norm is $\kappa(A_{CS\#3}) = 6.4 \times 10^{11}$. The sparsity pattern of $A_{CS\#3}$, and of its first six $n$-powers, $n = 1, \dots, 6$, are presented in the first row of images in Figure 4.
All the considered matrices $A_{CS\#1,2,3}$ are scaled by the maximum value of their elements. From the first row of the images in Figure 2, Figure 3 and Figure 4, we can observe that the sparsity level of the matrices decreases when the value of the power degree $n$ increases.
In Figure 2, Figure 3 and Figure 4, the HM representations $HM(A_{CS\#1,2,3}, \epsilon, k)^n$ of the matrices $A_{CS\#1,2,3}$, for the values $\epsilon = 10^{-8}, 10^{-6}, 10^{-4}$ and $k = 1, 2, 3$, are shown as sparsity patterns, where red and blue colors, respectively, represent the admissible and inadmissible blocks. The admissible blocks are represented by marking in red just the elements of each sub-block occupied by the elements of its rank-$k$ approximation factors.
The presented results have the aim to
  • evaluate the propagation of the error $\| A_{CS\#1,2,3}^n - [A_{CS\#1,2,3}]_{\epsilon,k}^n \|_2$ as a function of $n$, where $[A_{CS\#1,2,3}]_{\epsilon,k}^n$ denotes the natural representation of the HMs $HM(A_{CS\#1,2,3}, \epsilon, k)^n$ (see (a–c).1 in Figure 5, Figure 6 and Figure 7);
  • compare the number $N(A_{CS\#1,2,3}^n)$ of nonzero elements needed to represent $A_{CS\#1,2,3}^n$ and the number $N(HM(A_{CS\#1,2,3}, \epsilon, k)^n)$ of total nonzero elements in both the admissible and inadmissible blocks of $HM(A_{CS\#1,2,3}, \epsilon, k)^n$ (see (a–c).3 in Figure 5, Figure 6 and Figure 7);
  • compare the theoretical value $k_T^n$ (obtained by repetitively applying the estimate (30)) with the effective value $k_E^n$ (see (a–c).2 in Figure 5, Figure 6 and Figure 7). The value of $k_E^n$ is determined at each step $n$ as
$$ HM(A_{CS\#1,2,3}, \epsilon, k)^n = HM(A_{CS\#1,2,3}, \epsilon, k)^{n-1} \odot HM(A_{CS\#1,2,3}, \epsilon, k), $$ (32)
    based on the following actions:
    after the computation of the product (32) by Algorithm 5, for each admissible block of $HM(A_{CS\#1,2,3}, \epsilon, k)$ (identified by the couple $(\tau, \sigma)$ of index sets), we compute the value $k_{\tau,\sigma}^n$ for which the corresponding block of $HM(A_{CS\#1,2,3}, \epsilon, k)^n$ can be considered admissible (with respect to $\epsilon$ and $k_{\tau,\sigma}^n$);
    we compute $k_E^n$ as
$$ k_E^n = \max_{(\tau, \sigma) \in \mathcal{A}} k_{\tau,\sigma}^n, $$
    where $\mathcal{A}$ is the set of the admissible blocks of $HM(A_{CS\#1,2,3}, \epsilon, k)$.
The sparsity pattern representations in Figure 2, Figure 3 and Figure 4 show how the admissible and inadmissible blocks are distributed. Such information should help to analyze which matrix structure is best suited, in terms of memory occupancy, for an HM representation.
From Figure 5, Figure 6 and Figure 7, several considerations can be drawn:
  • the theoretical estimate $k_T^n$ is a large overestimation of the actual value $k_E^n$ computed in the operations (see plots (a–c).2 in Figure 5, Figure 6 and Figure 7);
  • for all the values of $\epsilon$ considered, the choice of $k$ does not substantially modify the evolution of the error $\| A_{CS\#1,2,3}^n - [A_{CS\#1,2,3}]_{\epsilon,k}^n \|_2$ as $n$ varies (see plots (a–c).1 in Figure 5, Figure 6 and Figure 7);
  • since higher values of $k$ imply a higher number of elements in the HM representation, it is convenient to choose the value $k = 1$ (see plots (a–c).3 in Figure 5, Figure 6 and Figure 7);
  • some matrices seem to be more suitable than others for HM representation. In particular, it seems that matrices whose sparsity pattern is not comparable to that of a band matrix can be represented more effectively, both in terms of the number of elements (for example, see the sparsity pattern of the matrices obtained from Example Test #3 in Figure 4) and in terms of the evolution of the error $\| A_{CS\#1}^n - [A_{CS\#1}]_{\epsilon,k}^n \|_2$.
From Figure 5, Figure 6 and Figure 7 and Figure 2, Figure 3 and Figure 4, we can deduce that, among the example matrices proposed, the one that guarantees the best performance in terms of memory occupancy is the matrix $A_{CS\#3}$. Indeed, the yellow line in plots (a–c).3 in Figure 7 almost always, depending on the value of $\epsilon$, remains above the other lines in the same plot. We recall that the yellow line represents the trend, as a function of the power degree $n$, of the number $N(A_{CS\#3}^n)$ of nonzero elements needed to represent $A_{CS\#3}^n$; the other lines, one for each value of $k$, represent the number $N(HM(A_{CS\#3}, \epsilon, k)^n)$ of total nonzero elements in both the admissible and inadmissible blocks of $HM(A_{CS\#3}, \epsilon, k)^n$. The same behavior is not observable for the other example matrices; indeed, for the matrix $A_{CS\#1}$, the yellow line always remains below the other lines (see plots (a–c).3 in Figure 5), while for the matrix $A_{CS\#2}$, the yellow line only sometimes remains above the other lines, and generally all the lines overlap (see plots (a–c).3 in Figure 6). From the images in Figure 4, we can observe that the admissible blocks concentrate near the diagonal, while the inadmissible blocks approximate the off-diagonal blocks with fewer elements the lower the required approximation precision $\epsilon$ is (see the images related to $\epsilon = 1 \times 10^{-4}$). The matrix $A_{CS\#3}$ also has the best performance in terms of preservation of accuracy in the HM representation of matrices. Such a statement is supported by the error values reported in plots (a–c).1 in Figure 7 compared with the homologous plots in Figure 5 and Figure 6.
Figure 2. Sparsity representation for Example Test #1.
Figure 3. Sparsity representation for Example Test #2.
Figure 4. Sparsity representation for Example Test #3.
Figure 5. Results for Example Test #1: $\epsilon = 1 \times 10^{-8}$ (a), $\epsilon = 1 \times 10^{-6}$ (b), $\epsilon = 1 \times 10^{-4}$ (c). (a–c).1 Trend of the error $\| A_{CS\#1,2,3}^n - [A_{CS\#1,2,3}]_{\epsilon,k}^n \|_2$. (a–c).2 Trends of the numbers $k_T^n$ and $k_E^n$. (a–c).3 Trends of the numbers $N(A_{CS\#1,2,3}^n)$ and $N(HM(A_{CS\#1,2,3}, \epsilon, k)^n)$.
Figure 6. Results for Example Test #2: $\epsilon = 1 \times 10^{-8}$ (a), $\epsilon = 1 \times 10^{-6}$ (b), $\epsilon = 1 \times 10^{-4}$ (c). (a–c).1 Trend of the error $\| A_{CS\#1,2,3}^n - [A_{CS\#1,2,3}]_{\epsilon,k}^n \|_2$. (a–c).2 Trends of the numbers $k_T^n$ and $k_E^n$. (a–c).3 Trends of the numbers $N(A_{CS\#1,2,3}^n)$ and $N(HM(A_{CS\#1,2,3}, \epsilon, k)^n)$.
Figure 7. Results for Example Test #3: $\epsilon = 1 \times 10^{-8}$ (a), $\epsilon = 1 \times 10^{-6}$ (b), $\epsilon = 1 \times 10^{-4}$ (c). (a–c).1 Trend of the error $\| A_{CS\#1,2,3}^n - [A_{CS\#1,2,3}]_{\epsilon,k}^n \|_2$. (a–c).2 Trends of the numbers $k_T^n$ and $k_E^n$. (a–c).3 Trends of the numbers $N(A_{CS\#1,2,3}^n)$ and $N(HM(A_{CS\#1,2,3}, \epsilon, k)^n)$.
All the proposed algorithms related to basic linear algebra operations based on the HM representation can be easily parallelized, due to their recursive formulation, by distributing computations across the different components of the computing resource hierarchy. As an example, we report in Algorithm 6 a possible parallel implementation of Algorithm 3.
Algorithm 6 Parallel matrix–vector multiplication $y = y + A x$ of the $\mathcal{H}$-matrix $A \in \mathbb{R}^{I \times I}$ of blockwise rank $k$ (defined on the block cluster quad-tree $T_{I \times I}^{(L)}$) with the vector $x \in \mathbb{R}^I$. The index sets $\tau \times \sigma$ to be used in the first call to the recursive ParHMatrix-MVM procedure are such that $\tau = \sigma = I$.
1: procedure ParHMatrix-MVM($A$, $y$, $x$, $T_{I \times I}^{(L)}$, $\tau \times \sigma$)
2: Input: $A$, $y$, $x$, $T_{I \times I}^{(L)}$, $\tau \times \sigma$
3: Output: $y$
4:     if $S(\tau \times \sigma) \neq \emptyset$ then     ▹ $\tau \times \sigma$ is not a leaf of $T_{I \times I}^{(L)}$
5:         parallel for each $\tau' \times \sigma' \in S(\tau \times \sigma)$ do num_tasks($N_{Tasks}$) reduction(+: $y$)
6:             ParHMatrix-MVM($A$, $y$, $x$, $T_{I \times I}^{(L)}$, $\tau' \times \sigma'$)
7:         end parallel for
8:     else    ▹ $\tau \times \sigma$ is a leaf of $T_{I \times I}^{(L)}$; then execute the GEMV BLAS2 operation $y_{\tau} \leftarrow y_{\tau} + A_{\tau \times \sigma} x_{\sigma}$
9:         if ($A_{\tau \times \sigma}$ is admissible) then
10:             $\alpha \leftarrow 1$
11:             $\beta \leftarrow 0$
12:             $z \leftarrow$ GEMV($\alpha$, $\Sigma_{\tau \times \sigma} V_{\tau \times \sigma}^T$, $x_{\sigma}$, $\beta$, $z$, $DeviceType$)
13:             $\beta \leftarrow 1$
14:             $y_{\tau} \leftarrow$ GEMV($\alpha$, $U_{\tau \times \sigma}$, $z$, $\beta$, $y_{\tau}$, $DeviceType$)
15:         else
16:             $\alpha \leftarrow 1$
17:             $\beta \leftarrow 1$
18:             $y_{\tau} \leftarrow$ GEMV($\alpha$, $A_{\tau \times \sigma}$, $x_{\sigma}$, $\beta$, $y_{\tau}$, $DeviceType$)
19:         end if
20:     end if
21: end procedure
The pseudocode listed borrows the constructs used by tools such as OpenMP [44]: in particular, it uses the parallel for construct to indicate the distribution of the instructions included in its body among $N_{Tasks}$ concurrent tasks, while the reduction clause indicates that the different contributions to the vector $y$ must be added together at the end of the cycle.
The value of the variable $N_{Tasks}^l$, at each level $l$ of the block cluster quad-tree $T_{I \times I}^{(L)}$, is assumed to be such that $N_{Tasks}^l \leq C_{\tau \times \sigma}$, where $C_{\tau \times \sigma} = |S(\tau \times \sigma)|$ is the cardinality of the set of sons of the index subset $\tau \times \sigma$. If the value of $N_{Tasks}^l$ divides such a cardinality, at most $N_{Tasks}^l$ tasks are spawned, and each task executes $C_{\tau \times \sigma} / N_{Tasks}^l$ new calls to the ParHMatrix-MVM procedure. If $N_{Tasks}^l = 1$, the execution of the $l$-th level of Algorithm 6 coincides with that of Algorithm 3.
We recall that the BLAS (Basic Linear Algebra Subprograms) are a set of routines [45] that provide optimized standard building blocks for performing basic vector and matrix operations. BLAS routines can be classified depending on the types of operands: Level 1, operations involving just vector operands; Level 2, operations between vectors and matrices; and Level 3, operations involving just matrix operands. The operation $y = \beta y + \alpha A x$ is called a GEMV operation when $A$, ($x$, $y$), and ($\alpha$, $\beta$) are, respectively, a matrix, two vectors, and two scalars.
The GEMV BLAS2 operations needed at lines 12, 14, and 18 in Algorithm 6 are implemented by using the most effective component (identified by the macro $DeviceType$) of the computing architecture, through a call to the optimized mathematical software libraries available for that component (for example, the multithreaded version of the Intel MKL library [46] or the cuBLAS library [47] when using, as $DeviceType$, respectively, Intel CPUs or NVIDIA GP-GPU accelerators). All the issues related to the most efficient memory hierarchy accesses are delegated to such optimized versions of the BLAS procedures.
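The leaf update of Algorithm 6 can be sketched as follows in Python/NumPy (sizes and data are illustrative); each matrix–vector product below corresponds to one GEMV call, showing why an admissible block of rank $k \ll m, n$ needs only $O(k(m+n))$ work instead of the $O(mn)$ of a dense leaf.

```python
import numpy as np

# Leaf update of Algorithm 6. For an admissible block stored as U (Sigma V^T), the update
# y_tau <- y_tau + A|_(tau x sigma) x_sigma is performed by two skinny GEMV operations
# (lines 12 and 14); an inadmissible block uses a single dense GEMV (line 18). In a real
# implementation each "@" below would map to a GEMV call of an optimized BLAS (e.g. MKL, cuBLAS).
rng = np.random.default_rng(6)
m, n, k = 2000, 2000, 8
U = rng.standard_normal((m, k))
SVt = rng.standard_normal((k, n))            # the product Sigma V^T, kept as one k x n factor
x = rng.standard_normal(n)
y = np.zeros(m)

z = SVt @ x                                   # GEMV with a k x n matrix:   O(k n)
y = y + U @ z                                 # GEMV with an m x k matrix:  O(m k)
# versus the dense leaf update, a single m x n GEMV costing O(m n):
y_dense = (U @ SVt) @ x
print(np.allclose(y, y_dense))
```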
In Figure 8, an example of the execution tree of Algorithm 6 is shown. In the left part of Figure 8, the block structure of the $\mathcal{H}$-matrix $A$ defined on the block cluster quad-tree $T_{I \times I}^{(L)}$, with $L = 2$, is represented. The admissible and inadmissible blocks are represented, respectively, by yellow and red boxes. In the execution tree of Algorithm 6 (see the right part of Figure 8), the leaves are represented by green boxes. At each of the two levels of the tree, the considered value for $N_{Tasks}^l$ is $N_{Tasks}^l = 4$, $l = 0, 1$. The following steps are executed:
  • Starting from level $l = 0$, $N_{Tasks}^0$ concurrent tasks are spawned; the task with identification number $Id^0 = 2$ is related to a leaf and, the block being an admissible one, the procedure computes the contribution to the sub-block of the vector $y$ related to the index subset $\tau_1^1$ by the code at lines 12 and 14 of Algorithm 6. Each of the remaining tasks executes a parallel for each, spawning other $N_{Tasks}^1$ concurrent tasks executing at the following level $l = 2$, for a total of $3 N_{Tasks}^1 = 12$ concurrent tasks.
  • At the level $l = 2$, all the blocks are leaves. If the blocks are admissible, they are used to compute contributions to the sub-blocks of the vector $y$ by the code at lines 12 and 14 of Algorithm 6; otherwise, the same sub-blocks are updated by the code at line 18. In particular, assuming that the variable $Id_{Id^{l-1}}^l$ is used to represent the task identification number of a task spawned at the level $l$ by a task with identifier $Id^{l-1}$ at the $(l-1)$-th level,
    tasks with identification numbers $Id_1^1 = 1, 2$ compute contributions to sub-blocks of $y$ related to the index subset $\tau_1^2$, and tasks with identification numbers $Id_1^1 = 3, 4$ update sub-blocks of $y$ related to the index subset $\tau_2^2$. Then, all the tasks spawned from task $Id^0 = 1$ (see Figure 8(b.1)) compute contributions to sub-blocks of $y$ related to the index subset $\tau_1^1$.
    In the same way, tasks with identification numbers $Id_3^1 = 1, 2$ compute contributions to sub-blocks of $y$ related to the index subset $\tau_3^2$, and tasks with identification numbers $Id_3^1 = 3, 4$ compute contributions to sub-blocks of $y$ related to the index subset $\tau_4^2$. Then, all the tasks spawned from task $Id^0 = 3$ (see Figure 8(b.2)) compute contributions to sub-blocks of $y$ related to the index subset $\tau_2^1$.
    In the same way, the tasks with identification numbers $Id_4^1 = 1, 2, 3, 4$ (see Figure 8(b.3)) also compute contributions to sub-blocks of $y$ related to the index subset $\tau_2^1$.
  • At the termination of the parallel for at levels $l = 1, 0$, the contributions to the sub-blocks of $y$ related to the index subsets $\tau_1^1$ and $\tau_2^1$ are summed together (by means of the reduce operation) to obtain the final status of the vector $y$.

3. Matrix Polynomial Evaluation

Let $p_n(A)$ be a real polynomial of degree $n$ of the matrix $A \in \mathbb{R}^{M \times M}$, where $\{\alpha_i\}_{i=0,\dots,n}$ is the set of its coefficients:
$$ p_n(A) = \sum_{i=0}^{n} \alpha_i A^i. $$ (33)
Different methods can be used to evaluate polynomials defined by Equation (33) [1]. We propose the one of Paterson and Stockmeyer [48], in which $p_n(A)$ is written as
$$ p_n(A) = \sum_{k=0}^{r} B_k (A^s)^k, \quad r = \lfloor n/s \rfloor, $$ (34)
where $s$ is an integer parameter and
$$ B_k = \sum_{i=0}^{s_k} \alpha_{sk+i} A^i, \quad \text{where} \quad s_k = \begin{cases} s - 1 & \text{if } k < r \\ n - rs & \text{if } k = r. \end{cases} $$ (35)
After the powers $A^i$, $i = 1, \dots, s$, are computed, the polynomial defined in Equation (34) can be evaluated by Horner's method [1], where each $B_k$ is formed when needed.
The two extreme cases, $s = 1$ and $s = n$, reduce, respectively, to Horner's method and to the method that evaluates the polynomial via explicit powers.
The total cost $TC_{PS}(s, r, M)$ of the polynomial evaluation is
$$ TC_{PS}(s, r, M) = (s + r - 1 - f(r))\, MCM, $$ (36)
where
$$ f(r) = \begin{cases} 1 & \text{if } r = 0 \\ 0 & \text{otherwise,} \end{cases} $$ (37)
and where $MCM$ denotes the computational cost of a matrix multiplication $AB$ with $A, B \in \mathbb{R}^{M \times M}$.
$TC_{PS}(s, r, M)$, defined as in Equation (36), is approximately minimized by $s = \sqrt{n}$. From Equation (36), we can argue that the described method requires much less work than other ones, such as Horner's method (whose computational cost is $TC_H(n, M) = (n-1)\, MCM$), for large $n$.
In Algorithm 7, a procedure for matrix polynomial evaluation based on the Paterson–Stockmeyer method is presented. The version of Algorithm 7 based on the HM representation of the involved matrices, and hence on the algorithms introduced in Section 2, is listed in Algorithm 8.
Algorithm 7 Procedure for matrix polynomial evaluation based on the Paterson–Stockmeyer method. $A$ and $P$ represent, respectively, the matrix and the result of Equation (33). $n$ represents the degree of the polynomial and $\alpha$ the vector of the polynomial coefficients $\{\alpha_i\}_{i=0,\dots,n}$.
1: procedure MatPolyEvaluation($A$, $P$, $\alpha$, $n$, $s$)
2: Input: $A$, $\alpha$, $n$, $s$
3: Output: $P$
4:     $r \leftarrow$ Compute $r = \lfloor n/s \rfloor$
▹ Compute and store the first $s$ powers of $A$
5:     for $i = 0, \dots, s$ do
6:         $PowS_i \leftarrow A^i$
7:     end for
▹ Compute and store the first $r$ powers of $A^s$
8:     for $k = 0, \dots, r$ do
9:         $PowR_k \leftarrow (PowS_s)^k$
10:         $t_k \leftarrow$ Compute $t_k = (k < r)\ ?\ s - 1 : n - rs$
11:     end for
▹ Compute and store $B_k$, $k = 0, \dots, r$
12:     for $k = 0, \dots, r$ do
13:         $Bk_k \leftarrow 0$
14:         for $i = 0, \dots, t_k$ do
15:             $Bk_k \leftarrow$ Compute $Bk_k = Bk_k + \alpha_{sk+i}\, PowS_i$
16:         end for
17:     end for
▹ Compute $P$
18:     $P \leftarrow 0$
19:     for $k = 0, \dots, r$ do
20:         $P \leftarrow$ Compute $P = P + Bk_k\, PowR_k$
21:     end for
22: end procedure
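A compact Python/NumPy sketch of Algorithm 7, assuming dense matrices (the HM-based variant of Algorithm 8 would replace every sum and product with Algorithms 2, 4 and 5), is reported below, together with a check against Horner's method; the test matrix and coefficients are illustrative.

```python
import numpy as np

def paterson_stockmeyer(A, alpha, s):
    """Evaluate p_n(A) = sum_i alpha[i] A^i following Algorithm 7 (Paterson-Stockmeyer)."""
    n = len(alpha) - 1
    r = n // s
    I = np.eye(A.shape[0])
    pow_s = [I]                                   # A^0, ..., A^s
    for _ in range(s):
        pow_s.append(pow_s[-1] @ A)
    As = pow_s[s]
    P = np.zeros_like(A)
    Ask = I                                       # (A^s)^k, built incrementally
    for k in range(r + 1):
        sk = s - 1 if k < r else n - r * s        # degree of the block polynomial B_k
        Bk = sum(alpha[s * k + i] * pow_s[i] for i in range(sk + 1))
        P = P + Bk @ Ask
        Ask = Ask @ As
    return P

def horner(A, alpha):
    """Reference evaluation of the same polynomial by Horner's method."""
    P = alpha[-1] * np.eye(A.shape[0])
    for a in reversed(alpha[:-1]):
        P = P @ A + a * np.eye(A.shape[0])
    return P

rng = np.random.default_rng(7)
A = 0.5 * rng.standard_normal((60, 60))
alpha = rng.uniform(0.0, 1.0, 13)                 # random coefficients of a degree-12 polynomial
print(np.allclose(paterson_stockmeyer(A, alpha, s=4), horner(A, alpha)))
```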
To evaluate Algorithm 8, where the involved matrices have an HM representation, we performed some tests using the same case studies described in the previous Section 2. The coefficients of the polynomial are generated randomly from a uniform distribution in the interval $[0, 1]$.
The presented results (see Table 2) have the aim to show the evolution of the errors $E(A_{CS\#1,2,3}, n, s)$, as a function of the polynomial degree $n$, for the fixed value $s = 4$, where
$$ E(A_{CS\#1,2,3}, n, s) = \| p_n(A_{CS\#1,2,3}, s) - p_n(HM(A_{CS\#1,2,3}, \epsilon, k), s) \|_2, $$ (38)
and where the symbols $p_n(A_{CS\#1,2,3}, s)$ and $p_n(HM(A_{CS\#1,2,3}, \epsilon, k), s)$ represent
  • $p_n(A_{CS\#1,2,3}, s)$: the evaluation of the polynomial (33) through Algorithm 7 for the matrix $A_{CS\#1,2,3}$;
  • $p_n(HM(A_{CS\#1,2,3}, \epsilon, k), s)$: the natural representation of the evaluation of the polynomial (33) by means of Algorithm 8, where the involved matrices have an HM representation. In such a case, all the operations involving matrices (summation, product, and exponentiation) are based on Algorithms 2 and 5. In the summation operations $C = A \oplus_{k_C} B$, the value $k_C$ of the result $C$ is computed as the maximum value between the values $k_A$ and $k_B$ of the operands $A$ and $B$.
Algorithm 8 Procedure for matrix polynomial evaluation based on the Paterson–Stockmeyer method and on the HM representation of the involved matrices (defined on the block cluster quad-tree $T_{I \times I}^{(L)}$). $HMA$ and $HMP$ represent, respectively, the HM representation of the matrix $A$ and of the result $P$ of Equation (33). $n$ represents the degree of the polynomial and $\alpha$ the vector of the polynomial coefficients $\{\alpha_i\}_{i=0,\dots,n}$.
1: procedure MatPolyEvaluation($HMA$, $HMP$, $\alpha$, $n$, $s$)
2: Input: $HMA$, $\alpha$, $n$, $s$
3: Output: $HMP$
4:     $r \leftarrow$ Compute $r = \lfloor n/s \rfloor$
▹ Compute and store the first $s$ powers of $A$
5:     $HMPowS_{1:s} \leftarrow$ HMatrix-Power($HMA$, $s$)
▹ Compute and store the first $r$ powers of $A^s$
6:     $HMPowR_{1:r} \leftarrow$ HMatrix-Power($HMPowS_s$, $r$)
7:     for $k = 0, \dots, r$ do
8:         $s_k \leftarrow$ Compute $s_k = (k < r)\ ?\ s - 1 : n - rs$
9:     end for
▹ Compute and store $B_k$, $k = 0, \dots, r$
10:     for $k = 0, \dots, r$ do
11:         $HMBk_k \leftarrow 0$
12:         for $i = 0, \dots, s_k$ do
▹ Compute the scaled matrix $HMC = \alpha_{sk+i}\, HMPowS_i$ by Algorithm 4
13:             HMatrix-MScale($HMPowS_i$, $\alpha_{sk+i}$, $HMC$)
▹ Compute $HMBk_k$ by an iterative updating using Algorithm 2
14:             HMatrix-MSum($HMBk_k$, $HMC$, $HMBk_k$, $T_{I \times I}^{(L)}$, $I$, $I$)
15:         end for
16:     end for
▹ To compute $P$, consecutively apply (for $r+1$ times) Algorithm 5. At each step, the value of $K$ is computed as already described for Equation (32). The value of $K$ is computed as the maximum value between the values of $K_C$ and $K_{AB}$.
17:     $HMP \leftarrow 0$
18:     for $k = 0, \dots, r$ do
19:         HMatrix-MMMult($HMBk_k$, $HMPowR_k$, $HMP$, $T_{I \times I}^{(L)}$, $I$, $I$)
20:     end for
21: end procedure

22: function HMatrix-Power($HMA$, $s$)
23: Input: $HMA$, $s$
24:     $HMAPOWS_1 \leftarrow HMA$
▹ Consecutively apply (for $s-1$ times) Algorithm 5. At each step, the value of $K$ is computed as already described for Equation (32)
25:     for $i = 2, \dots, s$ do
26:         $HMAPOWS_i \leftarrow 0$
27:         HMatrix-MMMult($HMA$, $HMAPOWS_{i-1}$, $HMAPOWS_i$, $T_{I \times I}^{(L)}$, $I$, $I$)
28:     end for
29:     return $HMAPOWS_{1:s}$
30: end function
From Table 2, it can be observed that the evolution of the errors $E(A_{CS\#1,2,3}, n, s)$, as a function of the polynomial degree $n$, seems to be quite sensitive to the matrix type, even with the risk that such errors may explode. Such behavior can have serious consequences on the results of operations involving this kind of polynomial. This is the case of the example shown in the case study described in Section 4, where the result of a matrix polynomial evaluation is used in a matrix–vector operation of the kind $A b = c$. If we denote by $\delta A$ the error in representing $A$ and by $c + \delta c$ the result of the operation $(A + \delta A) b$, then we have
$$ \frac{\| \delta c \|_2}{\| c \|_2} \leq \frac{\| \delta c + c \|_2 + \| c \|_2}{\| c \|_2} = 1 + \frac{\| c + \delta c \|_2}{\| c \|_2} = 1 + \frac{\| (A + \delta A) b \|_2}{\| c \|_2} \leq 2 + \frac{\| \delta A \|_2 \| b \|_2}{\| c \|_2}. $$ (39)
Therefore, from (39), it follows that the relative error $\frac{\| \delta c \|_2}{\| c \|_2}$ of the result of the perturbed matrix–vector operation has an upper limit depending on the norm of the error $\delta A$ on the matrix. So, if $\| \delta A \|_2$ has a large value, $\frac{\| \delta c \|_2}{\| c \|_2}$ may also potentially have a large value.
Algorithm 7 for matrix polynomial evaluation based on the Paterson–Stockmeyer method (and hence Algorithm 8, which is its HM-based variant) lends itself well to parallel implementations. For example, if we imagine dividing the $r$ matrices $Bk_k$, $k = 1, \dots, r$, among $N_{Tasks}$ concurrent tasks, then
  • the $s$ matrices $PowS_i$, $i = 0, \dots, s$, and the $r$ matrices $PowR_k$, $k = 0, \dots, r$ (see lines 6 and 9 of Algorithm 7), could be computed by all the $N_{Tasks}$ tasks concurrently and independently of each other (no communications are needed among tasks);
  • during the update phase of the polynomial $P$ (see line 20 of Algorithm 7), the local update of $P$ on each task $id_t$ can be performed concurrently (no communications are needed among tasks). To complete the computation of $P$, just one collective communication is needed at the end of the algorithm.
Furthermore, every call to a function/procedure of the kind HMatrix-* in Algorithm 8 can be executed by each task using a locally available type of concurrency (a set of CPUs, cores, or accelerators), fully exploiting the hierarchy of processing units (see Algorithm 6 for an example of such a strategy).
In Figure 9, we show an example of a parallel implementation of Algorithm 7 (and then of Algorithm 8) with $N_{Tasks} = 4$ and $r = 7$. Each task may be mapped to a node holding devices such as CPUs and accelerators. The computation of the operations involving HMs can be locally executed by an execution tree such as the one shown in Figure 8.

4. A Case Study in Graph Convolutional Deep Neural Network Applications

As very effective tools for learning on graph-structured data (that is, for building a data-driven model from such data), Graph Neural Networks (GNNs) have shown cutting-edge performance in a variety of applications, including biological network modeling and social network analysis. Compared to analyzing data separately, the special capability of graphs to capture the structural relationships between data allows for the extraction of additional insights. Among the GNNs, Graph Convolutional Deep Neural Networks [49] appear as one of the most prominent graph deep learning models [50].

4.1. Introduction to Graph Convolutional Deep Neural Network

Graph Convolutional Deep Neural Networks are based on the theory of spectral analysis of a graph [50,51,52,53]. Spectral graph theory is the field concerned with the study of the eigenvectors and eigenvalues of the matrices that are naturally associated with graphs. One of its goals is to determine important properties of the graph from its spectrum.
Apart from theoretical interest, spectral graph theory also has a wide range of applications. Among them, we have to cite the construction of graph filters, which are defined in the context of Discrete Signal Processing on Graphs (DSP$_G$), whose aim is to represent, process, and analyze structured datasets that can be represented by graphs [52,54,55].
Let us introduce some definitions.
Definition 5 (Signal defined on a weighted undirected graph $\mathcal{G}$). A weighted undirected graph $\mathcal{G}$ is a triple $(\mathcal{N}, \mathcal{E}, w)$, where $\mathcal{N} = \{n_1, \dots, n_{N_{nodes}}\}$ is the set of $N_{nodes}$ nodes and $\mathcal{E}$ is the set of $N_{edges}$ edges. An edge $e_{i,j} \in \mathcal{E}$ is a couple of indices $(i, j)$ such that the nodes $n_i$ and $n_j$ are considered connected in the graph $\mathcal{G}$.
The set $w = \{w_{i,j}, (i, j) \in \mathcal{E}\}$ of $N_{edges}$ values is called the set of weights of $\mathcal{G}$, where
$$ w_{i,j} > 0, \quad w_{i,j} = w_{j,i}. $$
The matrix $A = [a_{i,j}]_{i,j=1,\dots,N_{nodes}}$ is the weighted adjacency matrix of the graph $\mathcal{G}$ if
$$ a_{i,j} = \begin{cases} w_{i,j} & \text{if } (i, j) \in \mathcal{E}, \\ 0 & \text{otherwise.} \end{cases} $$
Assuming, without loss of generality, that the dataset elements are real scalars, we define a graph signal as a map $s$ from the set $\mathcal{N}$ of nodes to the set of real numbers $\mathbb{R}$:
$$ s : n_i \in \mathcal{N} \rightarrow s_i \in \mathbb{R}. $$
For simplicity of discussion, we write graph signals as vectors $s = [s_i]_{i=1,\dots,N_{nodes}}^T$.
Definition 6 (Linear filters for a signal defined on a weighted undirected graph $\mathcal{G}$). A function $h : s \in \mathbb{R}^N \rightarrow h(s) \in \mathbb{R}^N$ is called a filter on a graph $\mathcal{G}$. A filter $h$ on a graph $\mathcal{G}$ is called a Linear Filter if a matrix $H$ exists such that
$$ h(s) = H s. $$
Definition 7 (Graph Fourier Transform of a signal defined on a weighted undirected graph $\mathcal{G}$). Let us define the Graph Laplacian matrix $L \in \mathbb{R}^{N_{nodes} \times N_{nodes}}$ of the weighted undirected graph $\mathcal{G}$ as the following symmetric positive semidefinite matrix:
$$ L = D - A, $$
where $D = \mathrm{diag}(d) \in \mathbb{R}^{N_{nodes} \times N_{nodes}}$ is a diagonal matrix whose diagonal vector $d = [d_i]_{i=1,\dots,N_{nodes}}$ is defined as
$$ d_i = \sum_{j=1}^{N_{nodes}} A_{i,j}. $$
The matrix $D$ is called the degree matrix. In some cases, the Symmetric Normalized Laplacian is defined as
$$ L_{SymNorm} = D^{-1/2} L D^{-1/2}. $$
Let us consider the singular value decomposition (SVD) [38] of the Laplacian matrix $L$; then, due to the properties of $L$, such a decomposition can be written as
$$ L = U \Lambda U^T, $$
where the columns $[u_j]_{j=1,\dots,N_{nodes}}$ of the matrix $U$ form a set of orthonormal vectors called Graph Fourier Modes and where the diagonal elements $[\lambda_i]_{i=1,\dots,N_{nodes}}$ of the diagonal matrix $\Lambda$ are non-negative and are identified as the frequencies of the graph.
Following Shuman et al. [56], the Graph Fourier Transform $\hat{s}$ of a signal $s$ defined on a weighted undirected graph $\mathcal{G}$ is
$$ \hat{s} = U^T s. $$
The inverse operation $s = U \hat{s}$ is the Inverse Graph Fourier Transform of the signal $\hat{s}$.
Definition 8 
(Convolution operation between signals defined on a weighted undirected graph $\mathcal{G}$). Let $s_1$ and $s_2$ be two signals defined on a weighted undirected graph $\mathcal{G}$; we can define the following convolution operation $s_1 *_{\mathcal{G}} s_2$ between the signals $s_1$ and $s_2$ such that
$$ s_1 *_{\mathcal{G}} s_2 = U \left( (U^T s_1) \odot (U^T s_2) \right), \qquad (48) $$
where $\odot$ represents the element-wise Hadamard product.
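The following short NumPy sketch illustrates Definitions 7 and 8 on a toy graph; the function and variable names are ours, and the code is only a didactic rendering of the formulas above, not software used in the paper.

```python
# Illustrative NumPy sketch of Definitions 7 and 8: Laplacian, Graph Fourier
# Transform, and spectral convolution on a small weighted undirected graph.
import numpy as np

def laplacians(A):
    """Return the Laplacian L = D - A and its symmetric normalization
    D^{-1/2} L D^{-1/2} for a weighted adjacency matrix A."""
    d = A.sum(axis=1)                              # degrees
    L = np.diag(d) - A
    D_isqrt = np.diag(1.0 / np.sqrt(np.where(d > 0, d, 1.0)))
    return L, D_isqrt @ L @ D_isqrt

def gft(U, s):       return U.T @ s                # s_hat = U^T s
def igft(U, s_hat):  return U @ s_hat              # s = U s_hat

def graph_convolution(U, s1, s2):
    """s1 *_G s2 = U ((U^T s1) ⊙ (U^T s2)), with ⊙ the Hadamard product."""
    return igft(U, gft(U, s1) * gft(U, s2))

# tiny 4-node example
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L, L_sym = laplacians(A)
lam, U = np.linalg.eigh(L)                         # L = U diag(lam) U^T, lam >= 0
s1, s2 = np.array([1., 0., 0., 0.]), np.array([0.5, 0.5, 0., 0.])
print(graph_convolution(U, s1, s2))
```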
Definition 9 
(Spectral filtering of a signal defined on a weighted undirected graph $\mathcal{G}$). Let $p_{K,\theta}(L)$ be a polynomial of the matrix $L$ with order $K$, where the set $\theta = \{ \theta_k \}_{k = 0, \ldots, K}$ is the set of its coefficients; then,
$$ p_{K,\theta}(L) = \sum_{k=0}^{K} \theta_k L^k. \qquad (49) $$
Filtering a signal $s$, defined on a weighted undirected graph $\mathcal{G}$, by a filter of the kind (49) is equivalent to the convolution operation $p *_{\mathcal{G}} s$, where
$$ p = U \hat{p} \qquad (50) $$
and where
$$ \hat{p} = \left( \sum_{k=0}^{K} \theta_k \lambda_1^k, \; \ldots, \; \sum_{k=0}^{K} \theta_k \lambda_{N_{nodes}}^k \right). \qquad (51) $$
As observed in [4,5], spectral filters represented by $K$-th order polynomials of the Laplacian are exactly $K$-localized. Indeed,
$$ \mathrm{dist}_{\mathcal{G}}(n_i, n_j) > k \;\Longrightarrow\; (L^k)_{i,j} = 0, \qquad (52) $$
where $\mathrm{dist}_{\mathcal{G}}(n_i, n_j)$ is the shortest path distance, i.e., the minimum number of edges connecting the two nodes $n_i$ and $n_j$ on the graph.
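As a worked illustration of Definition 9 and of the $K$-locality property, the sketch below (our own, under the assumption that only the filtered signal $p_{K,\theta}(L)\,s$ is needed) evaluates the filter through repeated matrix–vector products on a path graph, where the vanishing of the response beyond $K$ hops can be checked directly.

```python
# Minimal sketch (ours, not one of the paper's Algorithms): filtering a graph signal
# by p_{K,theta}(L) = sum_k theta_k L^k using only matrix-vector products, so that
# the K-locality of the filter is preserved by construction.
import numpy as np

def polynomial_filter(L, theta, s):
    """Return p_{K,theta}(L) s with K = len(theta) - 1."""
    y = theta[0] * s
    Lks = s.copy()
    for th in theta[1:]:
        Lks = L @ Lks               # L^k s, built incrementally
        y += th * Lks
    return y

# path graph on 6 nodes: node 0 can influence node j only if the filter order K >= j
A = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
L = np.diag(A.sum(axis=1)) - A
s = np.zeros(6); s[0] = 1.0
theta = [0.5, 0.3, 0.2]             # K = 2: the response must vanish beyond 2 hops
print(polynomial_filter(L, theta, s))   # entries 3..5 are exactly zero
```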
In the field of machine learning based on neural networks, the most widely used approach to build a data-driven model is related to a data fitting problem. In such a case, a function $f_{\alpha} : \Re^n \rightarrow \Re$, defined through a set of $k$ parameters $\alpha = \{ \alpha_j \}_{j = 1, \ldots, k}$, should be determined by a learning process on known information and subsequently used to predict/describe new information.
With the term data fitting, we denote the process of constructing a mathematical function $g : \Re^n \rightarrow \Re$ (the model) that has the best fit to a series of data points $\{ (x_i, y_i) \}_{i = 1, \ldots, m}$, with $x_i \in \Re^n$ and $y_i \in \Re$. Curve fitting can involve either interpolation, where an exact fit to the data is required (i.e., $g(x_i) = y_i$, $i = 1, \ldots, m$), or smoothing, in which a smooth function $g$ is constructed that approximately fits the data; that is,
$$ \left\| \left( g(x_i) \right)_{i = 1, \ldots, m} - \left( y_i \right)_{i = 1, \ldots, m} \right\| < \epsilon $$
for some small value of $\epsilon$ and some norm $\| \cdot \|$ defined on $\Re^m$.
Definition 10 
(Graph Convolutional Deep Neural Network (DNN) defined on a weighted undirected graph $\mathcal{G}$). Let $N$ be a DNN composed of $L$ layers and let $N_{nodes}$ be the number of neurons in the $l$-th layer of $N$. Suppose that the $l$-th layer of neurons represents the elements of a signal $s^l$ defined on a weighted undirected graph $\mathcal{G}$. Let us assume, for each $l$-th layer of $N$, a spectral filter $p_{K^l, \theta^l}(L)$ as defined in Definition 9.
The fitting function $f^N$ of a Graph Convolutional DNN $N$ is defined as the following function composition:
$$ f^N \left( s, \{ \theta^l \}_{l = 2, \ldots, L} \right) = \bigcirc_{l=2}^{L} f_l^N \left( s^{l-1}, \theta^l \right). \qquad (53) $$
Regarding the composition of multiple functions in (53), let $f : A \rightarrow B$ and $g : B \rightarrow C$ be two functions. Then, the composition of $f$ and $g$, denoted by $g \circ f$, is defined as the function $g \circ f : A \rightarrow C$ given by $(g \circ f)(x) = g(f(x))$, $\forall x \in A$.
Given a set of $M$ functions $\{ f_i \mid i = 1, \ldots, M \}$ such that $f_i : A_i \rightarrow A_{i+1}$, with the symbol $\bigcirc_{i=1}^{M} f_i$ it is intended
$$ \bigcirc_{i=1}^{M} f_i = f_M \circ f_{M-1} \circ \cdots \circ f_2 \circ f_1. \qquad (54) $$
Each function $f_l^N$ in (53) is defined as the composed function
$$ f_l^N \left( s^{l-1}, \theta^l \right) = \sigma \left( y^l \right), \qquad (55) $$
where $\sigma(y)$ is the so-called activation function, where
$$ y^l = p_{K^l, \theta^l}(L) \, s^{l-1}, \qquad (56) $$
and where $s^1 = s$.
We observe that at the basis of both the learning and predicting phases of a Graph Convolutional DNN are the operations required by Equation (55), that is, for each $l = 2, \ldots, L$ (a minimal dense sketch of these two steps is given after this list):
  • given the values of the parameters $\theta^l$, the evaluation of the matrix polynomial of type (49) that computes the matrix $p_{K^l, \theta^l}(L)$;
  • given the matrix $p_{K^l, \theta^l}(L)$, the computation of the matrix–vector product $p_{K^l, \theta^l}(L) \, s^{l-1}$.
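The dense sketch below illustrates these two per-layer steps and their composition across layers; it is a simplified stand-in (with NumPy in place of the HM-based Algorithms 8 and 3, and with hypothetical function names) intended only to fix ideas.

```python
# Hypothetical sketch of the two per-layer operations listed above:
# (1) evaluate p_{K_l,theta_l}(L), (2) apply it to the previous layer's signal,
# then apply the activation sigma. Dense NumPy replaces the HM-based kernels.
import numpy as np

def evaluate_polynomial(L, theta):
    """Step 1: P = sum_k theta_k L^k (via explicit powers, for clarity only)."""
    P = np.zeros_like(L)
    Lk = np.eye(L.shape[0])
    for th in theta:
        P += th * Lk
        Lk = Lk @ L
    return P

def gc_layer(L, theta, s_prev, sigma=np.tanh):
    """One layer: s_l = sigma( p_{K_l,theta_l}(L) s_{l-1} )."""
    P = evaluate_polynomial(L, theta)     # step 1: matrix polynomial
    return sigma(P @ s_prev)              # step 2: matrix-vector product + activation

def gc_forward(L, thetas, s):
    """Compose the layers l = 2,...,L as in the definition of f^N."""
    for theta in thetas:
        s = gc_layer(L, theta, s)
    return s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)   # path graph on 5 nodes
    L = np.diag(A.sum(axis=1)) - A
    thetas = [rng.uniform(0, 1, size=3), rng.uniform(0, 1, size=3)]  # two layers, K_l = 2
    print(gc_forward(L, thetas, np.ones(5)))
```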

4.2. Preliminary Tests on Two Toy Examples from GC-DNN Context

In the following, some tests are presented to evaluate how the use of Algorithms 8 and 3, both based on the HM representation of the involved matrices and used, respectively, in steps 1 and 2 above, impacts the accuracy of the computed results. The tests aim to assess the feasibility of using HM-based methods during the relevant phases of a GC-DNN.
Two examples are used. They are based on graphs constructed from two different types of affinity functions applied to the image $I_{toy}$ shown in Figure 10. $I_{toy}$ is a grayscale image composed of $n_I = 100 \times 100$ pixels, with 256 gray levels. In particular,
Example 1. 
A mutual 5-nearest-neighbor weighted graph $\mathcal{G}^{SGAF}$ [51] is considered, where the set $\mathcal{N}$ of nodes coincides with the set of all the pixels of $I_{toy}$. Two nodes $n_i, n_j \in \mathcal{N}$ are considered connected by an edge if both $n_i$ is among the 5 nearest neighbors of $n_j$ and $n_j$ is among the 5 nearest neighbors of $n_i$. The weight $w_{i,j}^{SGAF}$ between connected nodes $n_i, n_j \in \mathcal{N}$ is computed by the Spatial Gaussian affinity function [57] defined as
$$ w_{i,j}^{SGAF} = \exp \left( - \left\| \frac{x_i - x_j}{h_{SGAF}} \right\|^2 \right), $$
where $x_i$ is the coordinate vector of the pixel associated with the node $n_i$ and $h_{SGAF} = 5$. The weighted adjacency matrix $A^{SGAF}$ of the graph $\mathcal{G}^{SGAF}$ is defined as in Equation (40). The considered Laplacian matrix $L^{SGAF}$ is the Symmetric Normalized one defined in Equation (45) (see Figure 11a for the sparsity pattern of $L^{SGAF}$), and its condition number in the L2 norm is $\kappa(L^{SGAF}) = 1.691915 \times 10^{17}$.
Example 2. 
A weighted graph $\mathcal{G}^{SPGAF}$ is considered, where the set $\mathcal{N}$ of nodes coincides with the set of all the pixels of $I_{toy}$. Two nodes $n_i, n_j \in \mathcal{N}$ are considered connected by an edge if $n_i$ is among the 50 nearest neighbors of $n_j$ and $n_j$ is among the 50 nearest neighbors of $n_i$ with respect to both of the following affinities: the Spatial Gaussian affinity function and the Photometric Gaussian affinity function [57]. The weight $w_{i,j}^{SPGAF}$ between connected nodes $n_i, n_j \in \mathcal{N}$ is then defined as
$$ w_{i,j}^{SPGAF} = \exp \left( - \left\| \frac{x_i - x_j}{h_{SGAF}} \right\|^2 \right) \exp \left( - \left( \frac{z_i - z_j}{h_{PGAF}} \right)^2 \right), $$
where $x_i$ is the coordinate vector of the pixel associated with the node $n_i$, $z_i$ is the grayscale value of the pixel associated with the node $n_i$, $h_{SGAF} = 10$, and $h_{PGAF} = 5$. The weighted adjacency matrix $A^{SPGAF}$ of the graph $\mathcal{G}^{SPGAF}$ is defined as in Equation (40). The considered Laplacian matrix $L^{SPGAF}$ is based on Equation (43) (see Figure 11b for the sparsity pattern of $L^{SPGAF}$), and its condition number in the L2 norm, $\kappa(L^{SPGAF})$, is numerically infinite.
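For readers who want to reproduce the flavor of these constructions, the following sketch shows one possible way (our reading of [57], with illustrative parameter names) to assemble a mutual k-nearest-neighbor adjacency matrix with Gaussian spatial and, optionally, photometric weights; the authors' actual preprocessing may differ in details such as the neighbor selection and the scaling of the affinities.

```python
# Illustrative sketch (ours, not the authors' code) of how a weighted adjacency
# matrix like A^{SGAF} or A^{SPGAF} can be assembled for a small image: mutual
# k-nearest-neighbour selection on pixel coordinates, with Gaussian weights.
import numpy as np

def mutual_knn_gaussian_adjacency(coords, values=None, k=5, h_s=5.0, h_p=5.0):
    n = coords.shape[0]
    # pairwise squared spatial distances
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :k]          # k nearest neighbours of each node
    is_nn = np.zeros((n, n), dtype=bool)
    is_nn[np.arange(n)[:, None], knn] = True
    mutual = is_nn & is_nn.T                     # mutual k-NN graph (Example 1 style)
    W = np.exp(-d2 / h_s**2)                     # spatial Gaussian affinity
    if values is not None:                       # optional photometric factor (Example 2 style)
        z2 = (values[:, None] - values[None, :]) ** 2
        W = W * np.exp(-z2 / h_p**2)
    return np.where(mutual, W, 0.0)

# pixels of a tiny 4x4 "image"
ys, xs = np.mgrid[0:4, 0:4]
coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
gray = np.linspace(0, 255, coords.shape[0])
A = mutual_knn_gaussian_adjacency(coords, values=gray, k=5, h_s=5.0, h_p=5.0)
L = np.diag(A.sum(axis=1)) - A                   # Laplacian as in Equation (43)
```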
In both examples, no scaling operation is performed on the Laplacian matrix. Moreover, the HM representation of the Laplacian matrices is based on the following parameters: $\epsilon = 1 \times 10^{-8}$, $k = 1$. The coefficients of the polynomial are generated randomly from a uniform distribution on the interval $[0, 1]$.
In Table 3, the following quantities are listed for different values of the polynomial degree $n$ and of the degree $s$ of the factors $B_k$:
  • the norm $E_{L^{*}, n, s}$, already defined in Section 3, of the difference between the results of the polynomial evaluation performed with and without an HM representation of the matrices;
  • the norm $E^{s}_{L^{*}}$ of the difference between the results of the matrix–vector operations performed with and without HM-based algorithms;
where $L^{*}$ denotes $L^{SGAF}$ for Example 1 and $L^{SPGAF}$ for Example 2.
Looking at Table 3, the explosion of both errors $E_{L^{*}, n, s}$ and $E^{s}_{L^{*}}$ attracts attention (especially for the second example). As suggested by Higham [58], the results of these preliminary tests confirm the problems related to the evaluation of matrix powers $M^n$. In particular, what makes the difference is not only the matrix condition number but also the behavior of the sequence of norms $\| M^k \|$, $k = 1, \ldots, n$. In Figure 12, we report the trends of the L2 norm of the Laplacian matrices' powers $(L^{*})^i$ for both GC-DNN toy examples as functions of the power index $i$.
Therefore, while the possibility of using HMs in the GC-DNN context remains an interesting option, much work remains to be done to (1) identify the matrix characteristics that make a matrix less sensitive to error amplification or (2) define mitigation strategies that reduce such amplification.
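A simple diagnostic in the spirit of Figure 12 can be obtained by monitoring the growth of $\| L^i \|_2$ with the power index $i$; the snippet below (illustrative only, with a randomly generated matrix standing in for the Laplacians of the examples) shows how such a trend can be computed.

```python
# A small diagnostic in the spirit of Figure 12 (names and data are ours): a fast
# growth of ||L^i||_2 hints at how strongly rounding errors in the evaluation of a
# degree-n polynomial of L may be amplified.
import numpy as np

def power_norm_trend(L, n):
    """Return [||L^1||_2, ..., ||L^n||_2] computed by repeated multiplication."""
    norms, Lk = [], np.eye(L.shape[0])
    for _ in range(n):
        Lk = Lk @ L
        norms.append(np.linalg.norm(Lk, 2))   # spectral norm of the current power
    return norms

# a random symmetric weighted graph standing in for the toy-example Laplacians
A = np.random.default_rng(1).random((50, 50))
A = (A + A.T) / 2
np.fill_diagonal(A, 0)
L = np.diag(A.sum(axis=1)) - A
print(power_norm_trend(L, 10))
```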

5. Conclusions and Future Work

Following a co-design approach, which has historically guided and influenced the evolution of algorithms for upcoming computing systems, now is the moment, at the dawn of the Exascale Computing Era, to invest once again in the development of new algorithms that meet the growing need for high levels of scalability and granularity.
In this context, methods based on hierarchical matrices (HMs) have been included among the most promising in the use of new computing resources precisely because of their strongly hierarchical nature.
This work aims to begin establishing the advantages and limitations of using HMs in operations such as the evaluation of matrix polynomials, which are crucial, for example, in the Graph Convolutional Deep Neural Network (GC-DNN) context. The presented tests show that the use of HMs still seems to be far from effective in complex contexts such as matrix polynomial evaluation in real applications. These difficulties seem to be related to some characteristics of the matrices, such as their sparsity pattern or their maximum values [58].
So, bearing in mind the idea of building a truly effective and efficient tool that can make the most of modern supercomputing architectures, our future work will focus on the following: (1) a theoretical study of the characteristics of the matrices that make them more suitable (in terms of error propagation and memory occupancy) for an HM representation in the context of the evaluation of matrix polynomials; (2) the definition of mitigation strategies for the issues that lead to error amplification, in order to achieve more stable algorithms [58]; techniques based on permutation, scaling, and/or normalization of the matrices could be considered; (3) the full implementation, in an HPC context, of the presented algorithms (e.g., Algorithms 6 and 8) and the evaluation of their performance; (4) the validation of the proposed approach in a real application (e.g., from the strategic GC-DNN context).
Although interest in hierarchical matrices is still high, even among those who develop mathematical software libraries (see, for example, the list available in [59]), and despite the high potential announced in works that, a decade ago, imagined the future of supercomputing [25], the current investment in the creation and use of parallel software libraries based on hierarchical matrices seems very limited (one can mention the HLIBpro library [60] or the experiments by the MAGMA developers [61,62]). We, therefore, hope that this work can be a stimulus to revive interest in such tools in advanced computing contexts.

Author Contributions

Conceptualization, L.C.; methodology, L.C.; software, L.C.; validation, L.C. and V.M.; formal analysis, L.C. and V.M.; writing—original draft preparation, L.C.; writing—review and editing, L.C. and V.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available on request.

Acknowledgments

Luisa Carracciuolo is a member of the “Gruppo Nazionale Calcolo Scientifico-Istituto Nazionale di Alta Matematica (GNCS-INdAM)”. This work was carried out using the computational resources available at the scientific datacenter of the University of Naples Federico II (Naples, Italy).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Higham, N. Functions of Matrices: Theory and Computation; Other Titles in Applied Mathematics, Society for Industrial and Applied Mathematics (SIAM): Philadelphia, PA, USA, 2008. [Google Scholar]
  2. Wang, B.; Kestelyn, X.; Kharazian, E.; Grolet, A. Application of Normal Form Theory to Power Systems: A Proof of Concept of a Novel Structure-Preserving Approach. In Proceedings of the 2024 IEEE Power & Energy Society General Meeting (PESGM), Seattle, WA, USA, 21–25 July 2024; pp. 1–5. [Google Scholar] [CrossRef]
  3. Schmelzer, T.; Trefethen, L.N. Evaluating matrix functions for exponential integrators via Carathéodory-Fejér approximation and contour integrals. Electron. Trans. Numer. Anal. 2007, 29, 1–18. [Google Scholar]
  4. Daigavane, A.; Ravindran, B.; Aggarwal, G. Understanding Convolutions on Graphs. Distill 2021. [Google Scholar] [CrossRef]
  5. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3844–3852. [Google Scholar]
  6. Carracciuolo, L.; Lapegna, M. Implementation of a non-linear solver on heterogeneous architectures. Concurr. Comput. Pract. Exp. 2018, 30, e4903. [Google Scholar] [CrossRef]
  7. Mele, V.; Constantinescu, E.M.; Carracciuolo, L.; D’Amore, L. A PETSc parallel-in-time solver based on MGRIT algorithm. Concurr. Comput. Pract. Exp. 2018, 30, e4928. [Google Scholar] [CrossRef]
  8. Carracciuolo, L.; Mele, V.; Szustak, L. About the granularity portability of block-based Krylov methods in heterogeneous computing environments. Concurr. Comput. Pract. Exp. 2021, 33, e6008. [Google Scholar] [CrossRef]
  9. Carracciuolo, L.; Casaburi, D.; D’Amore, L.; D’Avino, G.; Maffettone, P.; Murli, A. Computational simulations of 3D large-scale time-dependent viscoelastic flows in high performance computing environment. J.-Non-Newton. Fluid Mech. 2011, 166, 1382–1395. [Google Scholar] [CrossRef]
  10. Carracciuolo, L.; D’Amore, L.; Murli, A. Towards a parallel component for imaging in PETSc programming environment: A case study in 3-D echocardiography. Parallel Comput. 2006, 32, 67–83. [Google Scholar] [CrossRef]
  11. Murli, A.; D’Amore, L.; Carracciuolo, L.; Ceccarelli, M.; Antonelli, L. High performance edge-preserving regularization in 3D SPECT imaging. Parallel Comput. 2008, 34, 115–132. [Google Scholar] [CrossRef]
  12. D’Amore, L.; Constantinescu, E.; Carracciuolo, L. A Scalable Space-Time Domain Decomposition Approach for Solving Large Scale Nonlinear Regularized Inverse Ill Posed Problems in 4D Variational Data Assimilation. J. Sci. Comput. 2022, 91. [Google Scholar] [CrossRef]
  13. Carracciuolo, L.; D’Amora, U. Mathematical Tools for Simulation of 3D Bioprinting Processes on High-Performance Computing Resources: The State of the Art. Appl. Sci. 2024, 14, 6110. [Google Scholar] [CrossRef]
  14. Reed, D.A.; Dongarra, J. Exascale Computing and Big Data. Commun. ACM 2015, 58, 56–68. [Google Scholar] [CrossRef]
  15. Petitet, A.; Whaley, R.C.; Dongarra, J.; Cleary, A. A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers. Innovative Computing Laboratory. 2000. Available online: https://icl.utk.edu/hpl/index.html (accessed on 1 April 2025).
  16. Top 500—The List. Available online: https://www.top500.org/ (accessed on 1 April 2025).
  17. Top 500 List—June 2022. Available online: https://www.top500.org/lists/top500/2022/06/ (accessed on 1 April 2025).
  18. Geist, A.; Lucas, R. Major Computer Science Challenges At Exascale. Int. J. High Perform. Comput. Appl. 2009, 23, 427–436. [Google Scholar] [CrossRef]
  19. Chen, W. The demands and challenges of exascale computing: An interview with Zuoning Chen. Natl. Sci. Rev. 2016, 3, 64–67. [Google Scholar] [CrossRef]
  20. Kumar, V.; Gupta, A. Analyzing Scalability of Parallel Algorithms and Architectures. J. Parallel Distrib. Comput. 1994, 22, 379–391. [Google Scholar] [CrossRef]
  21. Kwiatkowski, J. Evaluation of Parallel Programs by Measurement of Its Granularity. In Proceedings of the Parallel Processing and Applied Mathematics; Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2002; pp. 145–153. [Google Scholar] [CrossRef]
  22. Ewart, T.; Cremonesi, F.; Schürmann, F.; Delalondre, F. Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function ex. ACM Trans. Math. Softw. 2020, 46, 28. [Google Scholar] [CrossRef]
  23. Munro, I.; Paterson, M. Optimal algorithms for parallel polynomial evaluation. In Proceedings of the 12th Annual Symposium on Switching and Automata Theory (SWAT 1971), East Lansing, MI, USA, 13–15 October 1971; pp. 132–139. [Google Scholar] [CrossRef]
  24. Keyes, D.E. Exaflop/s: The why and the how. Comptes Rendus. MÉcanique 2011, 339, 70–77. [Google Scholar] [CrossRef]
  25. Ang, J.; Evans, K.; Geist, A.; Heroux, M.; Hovland, P.D.; Marques, O.; Curfman McInnes, L.; Ng, E.G.; Wild, S.M. Report on the Workshop on Extreme-Scale Solvers: Transition to Future Architectures. Report, U.S. Department of Energy, ASCR, 2012. Available online: https://science.osti.gov/-/media/ascr/pdf/program-documents/docs/reportExtremeScaleSolvers2012.pdf (accessed on 1 April 2025).
  26. Greengard, L.; Rokhlin, V. A fast algorithm for particle simulations. J. Comput. Phys. 1987, 73, 325–348. [Google Scholar] [CrossRef]
  27. Martinsson, P.G. Fast Multipole Methods. In Encyclopedia of Applied and Computational Mathematics; Springer: Berlin/Heidelberg, Germany, 2015; pp. 498–508. [Google Scholar] [CrossRef]
  28. Cipra, B.A. The Best of the 20th Century: Editors Name Top 10 Algorithms. SIAM News 2000, 33, 1–2. [Google Scholar]
  29. Beatson, R.; Greengard, L. A Short Course on Fast Multipole Methods, 2001. Available online: https://math.nyu.edu/~greengar/shortcourse_fmm.pdf (accessed on 1 April 2025).
  30. Fenn, M.; Steidl, G. FMM and H-Matrices: A Short Introduction to the Basic Idea. Technical Report, Department for Mathematics and Computer Science, University of Mannheim, 2002. Available online: https://madoc.bib.uni-mannheim.de/744/ (accessed on 1 April 2025).
  31. Hackbusch, W.; Grasedyck, L.; Börm, S. An introduction to hierarchical matrices. Math. Bohem. 2002, 127, 229–241. [Google Scholar] [CrossRef]
  32. Hackbusch, W. Hierarchical Matrices: Algorithms and Analysis, 1st ed.; Springer Series in Computational Mathematics; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar] [CrossRef]
  33. Borm, S.; Grasedyck, L.; Hackbusch, W. Hierarchical Matrices. Technical Report, Max-Planck-Institut für Mathematik, 2003. Report Number: Lecture Notes 21/2003. Available online: https://www.mis.mpg.de/publications/preprint-repository/lecture_note/2003/issue-21 (accessed on 1 April 2025).
  34. Yokota, R.; Turkiyyah, G.; Keyes, D. Communication Complexity of the Fast Multipole Method and its Algebraic Variants. Supercomput. Front. Innov. 2014, 1, 63–84. [Google Scholar] [CrossRef]
  35. Saperas-Riera, J.; Mateu-Figueras, G.; Martín-Fernández, J.A. Lasso regression method for a compositional covariate regularised by the norm L1 pairwise logratio. J. Geochem. Explor. 2023, 255, 107327. [Google Scholar] [CrossRef]
  36. Serajian, M.; Marini, S.; Alanko, J.N.; Noyes, N.R.; Prosperi, M.; Boucher, C. Scalable de novo classification of antibiotic resistance of Mycobacterium tuberculosis. Bioinformatics 2024, 40, i39–i47. [Google Scholar] [CrossRef] [PubMed]
  37. Lim, H. Low-rank learning for feature selection in multi-label classification. Pattern Recognit. Lett. 2023, 172, 106–112. [Google Scholar] [CrossRef]
  38. Golub, G.H.; Van Loan, C.F. Matrix Computations, 3rd ed.; The Johns Hopkins University Press: Baltimore, ML, USA, 1996. [Google Scholar]
  39. Weisstein, E.W. Binary Tree. From MathWorld—A Wolfram Web Resource, 2024. Available online: https://mathworld.wolfram.com/BinaryTree.html (accessed on 1 April 2025).
  40. Grasedyck, L.; Hackbusch, W. Construction and Arithmetics of H-Matrices. Computing 2003, 70, 295–334. [Google Scholar] [CrossRef]
  41. Graphical Representation of Sparse Matrices. Available online: https://www.mathworks.com/help/matlab/math/graphical-representation-of-sparse-matrices.html (accessed on 1 April 2025).
  42. E20R5000: Driven Cavity, 20×20 Elements, Re = 5000. Available online: https://math.nist.gov/MatrixMarket/data/SPARSKIT/drivcav/e20r5000.html (accessed on 1 April 2025).
  43. BCSSTK24: BCS Structural Engineering Matrices (Eigenvalue Problems) Calgary Olympic Saddledome Arena. Available online: https://math.nist.gov/MatrixMarket/data/Harwell-Boeing/bcsstruc3/bcsstk24.html (accessed on 1 April 2025).
  44. The OpenMP API Specification for Parallel Programming. Available online: https://www.openmp.org/specifications/ (accessed on 1 April 2025).
  45. Lawson, C.L.; Hanson, R.J.; Kincaid, D.R.; Krogh, F.T. Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. Math. Softw. 1979, 5, 308–323. [Google Scholar] [CrossRef]
  46. BLAS and Sparse BLAS Routines of the Intel Math Kernel Library. Available online: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-1/blas-and-sparse-blas-routines.html (accessed on 1 April 2025).
  47. Basic Linear Algebra on NVIDIA GPUs. Available online: https://developer.nvidia.com/cublas (accessed on 1 April 2025).
  48. Paterson, M.S.; Stockmeyer, L.J. On the Number of Nonscalar Multiplications Necessary to Evaluate Polynomials. SIAM J. Comput. 1973, 2, 60–66. [Google Scholar] [CrossRef]
  49. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  50. Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 11. [Google Scholar] [CrossRef]
  51. von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 2007, 17, 395–416. [Google Scholar] [CrossRef]
  52. Chen, X. Understanding Spectral Graph Neural Network. arXiv 2020, arXiv:2012.06660. [Google Scholar]
  53. Nikolaos, K. Spectral Graph Theory and Deep Learning on Graphs. Master’s Thesis, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece, 2017. [Google Scholar] [CrossRef]
  54. Sandryhaila, A.; Moura, J.M.F. Discrete Signal Processing on Graphs. IEEE Trans. Signal Process. 2013, 61, 1644–1656. [Google Scholar] [CrossRef]
  55. Sandryhaila, A.; Moura, J.M.F. Discrete signal processing on graphs: Graph fourier transform. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6167–6170. [Google Scholar] [CrossRef]
  56. Shuman, D.I.; Narang, S.K.; Frossard, P.; Ortega, A.; Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 2013, 30, 83–98. [Google Scholar] [CrossRef]
  57. Wobrock, D. Image Processing Using Graph Laplacian Operator. Master’s Thesis, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden, 2019. Available online: https://github.com/David-Wobrock/master-thesis-writing/blob/master/master_thesis_david_wobrock.pdf (accessed on 1 April 2025).
  58. Higham, N.J. Accuracy and Stability of Numerical Algorithms, 2nd ed.; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2002. [Google Scholar]
  59. Curated List of Hierarchical Matrices. Available online: https://github.com/gchavez2/awesome_hierarchical_matrices (accessed on 1 April 2025).
  60. HLIBpro: Is a Software Library Implementing Algorithms for Hierarchical Matrices. Available online: https://www.hlibpro.com/ (accessed on 1 April 2025).
  61. The Matrix Algebra on GPU and Multicore Architecture (MAGMA) Library Website. Available online: http://icl.cs.utk.edu/magma/ (accessed on 1 April 2025).
  62. Yamazaki, I.; Abdelfattah, A.; Ida, A.; Ohshima, S.; Tomov, S.; Yokota, R.; Dongarra, J. Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU Clusters. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, BC, Canada, 1 May 2018; Available online: https://icl.utk.edu/files/publications/2018/icl-utk-1049-2018.pdf (accessed on 1 April 2025).
Figure 1. The hierarchical architecture of modern HPC systems. Several processing units (core/CPU) are aggregated in a CPU/node and share some memory devices. Access to remote memory devices on other nodes is performed due to an interconnection network. Memory is organized into levels where access speed is inversely proportional to memory size and directly proportional to the memory distance from the processing units. Accelerators are a very particular type of processing unit with thousands of cores. Credit: Carracciuolo et al. [13].
Figure 8. Example of the execution tree of Algorithm 6. (a) Block structure of the $\mathcal{H}$-matrix $A$. (b) Execution tree of Algorithm 6: (b.1) execution on the first subtree; (b.2) execution on the third subtree; (b.3) execution on the fourth subtree.
Figure 9. Example of parallel implementation of Algorithm 7 (and then of Algorithm 8) by using $N_{Tasks} = 4$ and $r = 7$.
Figure 10. An image for GC-DNN toy examples.
Figure 11. Sparsity representation of the Laplacian matrices for GC-DNN toy examples. (a) Example 1. (b) Example 2.
Figure 12. Trends, as a function of the power index $i$, of the L2 norm of the Laplacian matrices' powers $(L^{*})^i$ for both GC-DNN toy examples.
Table 1. Hardware and software specs of computing resources used for tests.
Processor type: Intel Xeon Gold 6240R CPU @ 2.40 GHz
Number of cores: 48
OS: Linux CentOS 7
MATLAB version: R2022a Update 2 (9.12.0.1956245) 64-bit
Table 2. Polynomial evaluation test results: list of the errors $E_{A_{CS\#1,2,3}, n, s}$ as a function of the polynomial degree $n$.
n | $E_{A_{CS\#1}, n, s}$ | $E_{A_{CS\#2}, n, s}$ | $E_{A_{CS\#3}, n, s}$
10 | $3.454 \times 10^{-8}$ | $3.688 \times 10^{-8}$ | $7.153 \times 10^{-8}$
14 | $7.092 \times 10^{-8}$ | $1.690 \times 10^{-7}$ | $6.179 \times 10^{-7}$
18 | $1.019 \times 10^{-7}$ | $2.226 \times 10^{-6}$ | $2.344 \times 10^{-6}$
22 | $1.394 \times 10^{-7}$ | $1.055 \times 10^{-5}$ | $1.383 \times 10^{-5}$
Table 3. Results from tests on GC-DNN toy examples.
Example 1
n = 6:
s | $E_{L^{SGAF}, n, s}$ | $E^{s}_{L^{SGAF}}$
3 | $1.754 \times 10^{-15}$ | $1.655 \times 10^{-11}$
4 | $1.416 \times 10^{-15}$ | $1.428 \times 10^{-11}$
5 | $1.403 \times 10^{-15}$ | $1.180 \times 10^{-11}$
n = 8:
s | $E_{L^{SGAF}, n, s}$ | $E^{s}_{L^{SGAF}}$
3 | $1.987 \times 10^{-6}$ | $8.051 \times 10^{-3}$
4 | $1.987 \times 10^{-6}$ | $8.057 \times 10^{-3}$
5 | $1.987 \times 10^{-6}$ | $8.051 \times 10^{-3}$
n = 10:
s | $E_{L^{SGAF}, n, s}$ | $E^{s}_{L^{SGAF}}$
3 | $3.303 \times 10^{-5}$ | $1.588 \times 10^{-2}$
4 | $4.817 \times 10^{-6}$ | $2.278 \times 10^{-3}$
5 | $1.484 \times 10^{-4}$ | $8.376 \times 10^{-1}$
Example 2
n = 6:
s | $E_{L^{SPGAF}, n, s}$ | $E^{s}_{L^{SPGAF}}$
3 | $4.795 \times 10^{-8}$ | $1.347 \times 10^{-5}$
4 | $4.747 \times 10^{-8}$ | $1.368 \times 10^{-5}$
5 | $6.289 \times 10^{-8}$ | $1.411 \times 10^{-5}$
n = 8:
s | $E_{L^{SPGAF}, n, s}$ | $E^{s}_{L^{SPGAF}}$
3 | $1.517 \times 10^{-5}$ | $3.148 \times 10^{-3}$
4 | $2.091 \times 10^{-5}$ | $3.831 \times 10^{-3}$
5 | $1.776 \times 10^{-5}$ | $3.144 \times 10^{-3}$
n = 10:
s | $E_{L^{SPGAF}, n, s}$ | $E^{s}_{L^{SPGAF}}$
3 | $5.320 \times 10^{-2}$ | $9.671 \times 10^{0}$
4 | $5.686 \times 10^{-2}$ | $1.109 \times 10^{1}$
5 | $7.158 \times 10^{-2}$ | $1.179 \times 10^{1}$
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
