Abstract
Advancements in computing platform deployment have acted as both push and pull elements for the advancement of engineering design and scientific knowledge. Historically, improvements in computing platforms were mostly dependent on simultaneous developments in hardware, software, architecture, and algorithms (a process known as co-design), which raised the performance of computational models. However, there are many obstacles to the effective use of the sophisticated computing platforms of the Exascale Computing Era. These include, but are not limited to, the effective exploitation of massive parallelism and the high complexity of programming such heterogeneous computing facilities. So, now is the time to create new algorithms that are more resilient and energy-aware and that address the demand for increased data locality and achieve much higher concurrency through high levels of scalability and granularity. In this context, some methods, such as those based on hierarchical matrices (HMs), have been identified as among the most promising for the use of new computing resources precisely because of their strongly hierarchical nature. This work aims to begin assessing the advantages, and the limits, of the use of HMs in operations such as the evaluation of matrix polynomials, which are crucial, for example, in a Graph Convolutional Deep Neural Network (GC-DNN) context. A case study from the GC-DNN context provides some insights into the effectiveness, in terms of accuracy, of the employment of HMs.
Keywords:
matrix polynomials; hierarchical matrices; high-performance computing; exascale computing; graph convolutional deep neural network
MSC:
65F50; 65Y05; 65F55; 65F60
1. Introduction
Matrix polynomials play a relevant role in many application areas. For example, many methods for computing a matrix function f(A), where f is a scalar function, A is a square matrix, and f(A) is a matrix of the same dimensions as A, require the evaluation of a matrix polynomial. Indeed, under some assumptions about the eigenvalues of the matrix A, a matrix function can be expressed by the following Taylor series (see Theorem 4.7 in [1]):
and hence approximated by a matrix polynomial in the form (see Theorem 4.8 in [1])
where is the identity matrix and where . Many applications can be re-formulated using matrix functions of the type and then approximated by a matrix polynomial such as (2). As examples, we can list the following ones:
- Differential equations offer a wealth of problems based on . Indeed, many semi-discretized Partial Differential Equations (PDEs) (for example, see the applications related to the computational simulation of Power Systems [2]) can be (re)formulated based on the following expression, where contains the nonlinear terms and is a spatially discretized linear operator [1]. A large class of techniques known as exponential integrators treats the linear term of (3) exactly and uses an explicit methodology to numerically integrate the remaining portion of the solution. The characteristic of exponential integrators [3] is that they employ the exponential function of the differential equation’s Jacobian or an approximation of it. Since the late 1990s, they have garnered increased attention, primarily because of developments in numerical linear algebra that enabled the effective use of these methods [1]. The Exponential Time Differencing (ETD) Euler method is a basic illustration of an exponential integrator. ETD enables computing an approximation of the solution y of (3), in each of the N instants into which the interval is subdivided, by means of the following iterative formula, where the functions and are defined as follows
- In Markov models, which are utilized in many different fields (from sociology to statistics and finance), a matrix function related to matrix exponential is crucial. Indeed, consider a time-homogeneous continuous-time Markov process in which individuals move among n states. The entry of the process’s transition probability matrix represents the likelihood that an individual who begins in state i at time 0 will be in state j at time t (where ). The transition intensity matrix is associated with the process and is connected to by
A more exhaustive list of examples of matrix functions that could benefit from polynomial approximations can be found in Higham [1].
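As a concrete illustration of an approximation of the form (2), the following Python/NumPy sketch (not part of the original study; the test matrix and the truncation degree are arbitrary choices) evaluates the truncated Taylor series of the matrix exponential as a matrix polynomial and compares it with SciPy's reference implementation.

```python
import numpy as np
from scipy.linalg import expm

def taylor_matrix_exp(A, m):
    """Approximate exp(A) by the degree-m Taylor matrix polynomial
    I + A + A^2/2! + ... + A^m/m!."""
    n = A.shape[0]
    P = np.eye(n)          # running approximation
    T = np.eye(n)          # running term A^k / k!
    for k in range(1, m + 1):
        T = T @ A / k
        P = P + T
    return P

A = 0.3 * np.array([[0.0, 1.0], [-1.0, 0.0]])   # a small test matrix
approx = taylor_matrix_exp(A, 10)
print(np.linalg.norm(approx - expm(A)))          # truncation error
```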
In addition to the examples of matrix polynomial use mentioned above, we call attention to another example of an application using matrix polynomial evaluation that is related to the polynomials of the Graph Laplacian defined in the context of Graph Convolutional Deep Neural Networks [4,5] (see Section 4 for details).
Progress in the deployment of computing platforms has constituted both pull and push factors for the advancement of scientific knowledge and engineering design [6,7,8,9,10,11,12,13].
Historically, the high-performance computing (HPC) systems era started in the 1980s, when vector supercomputing dominated high-performance computing. Today, HPC systems are clusters composed of hundreds of nodes and millions of processors/cores, where each node is enriched with computational accelerators in the form of coprocessors, such as General-Purpose Graphical Processing Units (GP-GPUs), and where nodes communicate through high-speed, low-latency interconnects (such as InfiniBand) (see Figure 1 for a representation of modern HPC systems) [14].
Figure 1.
The hierarchical architecture of modern HPC systems. Several processing units (cores/CPUs) are aggregated in a CPU/node and share some memory devices. Access to remote memory devices on other nodes is performed through an interconnection network. Memory is organized into levels, where access speed decreases as memory size and distance from the processing units increase. Accelerators are a very particular type of processing unit with thousands of cores. Credit: Carracciuolo et al. [13].
The results of the popular high-performance LINPACK (HPL) benchmark [15], which is used to create the top 500 ranking of the fastest computers in the world [16], outline the exponential growth in advanced computing capabilities. Such a list demonstrates that the exascale (10^18 operations per second) Era is the current state of affairs in the long performance improvement journey that has lasted over fifty years. In fact, the US Oak Ridge National Laboratory’s Frontier supercomputer broke the glass ceiling of the exascale barrier in 2022 [17]. Historically, improvements in computing platforms were mostly dependent on simultaneous developments in hardware, software, architecture, and algorithms (a process known as co-design), which raised the performance of computational models. However, there are many obstacles to the effective use of the sophisticated computing platforms of the Exascale Computing Era. These include, but are not limited to, the effective exploitation of massive parallelism and the high complexity of programming such heterogeneous computing facilities.
Research challenges in creating advanced computing systems for exascale objectives were identified by a number of studies [14,18,19].
From the perspective of a computational scientist, special relevance is assigned to the difficulties concerning new mathematical models and algorithms that can guarantee high degrees of scalability [20] and granularity [21], providing answers to the demand of increasing data locality and achieving much higher levels of concurrency. Thus, the time has come to create new algorithms that are more resilient, energy-conscious, and have fewer requirements for synchronization and communication.
To the best of our knowledge, the most recent analysis of algorithms for polynomial evaluation on modern high-performance computing architectures can be found in [22]. The aim of that article was to evaluate the existing methods of polynomial evaluation on superscalar architectures, applying those methods to the computation of elementary functions such as e^x. The long story of parallel algorithms for polynomial evaluation starts with the work of Munro et al. [23], who first investigated parallel algorithms able to overcome the sequential nature of Horner’s algorithm [1]. Parallel polynomial evaluation has been the subject of much research over the past 50 years, as also described in the survey contained in Ewart et al. [22], demonstrating the high level of interest in these algorithms.
In consideration that “… in pursuit of synchronization reduction, ‘additive operator’ versions of sequentially ‘multiplicative operator’ algorithms are often available …” [24], this work intends to explore new paths for matrix polynomial evaluation that are more suitable for exascale computing systems, where synchronization points could be very demanding and where the heterogeneity and the hierarchical architecture of such systems call for more appropriate algorithms for operations involving matrices.
Some methods for polynomial evaluation, such as the well-known Horner method, suffer from a sequential nature and a high number of synchronization points or a low degree of parallelism. Furthermore, when matrix polynomials are based on sparse matrices, all the operations involving such matrices require constantly moving the sparse matrix entries from the main memory into the CPU, without register or cache reuse. Therefore, methods that demand less memory bandwidth need to be researched. To this end, there are promising techniques, such as coefficient-matrix-free representations that avoid storing sparse matrices, or hierarchical matrices [25].
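To make the sequential nature of Horner’s scheme concrete, the following Python/NumPy sketch (illustrative only, with arbitrary data) evaluates a matrix polynomial by Horner’s rule: every iteration needs the complete result of the previous one, so the matrix products cannot be overlapped.

```python
import numpy as np

def horner_matrix_poly(A, c):
    """Evaluate p(A) = c[0] I + c[1] A + ... + c[n] A^n by Horner's rule.
    Each update depends on the previous P, so the loop is inherently
    sequential (one synchronization point per degree)."""
    I = np.eye(A.shape[0])
    P = c[-1] * I
    for ck in reversed(c[:-1]):
        P = P @ A + ck * I
    return P

A = np.random.rand(100, 100) / 100.0
c = np.random.rand(6)                  # a degree-5 polynomial
print(horner_matrix_poly(A, c).shape)
```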
In the context just described, this paper aims to
- introduce some definitions and algorithms related to the concept of a hierarchical matrix;
- introduce an algorithm for matrix polynomial evaluation that has a good degree of parallelism;
- describe how such an algorithm can be combined with a hierarchical representation of a sparse matrix;
- highlight the benefits and limitations of using hierarchical matrices in the evaluation of matrix polynomials based on some case studies [4,5] (see Section 4 for details). The considered case studies are related to the matrix polynomials used in the context of Graph Convolutional Deep Neural Networks that in recent years have gained great importance in many areas, such as medicine, image analysis, and speech recognition, to generate data-driven models;
- provide some preliminary indications about the parallelization of the proposed algorithms.
The paper is structured as follows: in Section 2, the hierarchical matrices (HMs) are introduced and some algorithms of interest from basic matrix algebra that use HM formulations are described; Section 3 describes a strategy for the matrix polynomial evaluation based on the use of HMs that should be more suitable for exascale computing systems due to a reduced number of synchronization points and the hierarchical structure of HM-based operations; in Section 4, some results related to the use of the algorithms introduced in Section 2 and Section 3 are described to assess some advantages, and limits, of the use of HMs in a case study from the GCNN context. The study is summarized in Section 5, which also provides some information about our future work.
2. Fast Multipole Methods and Hierarchical Matrices
Fast Multipole Methods (FMMs), created by Rokhlin Jr. and Greengard [26], were named among the top 10 algorithms of the 20th century. They were first described in the context of particle simulations, where they reduce the computational cost of evaluating all pairwise interactions in a system of N particles from O(N^2) to O(N log N) or even O(N) operations. In such a context, if we denote the locations of a set of N electrical charges and their source strengths, the aim of a particle simulation is to evaluate the potentials
where is the interaction potential of electrostatics. The computation of all the values of vector can then be expressed as the following matrix–vector operation
where the vector and the matrix are, respectively, defined as [27]
Considering that the kernel is smooth when and are not close, then, if I and J are two index subsets of that represent two subsets of and of distant points, the sub-matrix admits an approximate rank-P factorization of the kind
where the factors have, respectively, and dimensions [27]. Then, every matrix–vector product involving the sub-matrix can be executed by just operations. In problem (9), no single relation like (10) can hold for all combinations of target and source points. The domain can then be considered cut into pieces, approximations such as (10) can be used to evaluate interactions between distant pieces, and direct evaluation is used only for points that are close. Equivalently, one could say that the matrix–vector product (9) can be computed by exploiting rank deficiencies in the off-diagonal blocks of [27].
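The saving granted by a factorization like (10) can be shown directly in code: the following Python/NumPy sketch (sizes and rank are arbitrary choices) applies an admissible block to a vector both in dense form and through its low-rank factors; the factorized form needs only about (m + n)P multiplications instead of m·n.

```python
import numpy as np

m, n, P = 2000, 1500, 12
U = np.random.rand(m, P)      # factors of an admissible block: block ≈ U @ V.T
V = np.random.rand(n, P)
x = np.random.rand(n)

y_dense   = (U @ V.T) @ x     # m*n multiplications
y_lowrank = U @ (V.T @ x)     # about (m + n)*P multiplications

print(np.allclose(y_dense, y_lowrank))   # same result, far less work
```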
Throughout its history, FMM has laid the groundwork for further practical uses, such as the fast multiplication of vectors by fully populated special matrices [27,28,29]. In linear algebra, FMM is known as fast multiplication by -matrices, which is a combination of the Panel Clustering Method and the mosaic skeleton matrix approaches [13,30,31,32,33].
Paraphrasing Yokota et al. [34], an algorithm based on such matrices “appears to be an ideal algorithm for hybrid hierarchical distributed-shared memory architectures targeting billion-thread concurrency, which is the dominant design for contemporary extreme computing resources”.
The use of -matrices reduces the asymptotic computational complexity of a matrix–vector product to an O(N) or O(N log N) one through a hierarchy of tree-based operations, by distinguishing between near interactions that must be treated directly and far interactions that are recursively coarsened.
The principal feature of algorithms based on -matrices is the recursive generation of a self-similar collection of problems. The -matrix format is suitable for a variety of operations, including matrix–vector multiplication, matrix inversion, matrix–matrix multiplication, matrix–matrix addition, etc.
2.1. Reduced-Rank Matrix Definitions and Usage
First, it is essential to present the concept of a reduced-rank matrix, which finds wide application in many contexts. In recent years, it has been used in machine learning processes such as supervised classification: consider, as an example, the applications in which the Least Absolute Shrinkage and Selection Operator (LASSO) [35] method is used for feature selection [36], and which benefit from the reduced-rank approximation of the matrices that represent the weight that the single features have in the classification process [37].
Definition 1
(-matrix). Let be a matrix of the form
Then, is called an -matrix. Any matrix of rank at most k can be represented as an -matrix, and each -matrix has at most rank k.
Different approaches can be used to compute an -matrix approximation of an arbitrary matrix M [30]. As an example, we cite the one based on the truncated singular value decomposition (TSVD). TSVD represents the best approximation of an arbitrary matrix in the spectral and Frobenius norms, and then Proposition 1 is valid (see [31,32] for details).
Proposition 1
(Best approximation by a low-rank matrix). Suppose that the matrix has the singular value decomposition (SVD) [38] . Then, and are orthogonal, and Σ is a diagonal matrix whose diagonal elements are the singular values of , where .
Then, both the following two minimization problems
where
are solved by the same matrix such that
where and denote, respectively, the L2 and Frobenius norms.
Matrix can also be expressed as its r-reduced SVD:
where are matrices built on the first r rows of and , respectively, and is the diagonal matrix built on the top-left block of the matrix Σ. Or, equivalently,
where and or and .
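A minimal Python/NumPy sketch of Proposition 1 (illustrative, with arbitrary sizes): the best rank-r approximation is read off the truncated SVD, and the 2-norm approximation error equals the first discarded singular value.

```python
import numpy as np

def best_rank_r(M, r):
    """Best rank-r approximation of M in the 2- and Frobenius norms,
    obtained from the truncated SVD (Proposition 1)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :], s

M = np.random.rand(60, 40)
R, s = best_rank_r(M, 5)
# The 2-norm error equals the first discarded singular value.
print(np.linalg.norm(M - R, 2), s[5])
```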
Proposition 2
(Best approximation by a rank-r matrix of a rank-s matrix , .). Let be a rank-s matrix; then,
where . The following rank-r matrix , expressed as
and then defined by its r-reduced SVD, is the best rank-r approximation of the rank-s matrix . The matrices , , and are defined, given the QR decomposition (let be a matrix such that . Then, an orthogonal matrix and a square upper triangular matrix exist with . Such a factorization of is called a reduced decomposition of matrix [32,38].) of matrices and ; that is
and on the basis of the SVD decomposition of ,
since
Definition 2
(Addition and Formatted Addition of two reduced-rank matrices). Let , be two reduced-rank matrices, respectively, of ranks and . The sum
is a reduced-rank matrix with rank . We define the formatted addition of two reduced-rank matrices as the best approximation of the sum in the -matrix set. A rank-K approximation of the -matrix can be computed through the steps described in Proposition 2.
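The following Python/NumPy sketch (an illustration under the notation above, not the authors' implementation) performs such a formatted addition: the factors of the two operands are stacked, and the exact sum, whose rank is at most the sum of the two ranks, is recompressed to the best rank-r approximation with the QR-plus-SVD steps of Proposition 2.

```python
import numpy as np

def formatted_add(A1, B1, A2, B2, r):
    """Formatted addition of M1 = A1 @ B1.T and M2 = A2 @ B2.T:
    the exact sum is kept in factored form and recompressed to rank r."""
    A = np.hstack([A1, A2])            # exact sum in factored form
    B = np.hstack([B1, B2])
    QA, RA = np.linalg.qr(A)           # reduced QR of the stacked factors
    QB, RB = np.linalg.qr(B)
    U, s, Vt = np.linalg.svd(RA @ RB.T)
    Ar = QA @ U[:, :r] * s[:r]         # new rank-r factors
    Br = QB @ Vt[:r, :].T
    return Ar, Br

n, k1, k2, r = 200, 4, 3, 5
A1, B1 = np.random.rand(n, k1), np.random.rand(n, k1)
A2, B2 = np.random.rand(n, k2), np.random.rand(n, k2)
Ar, Br = formatted_add(A1, B1, A2, B2, r)
exact = A1 @ B1.T + A2 @ B2.T
print(np.linalg.norm(exact - Ar @ Br.T, 2))   # error of the recompression
```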
Proposition 3
(CKD Representation of an -matrix ). Any -matrix factorizable as
can also be expressed as
and vice versa [32].
Proposition 4
(Multiplication of two -matrices). Let , be two -matrices. The multiplication
is an -matrix.
Proposition 5
(Multiplication of an -matrix by an arbitrary matrix from the right (or left)). Let be an -matrix and an arbitrary matrix. The multiplication
is an -matrix.
2.2. Hierarchical Matrices’ Definitions and Usage
According to [30,31], let us introduce hierarchical matrices (also called -matrices).
Definition 3
(Block cluster quad-tree of a matrix ). Let be a binary tree [39] with the levels and denote by the set of its nodes. is called a binary cluster tree corresponding to an index set I if the following conditions hold [13]:
- 1.
- each node of is a subset of the index set I;
- 2.
- I is the root of (i.e., the node at the 0-th level of );
- 3.
- if is a leaf (i.e., a node with no sons), then ;
- 4.
- if is not a leaf whose set of sons is represented by , then and .
Let I be an index set and let be a logical value representing an admissibility condition on . Moreover, let be a binary cluster tree on the index set I. The block cluster quad-tree corresponding to and to the admissibility condition can be built by the procedure represented in Algorithm 1 [13].
Definition 4
(-matrix of blockwise rank k). Let be a matrix, let be the index set of , and let . Let us assume that, for a matrix and subsets , the notation represents the block . Moreover, let be the block cluster quad-tree on the index set I whose admissibility condition is defined as
Then, the matrix is called -matrix of blockwise rank k defined on block cluster quad-tree .
Let us remember [38] that, given a matrix , the matrix is said to be an approximation of in a specified norm if there exists such that .
Algorithm 1 Procedure for building the block quad-tree corresponding to a cluster tree and an admissibility condition . The index set and the value l to be used in the first call to the recursive BlockClusterQuadTree procedure are such that [13].
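A minimal Python sketch of a recursive block cluster quad-tree construction in the spirit of Algorithm 1; the admissibility test used here (two contiguous index ranges are admissible when their separation is at least their diameter) and the leaf size are illustrative assumptions, not the condition used in the paper.

```python
def block_cluster_tree(I, J, leaf_size=16):
    """Recursively build a block cluster quad-tree over the index block
    I x J, where I and J are contiguous ranges given as (start, stop)."""
    (i0, i1), (j0, j1) = I, J

    def admissible(I, J):
        # hypothetical geometric-style criterion: separation >= diameter
        diam = max(I[1] - I[0], J[1] - J[0])
        dist = max(J[0] - I[1], I[0] - J[1], 0)
        return dist >= diam

    if admissible(I, J) or min(i1 - i0, j1 - j0) <= leaf_size:
        return {"I": I, "J": J, "leaf": True, "admissible": admissible(I, J)}

    im, jm = (i0 + i1) // 2, (j0 + j1) // 2          # split both clusters
    sons = [block_cluster_tree(si, sj, leaf_size)    # four sons: a quad-tree
            for si in ((i0, im), (im, i1))
            for sj in ((j0, jm), (jm, j1))]
    return {"I": I, "J": J, "leaf": False, "sons": sons}

tree = block_cluster_tree((0, 256), (0, 256))
```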
As examples of the use of the -matrices to reduce the complexity of effective algorithms, we propose four algorithms that should be useful for the aim of this work. The first two algorithms of the list below are not new and are already present in the literature (for example, see Hackbusch et al. [31]); the third one is new; and the fourth is a revisitation/simplification of an algorithm presented elsewhere in the literature [31].
- the computation of the formatted matrix addition of the -matrices , respectively, of blockwise ranks , , and (see Algorithm 2);
- the computation of the matrix–vector product (see Algorithm 3);
- the computation of the scaled matrix of the -matrix by a scalar value (see Algorithm 4);
- the computation of the matrix–matrix formatted product of the -matrices , respectively, of blockwise ranks , , and (see Algorithm 5). The algorithm is a simplified version of a more general one used for computing , where are, respectively, -matrices of blockwise ranks , , and that are defined on block cluster quad-trees , , and . See Hackbusch et al. [31] for details about the general algorithm. The simplification is applicable due to the following assumptions [40]:
- –
- The matrices , , and are square:
- –
- For the block cluster quad-trees , , and , the following equations hold, where is the so-called product of block cluster quad-trees defined as in [40] using its root and the description of the set of sons of each node. In particular,
- ∗
- the root of is
- ∗
- let be a node at the l-th level of , and the set of sons of is defined by
- Equation (26) expresses the condition that the block cluster quad-tree is almost idempotent. According to [40], such a condition can be expressed as follows. Let be a node of , and let us define the quantities and , where represents the set of all the leaves of . is said to be almost idempotent if (respectively, idempotent if ).
- –
- According to Lemma 2.19 of [40], for the product of two -matrices and , for which conditions (24)–(26) are valid, the following statement holds. For each leaf in the set of all the leaves of at the l-th level of , let be the set defined as follows, where and denote, respectively, the set of nodes of at the l-th level of and the father of a node . Then, for each leaf , where , the following equation is valid
- –
- According to Theorem 2.24 in [40], for the rank of matrix , we have the following bound, where is called the sparsity constant and is defined as
Other algorithms of interest from basic matrix algebra that use -matrices are described in Hackbusch et al. [31].
Algorithm 2 Formatted matrix addition of the -matrices , respectively, of blockwise ranks , , and (defined on block cluster quad-tree ). The index sets , to be used in the first call to the recursive HMatrix-MSum procedure are such that [13].
Algorithm 3 Matrix–vector multiplication of the -matrix of blockwise rank k (defined on block cluster quad-tree ) with vector . The index sets to be used in the first call to the recursive HMatrix-MVM procedure are such that [13].
Algorithm 4 Computation of the scaled matrix of the -matrix of blockwise rank (defined on block cluster quad-tree ) by a scalar value . The index sets , to be used in the first call to the recursive HMatrix-MScale procedure are such that .
Algorithm 5 Matrix–matrix formatted multiplication of the -matrices , respectively, of blockwise ranks , , and (defined on block cluster quad-tree ). The index sets , to be used in the first call to the recursive HMatrix-MMMult procedure are such that .
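The following Python/NumPy sketch (illustrative only; it reuses block_cluster_tree from the earlier sketch and a synthetic smooth kernel as test matrix) shows the two ingredients these algorithms rely on: admissible leaves are stored through truncated-SVD factors, and the matrix–vector product of Algorithm 3 is obtained by recursing on the quad-tree, using the factors on admissible leaves and the dense blocks on inadmissible ones.

```python
import numpy as np

def compress(node, A, k=8):
    """Attach rank-k factors (via truncated SVD) to admissible leaves."""
    (i0, i1), (j0, j1) = node["I"], node["J"]
    if node["leaf"]:
        if node["admissible"]:
            U, s, Vt = np.linalg.svd(A[i0:i1, j0:j1], full_matrices=False)
            node["U"], node["V"] = U[:, :k] * s[:k], Vt[:k, :].T
    else:
        for son in node["sons"]:
            compress(son, A, k)

def hmatvec(node, A, x, y):
    """Recursive matrix-vector product in the spirit of Algorithm 3."""
    (i0, i1), (j0, j1) = node["I"], node["J"]
    if node["leaf"]:
        if node["admissible"]:
            y[i0:i1] += node["U"] @ (node["V"].T @ x[j0:j1])   # low-rank block
        else:
            y[i0:i1] += A[i0:i1, j0:j1] @ x[j0:j1]             # dense block
    else:
        for son in node["sons"]:
            hmatvec(son, A, x, y)

n = 256
A = 1.0 / (1.0 + np.abs(np.subtract.outer(np.arange(n), np.arange(n))))
tree = block_cluster_tree((0, n), (0, n))     # from the previous sketch
compress(tree, A)
x, y = np.random.rand(n), np.zeros(n)
hmatvec(tree, A, x, y)
print(np.linalg.norm(y - A @ x) / np.linalg.norm(A @ x))   # relative error
```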
To evaluate the effectiveness of Algorithm 5, we applied it to the computation of the n-th power of a matrix, which is the key ingredient of a matrix polynomial evaluation. We denote the operation of computing the n-th power of the HM representation of by the symbol , where k and define the admissibility condition as described in Definition 4.
All the presented results were obtained by implementing Algorithm 5 in the MATLAB environment (see Table 1 for details about the computing resources used for the tests).
Table 1.
Hardware and software specs of computing resources used for tests.
Matrices from three case studies are considered, and case studies are described in the following.
- Case Study #1 The matrix is obtained by using the Matlab Airfoil Example (see [41] for details). The matrix is structured and sparse and its condition number in L2 norm is . The sparsity pattern of , and its first six n-powers , are presented in the first row of images in Figure 2.
- Case Study #2 The matrix is obtained from the SPARSKIT E20R5000 driven cavity example (see [42] for details). The matrix is structured and sparse and its condition number in L2 norm is . The sparsity pattern of , and its first six n-powers , are presented in the first row of images in Figure 3.
- Case Study #3 The matrix is obtained from the Harwell–Boeing Collection BCSSTK24 (BCS Structural Engineering Matrices) example (see [43] for details). The matrix is structured and sparse and its condition number in L2 norm is . The sparsity pattern of , and its first six n-powers , are presented in the first row of images in Figure 4.
All the considered matrices are scaled by the maximum value of the elements. From the first row of the images in Figure 2, Figure 3 and Figure 4, we can observe that the sparsity level of matrices decreases when the value of power degree n increases.
In Figure 2, Figure 3 and Figure 4, HM representations for the values of , and of matrices are shown as a sparsity pattern, where red and blue colors, respectively, represent the admissible and inadmissible blocks. For the admissible blocks, only the elements of each sub-block occupied by its rank-k approximation factors are marked in red.
The presented results have the aim to
- compare the theoretical value (obtained by repeatedly applying the estimation (30)) with the effective value (see (a–c).2 in Figure 5, Figure 6 and Figure 7). The value of is determined at each step n as follows, based on the following actions:
- –
- after the computation of the product (32) by Algorithm 5, for each admissible block of (identified by a couple of index sets), we compute the value for which the corresponding block of can be considered admissible (with respect to and );
- –
- we compute as follows, where is the set of the admissible blocks of .
The sparsity pattern representation in Figure 2, Figure 3 and Figure 4 shows how admissible and inadmissible blocks are distributed. Such information should help to analyze which matrix structure is best suited, in terms of memory occupancy, for an HM representation.
- some matrices seem to be more suitable than others for HM representation. In particular, it seems that matrices with a sparsity pattern that is not comparable to the one of a band matrix can be more effectively represented both in terms of the number of elements (for example, see the sparsity pattern for the matrices obtained from the Example Test #3 in Figure 4) and in terms of error evolution .
From Figure 5, Figure 6 and Figure 7 and Figure 2, Figure 3 and Figure 4, we can deduce that, among the example matrices proposed, the one that guarantees the best performance in terms of memory occupancy is the matrix . Indeed, the yellow line in plot (a–c).3 in Figure 7 almost always, depending on the value of , remains above the other lines in the same plot. We recall that the yellow line represents the trend, as a function of the power degree n, of the number of nonzero elements needed to represent ; the other lines, each for a different value of k, represent the total number of nonzero elements in both the admissible and inadmissible blocks of . The same behavior is not observable for the other example matrices; indeed, for matrix , the yellow line always remains below the other lines (see plot (a–c).3 in Figure 5); for matrix , the yellow line sometimes remains above the other lines, and generally all lines overlap (see plot (a–c).3 in Figure 6). From the images in Figure 4, we can observe that the admissible blocks concentrate near the diagonal, while the inadmissible blocks approximate the off-diagonal blocks with fewer elements the lower the required approximation precision is (see the images related to ). The matrix also has the best performance in terms of preservation of accuracy in the HM representation of matrices. This claim is supported by the error values reported in plots (a–c).1 in Figure 7 compared with the homologous plots in Figure 5 and Figure 6.
Figure 2.
Sparsity representation for Example Test #1.
Figure 3.
Sparsity representation for Example Test #2.
Figure 4.
Sparsity representation for Example Test #3.
Figure 5.
Results for Example Test #1: (a), (b), (c). (a–c).1 Trend of the error . (a–c).2 Trends of the numbers and . (a–c).3 Trends of the numbers and .
Figure 6.
Results for Example Test #2: (a), (b), (c). (a–c).1 Trend of the error . (a–c).2 Trends of the numbers and . (a–c).3 Trends of the numbers and .
Figure 7.
Results for Example Test #3: (a), (b), (c). (a–c).1 Trend of the error . (a–c).2 Trends of the numbers and . (a–c).3 Trends of the numbers and .
All the proposed algorithms related to basic linear algebra operations based on the HM representation can be easily parallelized, due to their recursive formulation, by distributing computations across the different components of the computing resource hierarchy. As an example, we report in Algorithm 6 a possible parallel implementation of Algorithm 3.
Algorithm 6 Parallel matrix–vector multiplication of the -matrix of blockwise rank k (defined on block cluster quad-tree ) with vector . The index sets to be used in the first call to the recursive ParHMatrix-MVM procedure are such that .
The pseudocode listed borrows the constructs used by tools such as OpenMP [44]: in particular, it uses the parallel for construct to indicate the distribution of the instructions included in its body among concurrent tasks, while the reduction instruction indicates that the different contributions to the vector must be added together at the end of the loop.
The value of the variable , at each level l of the block cluster quad-tree , is assumed to be such that , where is the cardinality of the set of sons of the index subset . If the value of divides such cardinality, at most tasks are spawned, and each task executes a new call to the ParHMatrix-MVM procedure. If , the execution of the l-th level of Algorithm 6 coincides with that of Algorithm 3.
We recall that BLAS (Basic Linear Algebra Subprograms) are a set of routines [45] that provide optimized standard building blocks for performing primary vector and matrix operations. BLAS routines can be classified depending on the types of operands: Level 1: operations involving just vector operands; Level 2: operations between vectors and matrices; and Level 3: operations involving just matrix operands. The operation is called a GEMV operation when , , , and are, respectively, a matrix, two vectors, and two scalars.
The GEMV BLAS2 operation needed at lines 12, 14, and 18 in Algorithm 6 is implemented by using the most effective component (identified by the macro ) of the computing architecture through a call to the optimized mathematical software libraries available for that component (for example, the multithreaded version of the Intel MKL [46] library or the cuBLAS [47] library when using, as , respectively, the Intel CPUs or the NVIDIA GP-GPU accelerators). All the issues related to the most efficient memory hierarchy accesses are delegated to such optimized versions of the BLAS procedures.
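A rough Python sketch of the task-parallel scheme of Algorithm 6 (an illustration only, not the authors' OpenMP/MKL/cuBLAS implementation; it reuses tree, hmatvec, and the matrix A from the earlier sketches): the sons of the root are processed by concurrent tasks, each accumulating into a private copy of the output vector, and the partial vectors are summed at the end, which plays the role of the reduction step; the leaf-level products are ordinary GEMV calls that NumPy delegates to the underlying BLAS.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def par_hmatvec(tree, A, x, n, workers=4):
    """Process the sons of the root concurrently; reduce the partial results."""
    def work(son):
        y_local = np.zeros(n)
        hmatvec(son, A, x, y_local)    # sequential recursion on one subtree
        return y_local
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(work, tree["sons"]))
    return sum(partials)               # reduction of the partial vectors

y_par = par_hmatvec(tree, A, x, n=256)
print(np.linalg.norm(y_par - A @ x) / np.linalg.norm(A @ x))
```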
In Figure 8, an example of the execution tree of Algorithm 6 is shown. In the left part of Figure 8, the block structure of the -matrix , defined on the block cluster quad-tree (where ), is represented. The admissible and inadmissible blocks are represented, respectively, by yellow and red boxes. In the execution tree of Algorithm 6 (see the right part of Figure 8), the leaves are represented by a green box. At each of the two levels of the tree, the considered value for is . The following steps are executed:
Figure 8.
Example of the execution tree of Algorithm 6. (a) Block structure of the -matrix . (b) Execution tree of Algorithm 6: (b.1) execution on the 1-st subtree; (b.2) execution on the 3-rd subtree; (b.3) execution on the 4-th subtree.
- Starting from level , concurrent tasks are spawned; the task with identification number is related to a leaf and, the block being an admissible one, it computes the contribution to the sub-block of vector related to the index subset by the code at lines 12 and 14 of Algorithm 6. Each of the remaining tasks executes a parallel for, each spawning other concurrent tasks executing at the following level , with a total of concurrent tasks.
- At the level , all the blocks are leaves. If the blocks are admissible, they are used to compute contributions to the sub-blocks of vector by the code at lines 12 and 14 of Algorithm 6; otherwise, the same sub-blocks are updated by the code at line 18. In particular, assuming that the variable is used to represent the identification number of a task spawned at the level l by a task with identifier , at the -th level,
- –
- tasks with identification numbers compute contributions to sub-blocks of related to the index subset and tasks with identification numbers update sub-blocks of related to the index subset . Then, all the tasks spawned from task (see Figure 8(b.1)) compute contributions to sub-blocks of related to the index subset .
- –
- In the same way, tasks with identification numbers compute contributions to sub-blocks of related to the index subset , and tasks with identification numbers compute contributions to sub-blocks of related to the index subset . Then, all the tasks spawned from task (see Figure 8(b.2)) compute contributions to sub-blocks of related to the index subset .
- –
- In the same way, the tasks with identification numbers (see Figure 8(b.3)) also compute contributions to sub-blocks of related to the index subset .
- At the termination of parallel for at level , the contributions to sub-blocks of related to the index subset and are summed together (by means of the reduce operation) to obtain the final status for vector .
3. Matrix Polynomial Evaluation
Let be a real polynomial of degree n of the matrix , where the set is the set of its coefficients.
Different methods can be used to evaluate polynomials defined by Equation (33) [1]. We propose the one of Paterson and Stockmeyer [48] in which is written as
where s is an integer parameter and
After the powers are computed, polynomials defined in Equation (34) can be evaluated by Horner’s method [1], where each is formed when needed.
The two extreme cases, and , reduce, respectively, to Horner’s method and to the method that evaluates polynomials via explicit powers.
The total cost of the polynomial evaluation is
where
and where denotes the computational cost of a matrix multiplication with .
, defined as in Equation (36), is approximately minimized by . From Equation (36), we can argue that the described method requires much less work than other ones, such as Horner’s method (whose computational cost is ), for large q.
In Algorithm 7, a procedure for matrix polynomial evaluation based on Paterson and Stockmeyer method is presented. The version of Algorithm 7 based on HM representation of involved matrices, and hence on the algorithms introduced in Section 2, is listed in Algorithm 8.
Algorithm 7 Procedure for matrix polynomial evaluation based on the Paterson and Stockmeyer method. and represent, respectively, the matrix and the result of Equation (33). n represents the degree of the polynomial and the vector of the polynomial coefficients .
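A compact Python/NumPy sketch of the Paterson–Stockmeyer scheme described above (illustrative only; the test matrix, the coefficients, and the choice s = 3 are arbitrary): the powers up to A^s are formed once, the polynomial is split into blocks of degree less than s, and the blocks are combined by a Horner recurrence in A^s.

```python
import numpy as np

def paterson_stockmeyer(A, c, s):
    """Evaluate p(A) = sum_k c[k] A^k with the Paterson-Stockmeyer scheme."""
    n = len(c) - 1                      # polynomial degree
    N = A.shape[0]
    pow_A = [np.eye(N)]                 # A^0, A^1, ..., A^s
    for _ in range(s):
        pow_A.append(pow_A[-1] @ A)
    r = (n + s) // s                    # number of degree-(s-1) blocks
    P = np.zeros((N, N))
    for j in reversed(range(r)):        # Horner recurrence in X = A^s
        B = np.zeros((N, N))            # B_j(A) = sum_i c[j*s+i] A^i
        for i in range(s):
            if j * s + i <= n:
                B += c[j * s + i] * pow_A[i]
        P = P @ pow_A[s] + B
    return P

A = np.random.rand(50, 50) / 50.0
c = np.random.rand(9)                   # a degree-8 polynomial
ref = sum(ck * np.linalg.matrix_power(A, k) for k, ck in enumerate(c))
print(np.linalg.norm(paterson_stockmeyer(A, c, s=3) - ref))
```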
To evaluate Algorithm 8, where the involved matrices have an HM representation, we performed some tests using the same case studies described in Section 2. The coefficients of the polynomial were generated randomly according to a uniform distribution in the interval .
The presented results (see Table 2) have the aim to show the evolution of the errors , as a function of the polynomial degree n, for the fixed value of , where
and where the symbols and represent
Table 2.
Polynomial evaluation test results: list of the errors as a function of the polynomial degree n.
- : the evaluation of the polynomial (33) through Algorithm 8 for the matrix ;
- : the natural representation of the evaluation of the polynomial (33) by means of Algorithm 8, where the involved matrices have an HM representation. In such a case, all the operations involving matrices (summation, product, and exponentiation) are based on Algorithms 2 and 5. In the summation operations, , the value of the result is computed as the maximum value between the values and of the operands and .
Algorithm 8 Procedure for matrix polynomial evaluation based on the Paterson and Stockmeyer method and on the HM representation of the involved matrices (defined on block cluster quad-tree ). and represent, respectively, the HM representation of the matrix and the result of Equation (33). n represents the degree of the polynomial and the vector of the polynomial coefficients .
From Table 2, it can be observed that the evolution of the errors , as a function of the polynomial degree n, seems to be significantly sensitive to the matrix type, to the point that such errors may explode. Such behavior can have serious consequences on the results of operations involving this kind of polynomial. This is the case of the example shown in the case study described in Section 4, where the result of a matrix polynomial evaluation is used in a matrix–vector operation of the kind . If we denote by the error in representing and by the result of the operation , then we have
Therefore, from (39), it follows that the relative error of the result of the perturbed matrix–vector operation has an upper bound depending on the norm of the error on matrix . So, if that norm has a large value, the relative error can potentially be large as well.
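As a minimal illustration of this bound, assuming the exact operation is y = P v while the perturbed one uses P + E:

```latex
\tilde{y} = (P + E)\,v
\;\Longrightarrow\;
\|y - \tilde{y}\| = \|E\,v\| \le \|E\|\,\|v\|
\;\Longrightarrow\;
\frac{\|y - \tilde{y}\|}{\|y\|} \le \|E\|\,\frac{\|v\|}{\|P\,v\|} .
```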
Algorithm 7 for matrix polynomial evaluation based on the Paterson and Stockmeyer method (and then Algorithm 8, which is its HM-based variant) lends itself well to parallel implementations; for example, if we imagine dividing the r matrices among concurrent tasks, then
- the s matrices and the r matrices (see lines 6 and 9 of Algorithm 7) could be computed by all the tasks concurrently and independently of each other (no communications are needed among tasks);
- during the update phase of polynomial (see line 20 of Algorithm 7), the local update of on each task can be performed concurrently (no communications are needed among tasks). To complete the computation of , just one collective communication is needed at the end of the algorithm.
Furthermore, every call to a function/procedure of the kind HMatrix-* in Algorithm 8 can be executed by each task using a locally available type of concurrency (a set of CPUs, cores, or accelerators) and fully exploiting the hierarchy of processing units (see Algorithm 6 for an example of such a strategy).
In Figure 9, we show an example of a parallel implementation of Algorithm 7 (and then of Algorithm 8) by and . Each task may be mapped to a node holding devices such as CPUs and accelerators. The computation of operations involving HMs can be locally executed by an execution tree such as those shown in Figure 8.
Figure 9.
Example of parallel implementation of Algorithm 7 (and then of Algorithm 8) by using and .
4. A Case Study in Graph Convolutional Deep Neural Network Applications
As very effective tools for learning on graph-structured data (that is, for building a data-driven model from such data), Graph Neural Networks (GNNs) have shown cutting-edge performance in a variety of applications, including biological network modeling and social network analysis. Compared to analyzing data items separately, the special capability of graphs to capture the structural relationships between data allows for the extraction of additional insights. Among the GNNs, Graph Convolutional Deep Neural Networks [49] appear as one of the most prominent graph deep learning models [50].
4.1. Introduction to Graph Convolutional Deep Neural Network
Graph Convolutional Deep Neural Networks are based on the theory of Spectral Analysis of a graph [50,51,52,53]. Spectral graph theory is the field concerned with the study of the eigenvectors and eigenvalues of the matrices that are naturally associated with graphs. One of the goals is to determine important properties of the graph from its graph spectrum.
Apart from theoretical interest, spectral graph theory also has a wide range of applications. Among them, we have to cite the construction of graph filters that are defined in the context of Discrete Signal Processing on Graphs (), whose aim is to represent, process, and analyze structured datasets that can be represented by graphs [52,54,55].
Let us introduce some definitions.
Definition 5
(Signal defined on a weighted undirected graph ). A weighted undirected graph is a triple , where is the set of nodes and is the set of edges. An edge is a couple of indices such that nodes and are considered connected in graph .
The set of values is called the set of weights of , where
The matrix is the weighted adjacency matrix of the graph if
Assuming, without a loss of generality, that dataset elements are real scalars, we define a graph signal as a map from the set of nodes to the set of real numbers ℜ:
For simplicity of discussion, we write graph signals as vectors .
Definition 6
(Linear filters for signal defined on a weighted undirected graph ). A function is called a filter on a graph . A filter h on a graph is called a Linear Filter if a matrix exists such that
Definition 7
(Graph Fourier Transform of a signal defined on a weighted undirected graph ). Let us define the Graph Laplacian matrix of the weighted undirected graph as the following symmetric positive semidefinite matrix:
where is a diagonal matrix whose diagonal vector is defined as
Matrix is called degree matrix. In some cases, the Symmetric Normalized Laplacian is defined as
Let us consider the singular value decomposition (SVD) [38] of the Laplacian matrix ; then, due to the properties of , such decomposition can be written as
where the set of the columns of the matrix are a set of orthonormal vectors called Graph Fourier Modes and where the diagonal elements of the diagonal matrix Λ are non-negative and are identified as the frequencies of the graph.
Following Shuman et al. [56], the Graph Fourier Transform of a signal defined on a weighted undirected graph is
The inverse operation is the Inverse Graph Fourier Transform of signal .
Definition 8
(Convolution operation between signals defined on a weighted undirected graph ). Let and be two signals defined on a weighted undirected graph ; we can define the following convolution operation between signals and such that
where ⊙ represents the element-wise Hadamard product.
Definition 9
(Spectral filtering of a signal defined on a weighted undirected graph ). Let be a polynomial of the matrix with order K, where the set is the set of its coefficients; then,
Filtering a signal , defined on a weighted undirected graph , by a filter of the kind (49) is equivalent to the convolution operation , where
and where
As observed in [4,5], spectral filters represented by K-th order polynomials of the Laplacian are exactly K-localized. Indeed,
where is the shortest path distance, i.e., the minimum number of edges connecting two nodes and on the graph.
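To make the filtering operation and its locality concrete, the following Python/NumPy sketch (an illustration with an arbitrary path graph and arbitrary coefficients; the Laplacian is assumed to be the standard symmetric normalized one, L = I − D^{−1/2} W D^{−1/2}) applies a polynomial filter by repeated matrix–vector products: with a filter of order K = 2, an impulse placed on the first node spreads only to nodes within shortest-path distance 2.

```python
import numpy as np

def normalized_laplacian(W):
    """Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return np.eye(W.shape[0]) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

def poly_filter(L, theta, x):
    """Apply the spectral filter h(L) = sum_k theta[k] L^k to a signal x,
    using only matrix-vector products (the matrix polynomial is never formed)."""
    y = np.zeros_like(x)
    Lx = x.copy()                 # L^0 x
    for t in theta:
        y += t * Lx
        Lx = L @ Lx               # next power of L applied to x
    return y

# A path graph on 6 nodes: an order-2 filter only mixes values of
# nodes at shortest-path distance <= 2 (K-localization).
W = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
L = normalized_laplacian(W)
x = np.zeros(6)
x[0] = 1.0
print(poly_filter(L, theta=[0.5, 0.3, 0.2], x=x))   # zero beyond node 2
```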
In the field of machine learning based on neural networks, the most widely used approach to build a data-driven model is related to a data fitting problem. In such a case, a function , defined through a set of k parameters , should be determined by a learning process on known information to be subsequently used to predict/describe new information.
With the term data fitting, we denote the process of constructing a mathematical function (the model) that has the best fit to a series of data points . Curve fitting can involve either interpolation, where an exact fit to the data is required (i.e., ), or smoothing, in which a smooth function g is constructed that approximately fits the data; that is,
for some small value for and some norm defined on .
Definition 10
(Graph Convolutional Deep Neural Network (DNN) defined on a weighted undirected graph ). Let be a DNN composed of L layers and let be the number of neurons in the l-th layer of . Suppose that the l-th layer of neurons represents the elements of a signal defined on a weighted undirected graph . Let us assume, for each l-th layer of , a spectral filter as defined in Definition 9.
The fitting function of of a Graph Convolutional DNN is defined as the following function compositions:
Regarding the composition of multiple functions in (53), let and be two functions. Then, the composition of f and g, denoted by , is defined as the function given by .
Given a set of M functions such that , with the symbol , it is intended
Each function in (53) is defined as the composed function
where is the so-called activation function, where
and where .
We observe that at the basis of both the learning and predicting phases of a Graph Convolutional DNN are the operations needed by Equation (55) that are, for each :
- given the values of parameters , the evaluation of the matrix polynomial of type (49) that computes the matrix ;
- given the matrix , the computation of the matrix–vector product .
4.2. Preliminary Tests on Two Toy Examples from GC-DNN Context
In the following, some tests are presented to evaluate how the use of Algorithms 8 and 3, both based on the HM representation of the involved matrices and, respectively, used in steps 1 and 2 above, impacts the accuracy of the computation results. The tests aim to assess the feasibility of using HM-based methods during the relevant phases of a GC-DNN.
Two examples are used. They are based on graphs constructed as a result of two different types of affinity functions applied to the image represented in Figure 10. is a grayscale image, composed of pixels, with 256 gray levels. In particular,
Figure 10.
An image for GC-DNN toy examples.
Example 1.
A mutual 5-nearest-neighbor weighted graph [51] is considered where the set of nodes coincides with the set of all the pixel values of . Two nodes are considered connected by an edge if both is among the 5 nearest neighbors of and is among the 5 nearest neighbors of . The distance between connected nodes is computed by the Spatial Gaussian affinity function [57] defined as
where is the coordinate vector of the pixel associated with the node and . The weighted adjacency matrix of the graph is defined as in Equation (40). The considered Laplacian matrix is the Symmetric Normalized one as defined in Equation (45) (see Figure 11a for the sparsity representation of ), and its condition number in L2 norm is .
Figure 11.
Sparsity representation of the Laplacian matrices for GC-DNN toy examples. (a) Example 1. (b) Example 2.
Example 2.
A weighted graph is considered where the set of nodes coincides with the set of all the pixel values of . Two nodes are considered connected by an edge if is among the 50 nearest neighbors of and is among the 50 nearest neighbors of in both of two distances: the Spatial Gaussian affinity function and the Photometric Gaussian affinity function [57]. The distance between connected nodes is then defined as
where is the coordinate vector of the pixel associated with the node , is the grayscale value of the pixel associated with the node , , and . The weighted adjacency matrix of the graph is defined as in Equation (40). The considered Laplacian matrix is based on Equation (43) (see Figure 11b for the sparsity representation of ), and its condition number in L2 norm is numerically infinite.
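A small Python/NumPy sketch of the kind of construction used in Example 1 (illustrative only: the image is replaced by a tiny synthetic 8 × 8 grid, and the values of k and σ are assumptions): nodes are pixels, edges connect mutual k-nearest neighbors in the image plane, and weights follow the spatial Gaussian affinity.

```python
import numpy as np

def spatial_affinity_graph(h, w, k=5, sigma=2.0):
    """Weighted adjacency matrix of a mutual k-nearest-neighbor graph on an
    h x w pixel grid, with spatial Gaussian weights
    w_ij = exp(-||p_i - p_j||^2 / (2 sigma^2))."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                      # no self-loops
    knn = np.argsort(d2, axis=1)[:, :k]               # k nearest neighbors
    near = np.zeros(d2.shape, dtype=bool)
    near[np.arange(d2.shape[0])[:, None], knn] = True
    mutual = near & near.T                            # mutual k-NN edges only
    return np.where(mutual, np.exp(-d2 / (2.0 * sigma**2)), 0.0)

W = spatial_affinity_graph(8, 8)     # a tiny 8 x 8 "image" for illustration
print(W.shape, int((W > 0).sum()))   # number of (directed) nonzero weights
```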
In both examples, no scaling operation is performed on the Laplacian matrix. Moreover, the HM representation of the considered Laplacian matrix is based on the following parameters: , . The coefficients of the polynomial were generated randomly according to a uniform distribution in the interval .
In Table 3, the following are listed for different values of the polynomial degree n and different values s for the factors degree:
Table 3.
Results from tests on GC-DNN toy examples.
- the norm , already defined in Section 3, of the difference between the results of the polynomial evaluation operations performed with and without an HM representation for matrices;
- the norm of the difference between the results of the matrix–vector operations performed with and without HM-based algorithms.
Looking at Table 3, the explosion of both the errors and attracts attention (especially for the second example). As suggested by Higham [58], the results of these preliminary tests confirm the problems related to the evaluation of matrix power, i.e., . In particular, what makes the difference is not only the matrix condition number but also the behavior of the sequence . In Figure 12, we report the trends of the L2 norm of the Laplacian matrices’ powers for both the GC-DNN toy examples as functions of the power index i.
Figure 12.
Trends, as function of the power index i, of the L2 norm of the Laplacian matrices’ powers for both the GC-DNN toy examples.
Therefore, while the possibility of using HMs in the GC-DNN context remains an interesting option, much work remains to be done to (1) identify the matrix characteristics that make a matrix less sensitive to error amplification or (2) define mitigation strategies to reduce such amplification.
5. Conclusions and Future Work
Following a co-design approach that has historically guided and influenced the evolution of algorithms for upcoming computing systems, this is the moment, at the dawn of the Exascale Computing Era, to invest once again in the development of new algorithms that meet the growing need to ensure a high level of scalability and granularity.
In this context, methods based on hierarchical matrices (HMs) have been included among the most promising in the use of new computing resources precisely because of their strongly hierarchical nature.
This work aims to begin establishing the advantages and limitations of using HMs in operations such as the evaluation of matrix polynomials, which are crucial, for example, in the Graph Convolutional Deep Neural Network (GC-DNN) context. The presented tests show how the use of HMs still seems to be far from effective in complex contexts such as matrix polynomial evaluation in real applications. These difficulties seem to be related to some characteristics of the matrices, such as their sparsity pattern or their maximum values [58].
So, bearing in mind the idea of building a truly effective and efficient tool that can make the most of modern supercomputing architectures, our future work will focus on the following: (1) a theoretical study of the characteristics of the matrices that make them more suitable (in terms of error propagation and memory occupancy) for an HM representation in the context of the evaluation of matrix polynomials; (2) the definition of a mitigation strategy for issues that lead to error amplification in order to achieve more stable algorithms [58]: techniques based on permutation, scaling, and/or normalization of the matrices could be considered; (3) the full implementation, in an HPC context, of the presented algorithms (i.e., see Algorithms 6 and 8) and the evaluation of their performances; (4) the validation of the proposed approach in a real application (i.e., from the strategic GC-DNN context).
Although the interest in hierarchical matrices is still high, even among those who develop mathematical software libraries (see, for example, the list available in [59]), and despite the high potential announced in works that, a decade ago, imagined the future of supercomputing [25], the current investment in the creation and use of parallel software libraries based on hierarchical matrices seems very limited (one can mention the HMLib library [60] or the experiments from the MAGMA developers [61,62]). We, therefore, hope that this work can be a stimulus to revive interest in such tools in advanced computing contexts.
Author Contributions
Conceptualization, L.C.; methodology, L.C.; software, L.C.; validation, L.C. and V.M.; formal analysis, L.C. and V.M.; writing—original draft preparation, L.C.; writing—review and editing, L.C. and V.M. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
Data will be made available on request.
Acknowledgments
Luisa Carracciuolo is a member of the “Gruppo Nazionale Calcolo Scientifico-Istituto Nazionale di Alta Matematica (GNCS-INdAM)”. This work was carried out using the computational resources available at the scientific datacenter of the University of Naples Federico II (Naples, Italy).
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Higham, N. Functions of Matrices: Theory and Computation; Other Titles in Applied Mathematics, Society for Industrial and Applied Mathematics (SIAM): Philadelphia, PA, USA, 2008. [Google Scholar]
- Wang, B.; Kestelyn, X.; Kharazian, E.; Grolet, A. Application of Normal Form Theory to Power Systems: A Proof of Concept of a Novel Structure-Preserving Approach. In Proceedings of the 2024 IEEE Power & Energy Society General Meeting (PESGM), Seattle, WA, USA, 21–25 July 2024; pp. 1–5. [Google Scholar] [CrossRef]
- Schmelzer, T.; Trefethen, L.N. Evaluating matrix functions for exponential integrators via Carathéodory-Fejér approximation and contour integrals. Electron. Trans. Numer. Anal. 2007, 29, 1–18. [Google Scholar]
- Daigavane, A.; Ravindran, B.; Aggarwal, G. Understanding Convolutions on Graphs. Distill 2021. [Google Scholar] [CrossRef]
- Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3844–3852. [Google Scholar]
- Carracciuolo, L.; Lapegna, M. Implementation of a non-linear solver on heterogeneous architectures. Concurr. Comput. Pract. Exp. 2018, 30, e4903. [Google Scholar] [CrossRef]
- Mele, V.; Constantinescu, E.M.; Carracciuolo, L.; D’Amore, L. A PETSc parallel-in-time solver based on MGRIT algorithm. Concurr. Comput. Pract. Exp. 2018, 30, e4928. [Google Scholar] [CrossRef]
- Carracciuolo, L.; Mele, V.; Szustak, L. About the granularity portability of block-based Krylov methods in heterogeneous computing environments. Concurr. Comput. Pract. Exp. 2021, 33, e6008. [Google Scholar] [CrossRef]
- Carracciuolo, L.; Casaburi, D.; D’Amore, L.; D’Avino, G.; Maffettone, P.; Murli, A. Computational simulations of 3D large-scale time-dependent viscoelastic flows in high performance computing environment. J.-Non-Newton. Fluid Mech. 2011, 166, 1382–1395. [Google Scholar] [CrossRef]
- Carracciuolo, L.; D’Amore, L.; Murli, A. Towards a parallel component for imaging in PETSc programming environment: A case study in 3-D echocardiography. Parallel Comput. 2006, 32, 67–83. [Google Scholar] [CrossRef]
- Murli, A.; D’Amore, L.; Carracciuolo, L.; Ceccarelli, M.; Antonelli, L. High performance edge-preserving regularization in 3D SPECT imaging. Parallel Comput. 2008, 34, 115–132. [Google Scholar] [CrossRef]
- D’Amore, L.; Constantinescu, E.; Carracciuolo, L. A Scalable Space-Time Domain Decomposition Approach for Solving Large Scale Nonlinear Regularized Inverse Ill Posed Problems in 4D Variational Data Assimilation. J. Sci. Comput. 2022, 91. [Google Scholar] [CrossRef]
- Carracciuolo, L.; D’Amora, U. Mathematical Tools for Simulation of 3D Bioprinting Processes on High-Performance Computing Resources: The State of the Art. Appl. Sci. 2024, 14, 6110. [Google Scholar] [CrossRef]
- Reed, D.A.; Dongarra, J. Exascale Computing and Big Data. Commun. ACM 2015, 58, 56–68. [Google Scholar] [CrossRef]
- Petitet, A.; Whaley, R.C.; Dongarra, J.; Cleary, A. A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers. Innovative Computing Laboratory. 2000. Available online: https://icl.utk.edu/hpl/index.html (accessed on 1 April 2025).
- Top 500—The List. Available online: https://www.top500.org/ (accessed on 1 April 2025).
- Top 500 List—June 2022. Available online: https://www.top500.org/lists/top500/2022/06/ (accessed on 1 April 2025).
- Geist, A.; Lucas, R. Major Computer Science Challenges At Exascale. Int. J. High Perform. Comput. Appl. 2009, 23, 427–436. [Google Scholar] [CrossRef]
- Chen, W. The demands and challenges of exascale computing: An interview with Zuoning Chen. Natl. Sci. Rev. 2016, 3, 64–67. [Google Scholar] [CrossRef][Green Version]
- Kumar, V.; Gupta, A. Analyzing Scalability of Parallel Algorithms and Architectures. J. Parallel Distrib. Comput. 1994, 22, 379–391. [Google Scholar] [CrossRef]
- Kwiatkowski, J. Evaluation of Parallel Programs by Measurement of Its Granularity. In Proceedings of the Parallel Processing and Applied Mathematics; Wyrzykowski, R., Dongarra, J., Paprzycki, M., Waśniewski, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2002; pp. 145–153. [Google Scholar] [CrossRef]
- Ewart, T.; Cremonesi, F.; Schürmann, F.; Delalondre, F. Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function ex. ACM Trans. Math. Softw. 2020, 46, 28. [Google Scholar] [CrossRef]
- Munro, I.; Paterson, M. Optimal algorithms for parallel polynomial evaluation. In Proceedings of the 12th Annual Symposium on Switching and Automata Theory (SWAT 1971), East Lansing, MI, USA, 13–15 October 1971; pp. 132–139. [Google Scholar] [CrossRef]
- Keyes, D.E. Exaflop/s: The why and the how. Comptes Rendus. MÉcanique 2011, 339, 70–77. [Google Scholar] [CrossRef]
- Ang, J.; Evans, K.; Geist, A.; Heroux, M.; Hovland, P.D.; Marques, O.; Curfman McInnes, L.; Ng, E.G.; Wild, S.M. Report on the Workshop on Extreme-Scale Solvers: Transition to Future Architectures. Report, U.S. Department of Energy, ASCR, 2012. Available online: https://science.osti.gov/-/media/ascr/pdf/program-documents/docs/reportExtremeScaleSolvers2012.pdf (accessed on 1 April 2025).
- Greengard, L.; Rokhlin, V. A fast algorithm for particle simulations. J. Comput. Phys. 1987, 73, 325–348. [Google Scholar] [CrossRef]
- Martinsson, P.G. Fast Multipole Methods. In Encyclopedia of Applied and Computational Mathematics; Springer: Berlin/Heidelberg, Germany, 2015; pp. 498–508. [Google Scholar] [CrossRef]
- Cipra, B.A. The Best of the 20th Century: Editors Name Top 10 Algorithms. SIAM News 2000, 33, 1–2. [Google Scholar]
- Beatson, R.; Greengard, L. A Short Course on Fast Multipole Methods, 2001. Available online: https://math.nyu.edu/~greengar/shortcourse_fmm.pdf (accessed on 1 April 2025).
- Fenn, M.; Steidl, G. FMM and H-Matrices: A Short Introduction to the Basic Idea. Technical Report, Department for Mathematics and Computer Science, University of Mannheim, 2002. Available online: https://madoc.bib.uni-mannheim.de/744/ (accessed on 1 April 2025).
- Hackbusch, W.; Grasedyck, L.; Börm, S. An introduction to hierarchical matrices. Math. Bohem. 2002, 127, 229–241. [Google Scholar] [CrossRef]
- Hackbusch, W. Hierarchical Matrices: Algorithms and Analysis, 1st ed.; Springer Series in Computational Mathematics; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar] [CrossRef]
- Börm, S.; Grasedyck, L.; Hackbusch, W. Hierarchical Matrices. Technical Report, Max-Planck-Institut für Mathematik, 2003. Report Number: Lecture Notes 21/2003. Available online: https://www.mis.mpg.de/publications/preprint-repository/lecture_note/2003/issue-21 (accessed on 1 April 2025).
- Yokota, R.; Turkiyyah, G.; Keyes, D. Communication Complexity of the Fast Multipole Method and its Algebraic Variants. Supercomput. Front. Innov. 2014, 1, 63–84. [Google Scholar] [CrossRef]
- Saperas-Riera, J.; Mateu-Figueras, G.; Martín-Fernández, J.A. Lasso regression method for a compositional covariate regularised by the norm L1 pairwise logratio. J. Geochem. Explor. 2023, 255, 107327. [Google Scholar] [CrossRef]
- Serajian, M.; Marini, S.; Alanko, J.N.; Noyes, N.R.; Prosperi, M.; Boucher, C. Scalable de novo classification of antibiotic resistance of Mycobacterium tuberculosis. Bioinformatics 2024, 40, i39–i47. [Google Scholar] [CrossRef] [PubMed]
- Lim, H. Low-rank learning for feature selection in multi-label classification. Pattern Recognit. Lett. 2023, 172, 106–112. [Google Scholar] [CrossRef]
- Golub, G.H.; Van Loan, C.F. Matrix Computations, 3rd ed.; The Johns Hopkins University Press: Baltimore, MD, USA, 1996. [Google Scholar]
- Weisstein, E.W. Binary Tree. From MathWorld—A Wolfram Web Resource, 2024. Available online: https://mathworld.wolfram.com/BinaryTree.html (accessed on 1 April 2025).
- Grasedyck, L.; Hackbusch, W. Construction and Arithmetics of H-Matrices. Computing 2003, 70, 295–334. [Google Scholar] [CrossRef]
- Graphical Representation of Sparse Matrices. Available online: https://www.mathworks.com/help/matlab/math/graphical-representation-of-sparse-matrices.html (accessed on 1 April 2025).
- E20R5000: Driven Cavity, 20×20 Elements, Re = 5000. Available online: https://math.nist.gov/MatrixMarket/data/SPARSKIT/drivcav/e20r5000.html (accessed on 1 April 2025).
- BCSSTK24: BCS Structural Engineering Matrices (Eigenvalue Problems) Calgary Olympic Saddledome Arena. Available online: https://math.nist.gov/MatrixMarket/data/Harwell-Boeing/bcsstruc3/bcsstk24.html (accessed on 1 April 2025).
- The OpenMP API Specification for Parallel Programming. Available online: https://www.openmp.org/specifications/ (accessed on 1 April 2025).
- Lawson, C.L.; Hanson, R.J.; Kincaid, D.R.; Krogh, F.T. Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. Math. Softw. 1979, 5, 308–323. [Google Scholar] [CrossRef]
- BLAS and Sparse BLAS Routines of the Intel Math Kernel Library. Available online: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-1/blas-and-sparse-blas-routines.html (accessed on 1 April 2025).
- Basic Linear Algebra on NVIDIA GPUs. Available online: https://developer.nvidia.com/cublas (accessed on 1 April 2025).
- Paterson, M.S.; Stockmeyer, L.J. On the Number of Nonscalar Multiplications Necessary to Evaluate Polynomials. SIAM J. Comput. 1973, 2, 60–66. [Google Scholar] [CrossRef]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
- Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 11. [Google Scholar] [CrossRef]
- von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 2007, 17, 395–416. [Google Scholar] [CrossRef]
- Chen, X. Understanding Spectral Graph Neural Network. arXiv 2020, arXiv:2012.06660. [Google Scholar]
- Nikolaos, K. Spectral Graph Theory and Deep Learning on Graphs. Master’s Thesis, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece, 2017. [Google Scholar] [CrossRef]
- Sandryhaila, A.; Moura, J.M.F. Discrete Signal Processing on Graphs. IEEE Trans. Signal Process. 2013, 61, 1644–1656. [Google Scholar] [CrossRef]
- Sandryhaila, A.; Moura, J.M.F. Discrete signal processing on graphs: Graph Fourier transform. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6167–6170. [Google Scholar] [CrossRef]
- Shuman, D.I.; Narang, S.K.; Frossard, P.; Ortega, A.; Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 2013, 30, 83–98. [Google Scholar] [CrossRef]
- Wobrock, D. Image Processing Using Graph Laplacian Operator. Master’s Thesis, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden, 2019. Available online: https://github.com/David-Wobrock/master-thesis-writing/blob/master/master_thesis_david_wobrock.pdf (accessed on 1 April 2025).
- Higham, N.J. Accuracy and Stability of Numerical Algorithms, 2nd ed.; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2002. [Google Scholar]
- Curated List of Hierarchical Matrices. Available online: https://github.com/gchavez2/awesome_hierarchical_matrices (accessed on 1 April 2025).
- HLIBpro: Is a Software Library Implementing Algorithms for Hierarchical Matrices. Available online: https://www.hlibpro.com/ (accessed on 1 April 2025).
- The Matrix Algebra on GPU and Multicore Architecture (MAGMA) Library Website. Available online: http://icl.cs.utk.edu/magma/ (accessed on 1 April 2025).
- Yamazaki, I.; Abdelfattah, A.; Ida, A.; Ohshima, S.; Tomov, S.; Yokota, R.; Dongarra, J. Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU Clusters. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, BC, Canada, 1 May 2018; Available online: https://icl.utk.edu/files/publications/2018/icl-utk-1049-2018.pdf (accessed on 1 April 2025).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).