GPGCN: A General-Purpose Graph Convolution Neural Network Accelerator Based on RISC-V ISA Extension

Abstract: In the past two years, various graph convolution neural network (GCN) accelerators have emerged, each with its own characteristics. Their common disadvantage is that the hardware architecture is not programmable and is optimized for a specific network and dataset. They may not support acceleration for different GCNs and may not achieve optimal hardware resource utilization for datasets of different sizes. Given these shortcomings, and following the development trend of traditional neural network accelerators, this paper proposes and implements GPGCN: a general-purpose GCNs accelerator architecture based on a RISC-V instruction set extension. It provides the software programming freedom to support acceleration for various GCNs and aims at the best acceleration efficiency for different GCNs with different datasets. Compared with a traditional CPU and a traditional CPU with vector extension, GPGCN achieves above 1001× and 267× speedup, respectively, for GCN with the Cora dataset. Compared with dedicated accelerators, GPGCN has software programmability and supports the acceleration of more GCNs.


Introduction
Since the GCNs hardware accelerator HYGCN [1] was proposed in 2020, various GCNs accelerators [1–11] have emerged one after another. They differ in calculation method, control flow, and scheduling algorithm, and each has its own advantages in accelerating GCNs [12,13] such as GCN [14], GIN [15], and GSC [16]. HYGCN [1] proposes a GCNs accelerator composed of an aggregation phase and a combination phase. ENGN [2] optimizes the computation order of aggregation and combination to improve acceleration efficiency. AWB-GCN [3] optimizes the execution unit scheduling algorithm to balance the workload of each execution unit and improve overall efficiency. However, they also have hidden downsides. A common disadvantage is that the hardware architecture is not programmable and is optimized for a specific network and dataset. Their fixed calculation process may accelerate certain GCNs well for datasets of specific sizes and formats, but not other GCNs with different datasets, because the hardware offers no programmability.
Moreover, from the experience of the development history of traditional deep learning neural network accelerators, such as Cambricon [17], Grayskull [18] from Tenstorrent, and RASA [19] from Intel, it is understood that a software-programmable GCNs accelerator architecture based on instruction set is the trend of unified GCNs accelerator architecture in the future.
Traditional GCN hardware accelerators, such as HYGCN [1], ENGN [2], and AWB-GCN [3], are dedicated accelerators, as shown on the right side of Figure 1. They have the advantages of high energy efficiency and high execution rate. However, they offer little or no programmability, which limits them to a fixed acceleration execution mode for the network.
In the general-purpose processor architecture, there are particular SIMD instruction set extensions for control-intensive operations with vector calculations, such as Intel's

GCNs Analysis
Unlike traditional convolutional networks, which process data in Euclidean space, GCNs process data in non-Euclidean space, such as graph data. The calculation of GCNs generally consists of two phases, aggregation and combination, as shown in Figure 2. Aggregation fuses the feature vectors of each vertex's adjacent vertices in some way, for example by an average or a weighted sum, to obtain a new feature vector, as shown in the middle part of Figure 2. After aggregation, the feature vector of a vertex carries information about its neighbor vertices. Combination then uses the aggregated feature vector to perform a fully connected convolution calculation and obtain a low-dimensional feature vector, as shown in the right part of Figure 2. Combination thus extracts low-dimensional information from the features of the vertex and its neighbor vertices. Therefore, the unified mathematical expression of most GCNs is Equation (1), where A is the adjacency matrix, H is the feature matrix, and W is the weight matrix:

$$\text{GCNs} = A \cdot H \cdot W \tag{1}$$

The process of aggregation followed by combination is repeated two or three times, and the final low-dimensional vector is used to complete tasks such as classification.

Most GCNs aggregation processes can be expressed as the multiplication of the weighted adjacency matrix A and the feature matrix H, which is composed of each vertex's feature vector. The networks differ in how the adjacency matrix A is obtained: the adjacency matrix of GCN is calculated from the degree matrix, while the adjacency matrix of GAT is learned through the training process. Therefore, the aggregation process of these GCNs can be expressed as in Equation (2):

$$\text{Aggregation} = A \cdot H \tag{2}$$

Although aggregation can be uniformly expressed as matrix multiplication, the adjacency matrix A of a graph dataset is always highly sparse, that is, a large proportion of its elements are 0. For example, as shown in Table 1, the adjacency matrix sparsity is 99.856% for the Cora dataset, 99.918% for the Citeseer dataset, 99.977% for the Pubmed dataset, and 99.9942% for the Nell dataset. All four typical graph datasets therefore have adjacency matrices with sparsity above 99%, and the aggregation of GCNs is in practice a sparse matrix multiplication (sparse-GEMM).

However, when the sparsity of the matrix is this high, even space-aware sparse matrix multiplication (sp-GEMM) lets elements with a value of 0 occupy hardware resources unnecessarily and waste time. Therefore, compared with a dense matrix operation, the aggregation calculation is not computation-intensive but control-intensive. Control-intensive processing is better served by control logic plus vector PEs (processing elements), as shown in the upper part of Figure 3.

For such a sparse matrix operation, a suitable approach is shown in Figure 4. The adjacency matrix A with extremely high sparsity is stored in CSR format. The feature matrix H is divided by row into vectors. For the non-zero elements of each row of the adjacency matrix, the column coordinates stored in the CSR format are used as indexes, and the feature vector in the corresponding row of the feature matrix H is fetched for the aggregation operation.

The calculation of the combination phase is similar across networks. Whether it is a fully connected convolution or a multilayer perceptron, it is a dense matrix multiplication (dense-GEMM) or the multiplication of a matrix with a certain degree of sparsity by a dense matrix. It is computation-intensive and is suited to a matrix-form computing array, as shown in the lower part of Figure 3.
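The CSR-indexed aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the accelerator's implementation: the function name `aggregate_csr`, the toy graph, and the use of plain Python lists are assumptions made for the example.

```python
def aggregate_csr(row_ptr, col_idx, values, H):
    """Compute A @ H, with A given in CSR form and H as a list of feature rows."""
    num_vertices = len(row_ptr) - 1
    feat_len = len(H[0])
    out = [[0.0] * feat_len for _ in range(num_vertices)]
    for v in range(num_vertices):
        # Only the non-zero entries of row v are visited.
        for k in range(row_ptr[v], row_ptr[v + 1]):
            u = col_idx[k]       # column coordinate = row index into H
            w = values[k]        # edge weight stored in A
            for j in range(feat_len):
                out[v][j] += w * H[u][j]
    return out


# Toy 3-vertex graph (hypothetical data).
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 1.0, 1.0, 0.5, 0.5]
H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = aggregate_csr(row_ptr, col_idx, values, H)
```

The inner loop touches one feature vector per non-zero element, so the work scales with the number of edges rather than with the full size of A.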

Basic Features of GPGCN Custom Instruction Set Architecture
GPGCN packs macro instructions, each comprising many operations, into the RISC-V instruction format, which offers only 32 bits of encoding space.
The vector/matrix architecture registers are divided into source registers (rs registers) and destination registers (rd registers). Each rs register corresponds one-to-one to an rd register, and the two are used together; that is, once the index of the rd register is specified in an instruction, the rs register index is specified as well. Binding a pair of rs and rd registers together has three advantages:
• Only one register index needs to be specified to operate on two registers, which leaves much more encoding space in the GPGCN custom instructions for other information.
• It is consistent with the computational characteristics of the aggregation process.
• It has better scalability.

Custom CSR
There are two difficulties in designing macro instructions: one is to encode the many pieces of operation information an instruction requires in the limited RISC instruction encoding space, and the other is to preserve the degree of programming freedom. The first problem is addressed by providing auxiliary information that is common across instructions through custom CSRs (control and status registers), which reduces the instruction encoding pressure. Table 2 shows the address, the name, and the function description of the custom CSR registers of the GPGCN custom instruction set.

Table 2. Custom CSR registers of the GPGCN custom instruction set.

| No. | Address | Name | Description |
|---|---|---|---|
| 1 | 0x7c2 | Feature matrix base address for aggregation | The starting address of the feature matrix during aggregation calculation. |
| 2 | 0x7c3 | Feature vector length | The number of elements of the feature vector representing a vertex in the feature matrix. |
| 3 | 0x7c4 | Pre-add feature matrix base address | The starting address of another special feature matrix, which will be used when describing redundancy reduction techniques in the next chapter. |
| 4 | 0x7c5 | Result feature matrix base address | The starting address for storing the result matrix during aggregation calculation. |
| 5 | 0x7c6 | Feature matrix base address for combination | The starting address of the feature matrix during combination calculation. |
| 6 | 0x7c7 | Weight matrix base address | The starting address of the weight matrix during combination calculation. |
| 7 | 0x7c8 | Da | A specific dimension of the adjacent matrix and the feature matrix when the result matrix is evaluated in the combination calculation. It will be used when describing the matrix type instruction below. |
| … | … | … | The current configuration information of scratchpad memory, which will be described in detail when describing the configurable SCM hardware design in the next chapter. |
| 14 | 0x7cf | Matrix/vector mode | Indicates whether the GPGCN hardware is in the matrix instruction mode or the vector instruction mode. This CSR distinguishes the current mode in a hardware microarchitecture that integrates the vector and matrix registers and execution unit resources. |

Register Extension
The GPGCN instruction set architecture contains two types of register group: the vector register group and the matrix register group.

The ninth custom CSR register specifies the number of vector register pairs in the vector register group. Each pair of vector registers contains one vector rs register and one vector rd register, and each vector register contains 16 32-bit single-precision floating-point elements, as shown in Figure 5.

The tenth custom CSR register specifies the number of matrix registers in the matrix register group. Each set of matrix registers contains two vector rs registers, one matrix rs register, and one matrix rd register, as shown in Figure 6. Each vector rs register contains eight 32-bit single-precision floating-point elements. Each matrix rs/rd register contains 8 × 8 32-bit single-precision floating-point elements to support (8 × 1) × (1 × 8) outer product operations.

Eight vector rs/rd registers can be combined into one matrix rs/rd register, so that vector and matrix operations share register resources and hardware utilization improves. The GPGCN hardware microarchitecture in the next chapter is also designed in this way.

Instruction Extension
Based on the common characteristics of the aggregation and combination calculations in different GCNs, we designed four types of instruction extensions: vector-type instructions, matrix-type instructions, memory-access-related instructions, and special instructions, which correspond to aggregation, combination, memory access, and synchronization in the forward inference process of different GCNs.
RISC-V provides four customizable instruction encoding spaces: custom0, custom1, custom2, and custom3, as shown in Figure 7. The GPGCN custom instruction set is implemented in these encoding spaces.

Vector Instruction Extension
As Table 3 shows, the vector-type instructions are subdivided into three categories: the basic category, the fixed-rd category, and the fixed-rs category. The following describes the basic functions and design principles of the vector-type instructions in each of these three categories.

Basic Vector Instructions
The basic vector instruction is similar to the traditional SIMD instruction, and defines some basic vector load, store, add, and mov operations as the function complement of the fixed-rd and fixed-rs vector instructions.
Unlike the traditional SIMD instruction, the basic load/store instruction specifies the memory access address through the index stored in the RISC-V integer register. The index means the row coordinates of the feature vector to be accessed in the entire feature matrix.
The hardware calculates the final memory access address from custom CSR1 (base address of the feature matrix) and custom CSR2 (feature vector length) using Formula (3):
feature vector addr = feature_matrix_base_addr(CSR1) + idx * feature_vector_length(CSR2) (3)
Thus, a basic vector load/store instruction is equivalent to a combination of three traditional scalar instructions and one traditional SIMD instruction, as shown in Figure 8.
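The index-based addressing of the basic load instruction can be illustrated with a short Python model. It is a sketch under stated assumptions: memory is modeled as a flat list addressed in elements, the address is taken to be base + idx × vector length (the same mov/mul/add sequence the paper lists as the scalar equivalent), and the names `feature_vector_address`, `basic_loadvec`, and the dictionary keys are hypothetical.

```python
def feature_vector_address(base_addr, vector_length, idx):
    """Scalar part: the mov / mul / add sequence folded into one expression."""
    return base_addr + idx * vector_length


def basic_loadvec(memory, csr, idx):
    """Vector part: load one feature vector starting at the computed address."""
    addr = feature_vector_address(csr["feature_base"], csr["feature_len"], idx)
    return memory[addr:addr + csr["feature_len"]]


# Hypothetical setup: element-addressed memory, CSR1 = 10, CSR2 = 4.
memory = [float(i) for i in range(100)]
csr = {"feature_base": 10, "feature_len": 4}
vec = basic_loadvec(memory, csr, idx=3)
```

Software therefore only supplies the row index in an integer register; the base address and vector length are set once in the CSRs and reused by every subsequent instruction.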

Fixed-Rd Vector Instructions
The fixed-rd and fixed-rs instructions are designed according to GCNs network aggregation calculation characteristics. They characterize and complete the primary process of aggregation calculation in the GCNs network and provide a certain degree of programming freedom for different software schedule algorithms in the process of aggregation calculation.
The fixed-rd class of vector instructions represents the fixed vector rd calculation mode in the aggregation calculation. As shown in Figure 9, this mode represents the aggregation of the feature vectors of all the neighbors of one vertex. Because one vector rd is reused to hold the intermediate result of the aggregation, the feature vectors of the different neighbors are loaded one after another into the corresponding vector rs and accumulated until the final aggregation result of the vertex is obtained. Hence, the fixed-rd mode reuses the data (the intermediate aggregation results) in vector rd. For a vertex whose neighbors have indices 0, 2, and 5, the accumulation proceeds as follows:

Vector rd1 += feature vector of idx = 0
Vector rd1 += feature vector of idx = 2
Vector rd1 += feature vector of idx = 5

In the fixed-rd calculation mode, the data in the vector rd register are reused, whereas the data in the vector rs register are not, so there is no need to specify the index of the vector rs register. The vector rd is bound to the corresponding vector rs, which is also the theoretical basis for the paired definition of vector rs/rd described in the GPGCN custom instruction set architecture features above.
The instruction encoding then only needs to specify the index of vector rd, which frees a large part of the RISC-V instruction encoding space. The remaining encoding space can be used to fuse the loadvec instruction and the addvec instruction, which makes the GPGCN vector instructions coarser-grained, increases the information density of each instruction, and makes better use of the instruction bandwidth.
Therefore, a fixed-rd vector instruction is equivalent to the combined operation of multiple traditional scalar and traditional vector instructions. As shown in Figure 10, the load-rs-add-rd-vec8/16 instruction performs the following operations: it calculates the address of the specified feature vector from the index, loads the feature vector from memory at this address into the corresponding vector rs, and then adds vector rs to vector rd and stores the result in vector rd.

Even though the load-rs-add-rd-vec8/16 instruction already consists of multiple operations, binding vector rd and vector rs leaves additional free encoding space. This space can hold the index of another integer register, which stores the row index of another feature vector, so that one load-rs-add-rd-vec8/16 instruction can aggregate two feature vectors and the instruction density increases further, as shown in Figure 11.

By the same principle, for aggregation with weights, as in the GAT network, the extra free encoding space can hold the index of a floating-point register that stores the value of the weight. The calculation then includes one more floating-point operation, the multiplication of the vector by a scalar (the weight), as shown in Figure 12.

The fixed-rd vector instruction has the further advantage that its method of indexing the feature vector corresponds to the CSR sparse storage format, as shown in Figure 13.

Figure 13. The computation process of fixed-rd mode with CSR data format.
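The behavior of the fixed-rd instructions can be modeled in software as follows. This is a functional sketch only: the class `FixedRdModel`, the method name `load_rs_mul_add_rd` for the weighted variant, and the helper `aggregate_vertex` (including the way it clears vector rd) are hypothetical names introduced for illustration and are not taken from the GPGCN instruction tables.

```python
class FixedRdModel:
    """Functional model of paired vector rs/rd registers in fixed-rd mode."""

    def __init__(self, feature_matrix, num_pairs=8):
        self.H = feature_matrix
        width = len(feature_matrix[0])
        self.rs = [[0.0] * width for _ in range(num_pairs)]
        self.rd = [[0.0] * width for _ in range(num_pairs)]

    def _accumulate(self, r, idx, weight=1.0):
        # Load the feature vector into rs[r], then accumulate into the bound rd[r].
        self.rs[r] = list(self.H[idx])
        self.rd[r] = [d + weight * s for d, s in zip(self.rd[r], self.rs[r])]

    def load_rs_add_rd(self, r, idx1, idx2=None):
        """One instruction aggregates one or two feature vectors into rd[r]."""
        self._accumulate(r, idx1)
        if idx2 is not None:
            self._accumulate(r, idx2)

    def load_rs_mul_add_rd(self, r, idx, weight):
        """Weighted variant (e.g. for GAT): rd[r] += weight * H[idx]."""
        self._accumulate(r, idx, weight)


def aggregate_vertex(model, r, neighbors):
    """Walk one CSR row (the neighbor list) two indices per instruction."""
    model.rd[r] = [0.0] * len(model.rd[r])   # assumed clear of the accumulator
    for k in range(0, len(neighbors) - 1, 2):
        model.load_rs_add_rd(r, neighbors[k], neighbors[k + 1])
    if len(neighbors) % 2 == 1:
        model.load_rs_add_rd(r, neighbors[-1])
    return model.rd[r]


H = [[float(i), float(10 * i)] for i in range(6)]
model = FixedRdModel(H)
total = aggregate_vertex(model, 1, [0, 2, 5])   # neighbors 0, 2, and 5
```

Only the rd index `r` appears as a register operand in each call, which mirrors how the binding of rs to rd frees encoding space for the second row index or the weight.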

Fixed-Rs Vector Instructions
The fixed-rs vector instruction represents the fixed vector rs calculation mode in the aggregation calculation. Because a vertex may be a neighbor of multiple vertices, its feature vector is shared by the aggregation calculations of all of those vertices and is therefore reusable across them. The aggregation calculation can fix the feature vector of this vertex in the vector rs register, load the intermediate aggregation results of its different neighbors into vector rd one after another, and perform the aggregation calculation for each neighbor in turn, until every neighbor has used the feature vector of the vertex once, as shown in Figure 14. Therefore, the fixed-rs vector instruction can reuse the feature vector held in the vector rs register instead of reading it from memory or cache every time.

The fixed-rs vector instruction differs from the fixed-rd instruction in one respect: the vector rd register is not reused after the calculation, so its content must be stored back to the address from which it was loaded. This store needs no extra information in the encoding, so the fixed-rs instruction integrates the store operation directly, which further increases the instruction density, as shown in Figures 15 and 16.

load-rd-add-rs-store-rd-vec8/16 (idx) (csr4,csr2) is equivalent to:
Traditional scalar instructions:
mov rd1 base_address(in csr4)
mul rd2 idx vector_length(in csr2)
add rd3 rd1 rd2
Traditional vector instructions:
loadvec8/16 vector_rd rd3
addvec8/16 vector_rd vector_rd vector_rs
storevec8/16 vector_rd rd3
Figure 15. The load-rd-add-rs-store-rd-vec8/16 instruction equivalent.

Corresponding to the last advantage of the fixed-rd vector instructions, the fixed-rs instructions also have the advantage that the way they index the intermediate results of different vertices and load them into vector rd matches the CSC sparse storage format, as shown in Figure 17. For the vertex with index 8 in the figure, the instruction sequence is:

loadvec8/16 all_vector_rs (idx=8) (csr1,csr2)
load-rd-add-rs-store-rd-vec8/16 (idx=1) (csr4,csr2)
load-rd-add-rs-store-rd-vec8/16 (idx=2) (csr4,csr2)
load-rd-add-rs-store-rd-vec8/16 (idx=7) (csr4,csr2)
load-rd-add-rs-store-rd-vec8/16 (idx=8) (csr4,csr2)
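The fixed-rs mode can be sketched as a traversal of the adjacency matrix in CSC order. This is an illustrative model, not the hardware behavior in detail: the function name `aggregate_fixed_rs`, the toy graph, and the unweighted accumulation are assumptions for the example.

```python
def aggregate_fixed_rs(col_ptr, row_idx, H, result):
    """Accumulate A @ H into result, visiting A column by column (CSC order)."""
    for u in range(len(col_ptr) - 1):
        # loadvec: fix the feature vector of vertex u in vector rs.
        vector_rs = list(H[u])
        for k in range(col_ptr[u], col_ptr[u + 1]):
            v = row_idx[k]
            # load-rd-add-rs-store-rd: load the intermediate result of vertex v,
            # add the fixed rs, and store it back to where it came from.
            vector_rd = list(result[v])
            vector_rd = [d + s for d, s in zip(vector_rd, vector_rs)]
            result[v] = vector_rd
    return result


# Toy 3-vertex graph in CSC form (hypothetical data).
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = aggregate_fixed_rs(col_ptr, row_idx, H, [[0.0, 0.0] for _ in H])
```

Each feature vector is read from `H` exactly once, while the intermediate results are read and written once per edge, which is the opposite trade-off to the fixed-rd mode.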

Matrix Instruction Extension
The execution of the matrix-type instructions is closely related to the storage format of the matrix, so the storage format is introduced first. As shown in Figure 18, the input and output matrices of the combination operation of the GCNs are divided into blocks, and each block is addressed by its index number. For the input matrices, namely the feature matrix of size a × b and the weight matrix of size b × c, a block holds 8 × Da elements, where Da is specified by custom CSR7; for the result (output) matrix of size a × c, a block holds 8 × 8 elements.

Traditional matrix instructions, such as those of ARM's SME instruction set extension, take the outer product of two vectors (of lengths n and m, respectively) to produce an n × m result matrix; that is, one instruction completes an (n × 1) × (1 × m) = n × m vector outer product operation. In order to further increase the instruction density of the matrix instructions, the first matrix instruction in Table 4 takes the value of Da from the custom CSR7 register and multiplies a feature matrix block of size 8 × Da by a weight matrix block of size Da × 8, that is, a matrix multiplication of (8 × Da) × (Da × 8) = 8 × 8, as shown in Figure 19.

The (idx1) and (idx2) operands of this instruction specify the block index of the feature matrix and the weight matrix, respectively. The hardware uses them to calculate the starting address of the block data to be accessed using Formulas (4) and (5):
feature matrix block addr = feature_matrix_base_addr(CSR5) + idx1 * 8 * Da(CSR7) (4)
weight matrix block addr = weight_matrix_base_addr(CSR6) + idx2 * 8 * Da(CSR7) (5)

The load-outerproduct-add-8*8 matrix_rd (idx1) (idx2) instruction is executed as Da iterations of the following steps: load one column of the feature matrix block (eight elements), load the corresponding row of the weight matrix block (eight elements), perform the (8 × 1) × (1 × 8) outer product to obtain an 8 × 8 result matrix, add it to matrix rd, and store the sum in matrix rd, as shown in Figure 20.
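The decomposition of the block multiplication into Da outer products can be checked with a small Python model. It is a sketch with assumed names (`load_outerproduct_add`, `matmul_reference`) and made-up test data, intended only to show that accumulating Da outer products reproduces the (8 × Da) × (Da × 8) product.

```python
def load_outerproduct_add(feature_block, weight_block, matrix_rd):
    """Accumulate feature_block (8 x Da) times weight_block (Da x 8) into matrix_rd."""
    da = len(weight_block)
    for k in range(da):
        col = [feature_block[i][k] for i in range(8)]   # one column, 8 elements
        row = weight_block[k]                           # matching row, 8 elements
        for i in range(8):
            for j in range(8):
                matrix_rd[i][j] += col[i] * row[j]      # (8 x 1) x (1 x 8) outer product
    return matrix_rd


def matmul_reference(a, b):
    """Plain triple-loop matrix product used only for comparison."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


DA = 3  # hypothetical value of Da (custom CSR7)
feature_block = [[float(i + k) for k in range(DA)] for i in range(8)]
weight_block = [[float(k * j + 1) for j in range(8)] for k in range(DA)]
matrix_rd = [[0.0] * 8 for _ in range(8)]
load_outerproduct_add(feature_block, weight_block, matrix_rd)
```

Because the result stays in matrix rd across iterations, consecutive instructions on further blocks along the shared dimension can keep accumulating into the same 8 × 8 tile.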

Memory Access Extension
The scratchpad memory-related operation instructions in Table 5 comprise three instructions.
The preload instruction preloads the feature matrix or weight matrix to be accessed into the corresponding storage block of the scratchpad memory in advance.
The sync-preload instruction is used to indicate that the data in a block of scratchpad memory are no longer used and can be replaced.
The storescmback instruction is used to write a block of scratchpad memory back to the main memory. For example, after the fixed-rs instructions calculate the final result, the final result in scratchpad memory is written back to the main memory.

Fence Extension
Although the CPU fetches and dispatches the instructions of the GPGCN accelerator, the CPU does not know whether these instructions have finished executing in the accelerator, nor can it detect address dependencies between CPU load/store instructions and GPGCN memory access instructions. Therefore, the gpgcn-fence instruction in Table 6 is defined to synchronize the GPGCN instruction stream with the CPU instruction stream. The gpgcn-fence instruction is usually followed by a RISC-V fence instruction. Once all GPGCN instructions older than the gpgcn-fence have been executed, the gpgcn-fence instruction is executed and can be committed in the CPU reorder buffer; until then, it blocks at the commit head of the CPU reorder buffer and prevents the commit of subsequent CPU instructions. After the gpgcn-fence instruction is committed, the following RISC-V fence instruction can be committed, and the subsequent load/store memory access instructions can be executed. This completes the memory access synchronization between the GPGCN instructions and the RISC-V CPU instructions.

Overall Microarchitecture
The GPGCN hardware accelerator is coupled with the BOOM RISC-V CPU core through the RoCC interface, as shown in Figure 22. The GPGCN instructions are pushed to the GPGCN accelerator for execution through the BOOM pipeline.
As shown in Figure 23, the hardware microarchitecture of the GPGCN accelerator consists of two parts: fused VPU (vector process unit) and configurable VMU (vector memory unit).
The fused VPU combines the execution units of the vector instructions with those of the matrix instructions; that is, the execution unit array in the VPU can be configured either as N SIMD8 vector pipelines to execute vector instructions or as M 8 × 8 arrays to execute matrix instructions, which improves the utilization of the execution units. N is specified by the custom CSR9 register, and M is specified by the custom CSR10 register. In this implementation of the GPGCN hardware microarchitecture, considering the IPC and memory access bandwidth that the single-core CPU RoCC interface can provide, the parameters are set to N = 8 and M = 1.

The fused VPU includes the GPGCN custom instruction decoder, the dispatch queue, the vector instruction issue queue, the matrix instruction issue queue, and an execution unit array that can be configured as eight SIMD8 vector lane pipelines or one 8 × 8 matrix pipeline.
The VMU can be configured into three different modes according to the execution of different instruction streams: the matrix mode, which supports the memory access mode of matrix instructions, the fixed-rd mode that supports the memory access mode of fixed-rd instructions in vector instructions, and the fixed-rs mode that supports the memory access mode of fixed-rs instructions.

Redundant Computation Reduction
There are many hidden redundant calculations in the aggregation calculation process of the GCNs. As shown in the dotted line box in Figure 24, vertices 1 and 2 need to accumulate the feature vectors of vertices 7 and 8; vertices 2, 3, and 4 need to accumulate the feature vectors of vertices 2, 3, and 4; vertices 5 and 6 need to accumulate the feature vectors of vertices 5 and 6; and vertices 7 and 8 need to accumulate the feature vectors of vertices 1, 2 and vertices 7, 8.
In fact, these redundant accumulations only need to be calculated once. For example, the feature vectors of vertices 7 and 8 are added in advance, and the pre-addition result is then used directly when the aggregation of vertices 1, 2, 7, and 8 is calculated, which saves three vector addition operations and four vector load operations.

The scheme for redundant calculation reduction in this design targets the accumulation of two consecutive vertices in the feature matrix. First, the sum of the feature vectors of two consecutive rows in the feature matrix is precomputed and stored in the pre-add feature matrix address space (the address space specified by custom CSR3), as shown in Figure 25. Then, the hardware logic named converter in Figure 23 identifies the fixed-rd vector instruction load-rs-add-rd-vec8/16 vector_rd (idx1) (idx2) (CSR1,CSR2). The original execution of this instruction retrieves the feature vector in row idx1 of the feature matrix and accumulates it into the vector rd register, and then retrieves the feature vector in row idx2 and accumulates it into the vector rd register.
When the converter recognizes this instruction and judges that idx2 = idx1+1, the converter will convert load-rs-add-rd-vec8/16 vector_rd (idx1) (idx2) (CSR1,CSR2) instruction to load-rs-add-rd-vec8/16 vector_rd (idx) (CSR3,CSR2) instruction, where idx = idx1. This means that when two feature vectors that this instruction needs to accumulate are in two consecutive rows in the feature matrix, it only needs to retrieve and accumulate the feature vector specified by idx1 in the pre-add feature matrix to the vector rd register.
This way, the original two loads and two accumulation calculations are reduced to one load and one accumulation calculation.
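The conversion rule can be sketched in Python as follows. The sketch assumes that the pre-add matrix stores, for every row i, the sum of rows i and i + 1 of the feature matrix; the text only states that sums of two consecutive rows are precomputed, so this layout, along with the names `build_pre_add` and `execute_load_rs_add_rd`, is an assumption for illustration.

```python
def build_pre_add(H):
    """Assumed layout: pre_add[i] = H[i] + H[i + 1] for every valid i."""
    return [[a + b for a, b in zip(H[i], H[i + 1])] for i in range(len(H) - 1)]


def execute_load_rs_add_rd(rd, H, pre_add, idx1, idx2):
    """Return (new_rd, vector_loads) for one two-index fixed-rd instruction."""
    if idx2 == idx1 + 1:
        # Converter hit: a single load from the pre-add matrix replaces two loads.
        return [d + p for d, p in zip(rd, pre_add[idx1])], 1
    # No conversion: load and accumulate both feature vectors.
    tmp = [d + a for d, a in zip(rd, H[idx1])]
    return [t + b for t, b in zip(tmp, H[idx2])], 2


H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pre_add = build_pre_add(H)
hit_rd, hit_loads = execute_load_rs_add_rd([0.0, 0.0], H, pre_add, 1, 2)
miss_rd, miss_loads = execute_load_rs_add_rd([0.0, 0.0], H, pre_add, 0, 2)
```

How often the converter hits depends on how many neighbor pairs have consecutive indices, which in turn depends on the vertex ordering of the dataset.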

Memory Access Optimization
In the VMU configuration in fixed-rd mode, there is a module that is unavailable in other modes: load accumulate buffer.
Since the SCM in fixed-rd mode is configured as four blocks, each block can only provide four read ports with overlapping bank addresses, while in fixed-rd mode, eight vector lane pipelines may access the VMU at the same time. It is possible that at the same time, there are eight load requests to access the same block with only four read ports, so there must be some load requests that must wait until the next cycle to successfully access.
In the fixed-rd mode, different vector lane pipelines load feature vectors from the feature matrix, and may load the same feature vector simultaneously. There may be data locality between load requests of different vector lane pipelines, as Figure 26 shows.
The load accumulate buffer uses the data locality hidden between load requests in the fixed-rd mode and uses a certain memory access delay in exchange for the overall memory access bandwidth.
The schematic diagram of the load accumulate buffer is shown in Figure 27. It works as follows:
• It contains four queues, each of which corresponds to the read port of the corresponding bank of the SCM block.
• The eight load requests from the eight vector lane pipelines enter different queues for temporary storage according to the least significant two bits of their addresses.
• Each queue has gather logic, which checks whether the memory access addresses of up to n load requests at the head of the queue are equal, merges load requests with equal addresses into one load request, and then issues this load request to the read port of the corresponding bank of the SCM block.
• When the SCM block returns the read result of the load request, the result is returned to the vector lane pipelines of all the load requests that were merged. This process is called scatter.
The load accumulate buffer not only converts eight load requests into four load requests but, through the gather mechanism, also exploits the locality of the accessed data between load requests, so that the overall memory access bandwidth is not reduced.
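The gather and scatter steps listed above can be modeled with a small cycle-level simulation. This is a simplified sketch, not the hardware design: the function name `simulate_load_accumulate_buffer`, the parameter `window` (standing in for the "up to n" requests examined at the head of a queue), the dictionary-based memory, and the example addresses are all assumptions.

```python
from collections import deque


def simulate_load_accumulate_buffer(requests, memory, window=4):
    """requests: list of (lane, addr). Returns (data per lane, reads issued, cycles)."""
    queues = [deque() for _ in range(4)]
    for lane, addr in requests:
        queues[addr & 0b11].append((lane, addr))   # bank chosen by the low 2 address bits
    results, reads, cycles = {}, 0, 0
    while any(queues):
        cycles += 1
        for q in queues:                            # each bank has one read port
            if not q:
                continue
            head_addr = q[0][1]
            examined = [q.popleft() for _ in range(min(window, len(q)))]
            merged = [r for r in examined if r[1] == head_addr]   # gather
            rest = [r for r in examined if r[1] != head_addr]
            q.extendleft(reversed(rest))            # unmerged requests keep their order
            reads += 1
            data = memory[head_addr]
            for lane, _ in merged:                  # scatter to every merged lane
                results[lane] = data
    return results, reads, cycles


# Hypothetical example: eight lanes, several of them requesting the same address.
memory = {addr: addr * 10 for addr in range(16)}
requests = [(0, 4), (1, 4), (2, 4), (3, 4), (4, 5), (5, 5), (6, 8), (7, 7)]
results, reads, cycles = simulate_load_accumulate_buffer(requests, memory)
```

In this example the eight requests are served with four reads, because requests to the same address within a queue are merged before they reach the bank's read port.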

Evaluation
Experimental environment: The GPGCN hardware accelerator is designed and implemented in the Chisel language under the Chipyard [20] SoC integration framework. All performance data are obtained by cycle-accurate Verilog simulation, in which the behavior of DDR is simulated using the DRAMSim2 [21] model and Micron's DDR3 timing model.
All GCNs use a two-layer structure, the feature vector length of the hidden layer is set to 16, and the forward calculation of all GCNs uses the calculation order of combination first and then aggregation.
The software adaptation method of the GCNs network under the GPGCN accelerator is that each SIMD vector lane calculates the aggregation of one corresponding vertex. The information of the datasets used is shown in Table 1, and the parameter configuration of the hardware is shown in Table 7. Each layer of the GCNs is divided into combination and aggregation for comparison. Compared with the traditional CPU, the execution efficiency of the GPGCN accelerator is significantly improved across the calculation process:
• The acceleration ratio of aggregation is above 1300× for the Cora dataset and above 3300× for the Citeseer dataset.
• The acceleration ratio of the combination stage of the first-layer network is about 700× for the Cora dataset and about 1500× for the Citeseer dataset.
• The acceleration ratio of comb2 (the combination stage of the second-layer network) is about 10× for both the Cora dataset and the Citeseer dataset.
• The total acceleration ratio is about 1001× for the Cora dataset and 1937× for the Citeseer dataset.
The relatively small acceleration ratio of comb2 arises because the data in the combination stage of the second-layer network are relatively small: it is a dense matrix multiplication of (n × 16) × (16 × 8), and there is no sparsity to exploit for a significant gain in acceleration efficiency.
Compared with a CPU with a traditional vector extension, such as Hwacha [23], the computing resources of GPGCN are four times those of Hwacha, yet the acceleration ratio is far more than 4×, which indicates that the acceleration efficiency of GPGCN is much higher:
• The total acceleration ratio is about 267× for GCN with the Cora dataset and 460× for GCN with the Citeseer dataset.
Comparing the acceleration effects of different datasets under the same GCNs, the acceleration ratio of GPGCN on the Citeseer dataset is higher than that of the Cora dataset, because the Citeseer dataset has a higher sparsity that can be utilized, as shown in Table 1.
However, compared with dedicated accelerators such as HYGCN, the acceleration efficiency of the GPGCN accelerator is lower, because HYGCN uses far more computing resources (about 72×) and larger on-chip cache and off-chip main memory bandwidth (about 50×). On the other hand, the speedup ratio of HYGCN for the Citeseer dataset is lower than that for the Cora dataset, indicating that HYGCN does not exploit the sparsity of the dataset as fully as GPGCN does. In addition, because GPGCN has software programmability, it can accelerate the GAT network, which HYGCN does not support.

Conclusions and Future Works
In this work, we propose the concept of GPGCN and design the GPGCN custom instructions based on a RISC-V ISA extension. We then propose a general-purpose GCNs hardware accelerator based on these custom instructions, with several optimized designs such as redundant computation reduction and the load accumulate buffer.
The acceleration efficiency of the GPGCN accelerator based on RISC-V instruction extension is higher than that of CPU with traditional vector units. Compared with traditional CPU, GPGCN achieves above 1001× speedup for GCN with the Cora dataset and 1937× speedup for the Citeseer dataset. Compared with CPU with traditional vector units, GPGCN achieves above 267× speedup for GCN with the Cora dataset and 460× speedup for the Citeseer dataset.
Compared with dedicated accelerators, since GPGCN has better programmability and generality, it supports accelerating the GAT network, while HYGCN [1] does not support it.
Moreover, since GPGCN provides software programmability, our future work is to use the reorder algorithm to mine data locality in graph datasets to break the limitation of the memory wall and further improve the acceleration efficiency of the GPGCN accelerator.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author after publication. The data are not publicly available due to privacy or ethical restrictions.