Grapher: A Reconfigurable Graph Computing Accelerator with Optimized Processing Elements

Junyong Deng; Songtao Lu; Baoxiang Zhang; Yanting Jia

doi:10.3390/electronics13173464

The School of Electronic Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China

Electronics 2024, 13(17), 3464; https://doi.org/10.3390/electronics13173464

Abstract

In recent years, various graph computing architectures have been proposed to process graph data that represent complex dependencies between different objects in the world. The processing element (PE) designs in traditional graph computing accelerators are often optimized for specific graph algorithms or tasks, which limits their flexibility in processing different types of graph algorithms; in other cases, the parallel configurations supported by their PE arrays are inefficient. To achieve both flexibility and efficiency, this paper proposes Grapher, a reconfigurable graph computing accelerator based on an optimized PE array. Grapher efficiently supports multiple graph algorithms, enhances parallel computation, and improves adaptability and system performance through dynamic hardware resource configuration. To verify the performance of Grapher, six datasets from the Stanford Network Analysis Project (SNAP) database were selected for testing. Compared with the existing typical graph frameworks Ligra, Gemini, and GraphBIG, the processing time for the six datasets using the BFS, CC, and PR algorithms was reduced by up to 39.31%, 35.43%, and 27.67%, respectively. The energy efficiency was also improved by 1.8× compared to HitGraph and 4.7× compared to ThunderGP.

Keywords: graph computing accelerator; reconfigurable computing; processing element array; parallel configuration

1. Introduction

Graphs represent large datasets as vertices and edges and are used to model entity relationships in many application domains. In general, vertices represent entities, while edges represent relationships between entities. Academics and large technology companies such as Facebook, Google, and Microsoft have proposed different solutions to organize and analyze the increasingly popular large graphs [1], including the classical graph frameworks Ligra [2], Gemini [3], and GraphBIG [4]. Moreover, the size of these graphs is increasing rapidly, with potentially hundreds of billions of vertices and trillions of edges [5]. Due to the irregularity of graph accesses across different computing clusters, the lack of locality, and the inherently imbalanced load distribution [6], graph computing has become a pressing challenge. In the big data era, graph computing systems are becoming increasingly important for graph-based analysis [7,8].

Despite the impressive capabilities of CPUs and GPUs in graph processing [7,9,10], they still suffer from critical issues such as control and memory divergence, load imbalance, and excessive global memory access. Furthermore, both CPUs and GPUs tend to have high power consumption. Both industry and academia have proposed various application-specific integrated circuits (ASICs) for graph computation. However, ASICs are costly to build and have a high design complexity. They also exhibit poor flexibility, making it difficult to address the diverse requirements of graph computing applications.

For graph computing, architectural innovation is imperative [11,12,13,14,15,16,17,18,19,20]. Reconfigurable graph computing accelerators combine the flexibility of software computing with the efficiency of hardware computing, which provides better computational performance and mitigates the drawbacks of dedicated graph computing accelerators. They represent an effective approach to handling diverse inputs in graph computing. GraphPulse [11] is an event-driven graph processing accelerator that improves speed by optimizing event merging, prefetching, and stream scheduling. PolyGraph [12] discussed the value of flexibility in graph computing accelerators, identified a classification of key algorithmic variants, and was able to modularly integrate the specialized features of each variant. TuNao [15] is a reconfigurable graph computing processor that aims to improve flexibility by exploiting graph data locality and reducing off-chip memory accesses. DRGN [16] proposed a reconfigurable PE array composed of two heterogeneous PE units, namely an aggregation PE and an update PE, for aggregation and update operations under a unified architecture. Asiatici et al. [17] proposed a PE with efficient DMA data transmission and optimized data retrieval and write operations, enabling efficient processing of graph algorithms. HitGraph [18] achieved highly parallel execution by leveraging inter-partition and intra-partition parallelism to optimize algorithm execution.

However, the PE design of traditional graph computing accelerators is often optimized for specific graph algorithms or tasks, which limits their flexibility in processing different types of graph algorithms, and the parallel computing capability supported by some PE arrays can be inefficient. To address these issues, this paper proposes Grapher, a graph computing accelerator based on a reconfigurable PE array. Grapher can dynamically configure hardware resources to support different algorithms and enable parallel computation across multiple PEs, which enhances flexibility and resource utilization and significantly improves overall graph computing performance. It extracts the operations that are common to different algorithms and to different execution stages of the same algorithm. By analyzing different algorithms and data organization methods, a reconfigurable hardware circuit suitable for multiple algorithms is constructed.

The rest of the paper is organized as follows: Section 2 analyzes the preparation for Grapher including algorithm description, graph data organization, and accelerator architecture. Section 3 describes the proposed PE architecture. Section 4 describes the datapath configuration for different algorithms. Section 5 provides the experimental results and comparisons. Section 6 concludes this paper.

2. Preparation for Grapher

2.1. Graph Algorithms

Graph algorithms can be classified into three categories: Path Finding and Searching, Centrality Computation, and Community Detection. This paper selects one algorithm from each category: the Breadth First Search (BFS) algorithm from the Path Finding and Searching category, the PageRank (PR) algorithm from the Centrality Computation category, and the Connected Components (CC) algorithm from the Community Detection category. These algorithms are widely used or are the building blocks of other graph algorithms.

BFS [21] is an algorithm that calculates the minimum number of edges required to traverse from an initial vertex to all other vertices. The principle involves assigning a traversal label, layer, to each vertex, representing the number of edges needed to traverse from the current vertex to the initial root vertex. The label of the root vertex is set to 0, and iterations continue until all vertices have been processed, as shown in (1), where t is the layer of the vertex from which v is reached.

layer (v) = \min (layer (v), t + 1),

(1)
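The update in (1) can be illustrated in software with a level-synchronous traversal. The following is a minimal sketch, not the hardware implementation: it assumes the graph is given as an adjacency-list dictionary, that untraversed vertices hold an infinite layer, and that t is the layer of the current frontier. The function name `bfs_layers` is hypothetical.

```python
import math


def bfs_layers(adj, root):
    """Level-synchronous BFS applying layer(v) = min(layer(v), t + 1)."""
    # Assumption: math.inf marks a vertex that has not been traversed yet.
    layer = {v: math.inf for v in adj}
    layer[root] = 0
    frontier = [root]
    t = 0  # layer of the vertices in the current frontier
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                new_layer = min(layer[v], t + 1)
                if new_layer < layer[v]:
                    # The label changed, so v is expanded in the next iteration.
                    layer[v] = new_layer
                    next_frontier.append(v)
        frontier = next_frontier
        t += 1
    return layer


# Small example graph; vertex 4 is unreachable from the root.
adj = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: []}
layers = bfs_layers(adj, 0)
```

A vertex whose label is already smaller than t + 1 is left unchanged, which corresponds to the "already traversed" case.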

PR [22] is an algorithm for ranking the importance of web pages based on the linking relationships between them. Consider a directed graph containing n nodes with a transfer matrix M, in which the non-zero elements of each column have the same value and sum to 1. The PageRank of the directed graph is the limit vector R obtained by iterating (2). The inputs are the directed graph containing n nodes, the transfer matrix M, a damping factor d, an initial vector R₀, and a number of iterations t. The output is the PageRank vector R of the given directed graph. If R_t and R_t−1 are sufficiently close, or if the prescribed number of iterations has been reached, let R = R_t and stop the iteration.

R_{t} = d M R_{t - 1} + \frac{1 - d}{n},

(2)
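The iteration in (2) can be sketched in a few lines of software. This is an illustrative sketch under stated assumptions rather than the accelerator's implementation: the graph is given as out-link lists, each column of M holds 1/outdegree for the links of one source vertex, and vertices without out-links contribute nothing. The function name `pagerank` and its default parameters are hypothetical.

```python
def pagerank(out_links, d=0.85, max_iters=100, tol=1e-10):
    """Iterate R_t = d * M * R_{t-1} + (1 - d) / n until convergence."""
    n = len(out_links)
    rank = [1.0 / n] * n  # initial vector R_0 (assumed uniform)
    for _ in range(max_iters):
        new_rank = [(1.0 - d) / n] * n
        for src, targets in enumerate(out_links):
            if not targets:
                continue  # assumption: dangling vertices are ignored
            # Column src of M has the value 1/outdegree in each non-zero row.
            share = d * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:  # R_t and R_{t-1} are sufficiently close
            break
    return rank


# Three vertices forming a directed cycle: 0 -> 1 -> 2 -> 0.
ranks = pagerank([[1], [2], [0]])
```

For the symmetric cycle in the example, every vertex ends up with the same rank.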

CC [23] is an algorithm for finding connected components in an undirected graph, where each vertex in a connected component can be reached from any other vertex in that component, and no vertex in the component is connected to any vertex outside it. The CC algorithm is implemented using the BFS algorithm by starting the traversal from a given vertex.
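The BFS-based approach to CC can be sketched as repeated traversals, each started from a vertex that no earlier traversal has labelled. The sketch below is a software illustration only; the label value -1 for untraversed vertices and the function name `connected_components` are assumptions made for this example.

```python
from collections import deque


def connected_components(adj):
    """Label every vertex of an undirected graph with a component id."""
    label = {v: -1 for v in adj}  # assumption: -1 means "not yet traversed"
    component = 0
    for start in adj:
        if label[start] != -1:
            continue  # already belongs to an earlier component
        # Start a new BFS traversal from an untraversed vertex.
        label[start] = component
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if label[v] == -1:
                    label[v] = component
                    queue.append(v)
        component += 1  # the next traversal identifies a new component
    return label, component


adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
labels, count = connected_components(adj)
```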

2.2. Formats of Graph Data

Due to the sparsity of graph data, the primary graph data formats include Coordinate (COO), Compressed Sparse Column (CSC), Compressed Sparse Row (CSR), and Doubly Compressed Sparse Column (DCSC). The graph data compression format used in this paper is Compressed Sparse Column Independently (CSCI) [24,25,26]. Each entry of the compressed graph data in CSCI format contains three fields: ioc, index, and value. When ioc is “01”, index represents the column number and value represents the number of non-zero elements in the column; when ioc is “00”, index represents the row number and value represents the value of the corresponding non-zero element in the sparse adjacency matrix. In order to accommodate different algorithmic operations, improve the processing speed of hardware circuits, and reduce the storage space, this paper further modifies and adjusts the CSCI compression format.
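The CSCI entry layout described above can be illustrated with a small encoder. This is a sketch based only on the field description in this section; it assumes that columns are emitted in order, that empty columns produce no entries, and that entries are plain (ioc, index, value) tuples. The function name `encode_csci` is hypothetical, and the bit widths of the real format are not modelled.

```python
def encode_csci(matrix):
    """Encode a dense 0/non-zero matrix as a list of (ioc, index, value)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    stream = []
    for col in range(n_cols):
        entries = [(row, matrix[row][col])
                   for row in range(n_rows) if matrix[row][col] != 0]
        if not entries:
            continue  # assumption: empty columns are skipped
        # ioc "01": index is the column number, value is the non-zero count.
        stream.append(("01", col, len(entries)))
        # ioc "00": index is the row number, value is the matrix element.
        stream.extend(("00", row, value) for row, value in entries)
    return stream


matrix = [
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
]
csci = encode_csci(matrix)
```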

To enhance parallelism in circuit operations, data storage separates index data, adjacency data, and vertex attribute data into sixteen groups per row for storage. The “ioc” data from the CSCI format is no longer stored. Each index datum contains three parts: vertex number, starting address of its adjacent vertices, and the number of adjacent vertices. Adjacency data consist of adjacent vertex numbers and the weight between the index vertex and its adjacent vertex. Storage of the vertex attributes includes vertex number and vertex attribute.

According to the requirements of BFS and CC, graph data can be divided into three sets: the first set for index data, the second set for adjacency data, and the third set for vertex attribute data. For PR, the non-zero elements are stored in the adjacency data storage unit and the weights are stored in the vertex attribute storage unit. The corresponding data are read from the vertex attribute storage unit based on the non-zero vertices for computation.
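The three data sets used for BFS and CC can be sketched as follows. This is an illustration under assumptions, not the paper's memory map: index entries are (vertex number, starting address, number of adjacent vertices), adjacency entries are (adjacent vertex number, weight), attribute entries are (vertex number, attribute), and each set is split into rows of sixteen entries. Indexing by source vertex and the name `build_storage` are choices made for this example.

```python
def build_storage(num_vertices, edges, initial_attr=0, group=16):
    """Build index, adjacency, and vertex attribute data from an edge list."""
    neighbours = [[] for _ in range(num_vertices)]
    for src, dst, weight in edges:
        neighbours[src].append((dst, weight))

    index_data, adjacency_data = [], []
    for v in range(num_vertices):
        # The starting address is the current length of the adjacency data.
        index_data.append((v, len(adjacency_data), len(neighbours[v])))
        adjacency_data.extend(neighbours[v])
    attr_data = [(v, initial_attr) for v in range(num_vertices)]

    def rows(items):
        # Split a flat list into rows of `group` entries (sixteen by default).
        return [items[i:i + group] for i in range(0, len(items), group)]

    return rows(index_data), rows(adjacency_data), rows(attr_data)


edges = [(0, 1, 1), (0, 2, 1), (2, 1, 1)]
index_rows, adjacency_rows, attr_rows = build_storage(3, edges)
```

Grouping sixteen entries per row mirrors the sixteen-way storage described above, so that one row can in principle feed sixteen PEs at once.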

2.3. Architecture of Grapher

The overall architecture of Grapher is shown in Figure 1. Grapher mainly consists of six parts: data access unit, data storage unit, array input buffer, reconfigurable PE array, array output buffer, and controller.

Figure 1. Architecture of a graph computing accelerator.

The data access unit needs to determine the data access pattern based on the different graph algorithms being executed and the different execution stages of these algorithms. According to the diverse requirements of various graph algorithms, the data storage unit should be divided into three storage spaces, including index data, adjacent vertex data, and vertex attribute data.

The reconfigurable PE array performs the algorithmic computation for the different operators, with each PE capable of executing different operations by switching configuration contexts. The controller monitors the execution status of the entire circuit and, based on that status, generates control signals for the different circuits in their current and next stages. At the same time, it issues configuration contexts to the reconfigurable PE array and constructs the internal datapaths of the PEs and the interconnections between PEs according to the execution processes of different algorithms and the different computational stages of the same algorithm. The array output buffer stores the output data from the reconfigurable PE array.

3. Design of Reconfigurable PE

To improve the speed of computation, multiple PEs are utilized to perform parallel data processing, allowing several PEs to simultaneously execute complex operations under corresponding configurations. Within the PE array, adjacent PEs are interconnected to transmit intermediate data and final results. When data accumulation is required, the controller first sequentially modifies the configuration context of the 16 PEs so that data are transmitted from the leftmost and rightmost columns to the middle two columns. Subsequently, the configuration context of the PEs that still need to transmit data is modified, and this process continues layer by layer until the last PE is reached, completing the data accumulation.
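The layer-by-layer accumulation can be sketched for a 4 × 4 arrangement of the 16 PEs. Only the first step (outer columns into the middle two columns) is stated in the text; the 4 × 4 layout, the merge of the two middle columns, the top-to-bottom fold, and the name `accumulate_4x4` are assumptions made for illustration.

```python
def accumulate_4x4(values):
    """Reduce a 4x4 grid of partial results to a single sum."""
    grid = [row[:] for row in values]  # do not modify the caller's data

    # Step 1 (from the text): the leftmost and rightmost columns send their
    # data to the middle two columns.
    for r in range(4):
        grid[r][1] += grid[r][0]
        grid[r][2] += grid[r][3]

    # Step 2 (assumed): one middle column sends its data to the other.
    for r in range(4):
        grid[r][1] += grid[r][2]

    # Step 3 (assumed): the remaining column is folded PE by PE until the
    # last PE holds the complete sum.
    for r in range(1, 4):
        grid[r][1] += grid[r - 1][1]
    return grid[3][1]


partials = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
total = accumulate_4x4(partials)
```

Each step only moves data between adjacent PEs, which matches the adjacent interconnection described above.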

Figure 2 shows the micro-architecture of a PE, with its upper half detailing the configuration context. The configuration context is stored within the accelerator and is dispatched by the controller during runtime to switch the datapaths of the PE array, thereby altering its functionality. The configuration context is divided into three parts: conf0, conf1, and conf2.

Figure 2. The micro-architecture of PE.

conf0 controls the source of input data. Through MUX0, the input data can be received from any of the four adjacent PEs located in the north, south, east, or west directions, or directly from an external input.

conf1 is responsible for constructing the datapath. The controller configures different datapaths based on the different algorithms being executed or the different stages of the same algorithm. These constructed datapaths can also share the same hardware resources.

conf2 controls the destination of the output data, similar to the input data source. Through MUX9, the output data can be sent to any of the four adjacent PEs in the north, south, east, or west directions, as well as to external resources. There are three types of output data: the vertex layer number during the execution of the BFS algorithm, the connected component identifier during the execution of the CC algorithm, and the result of vector operations during the PR algorithm execution.

The lower half of Figure 2 illustrates the internal circuitry of the PE, including basic functional units such as comparators, the integer adder add0, the fixed-point adders add1 and add2, and the fixed-point multipliers mul0 and mul1. According to the configuration context, these units perform the various functions required by the algorithms, such as layer traversal judgment, connected component determination, and data computation.
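The three-part configuration context can be modelled as a small data structure. The sketch below is a behavioural illustration only: the field names conf0, conf1, and conf2 come from the text, while the enumeration values, the class `ConfigContext`, and the helper functions are hypothetical and do not reflect the actual bit encoding.

```python
from dataclasses import dataclass
from enum import Enum


class Port(Enum):
    NORTH = "north"
    SOUTH = "south"
    EAST = "east"
    WEST = "west"
    EXTERNAL = "external"


class Datapath(Enum):
    BFS = "bfs"
    CC_STAGE_1 = "cc_stage_1"
    CC_STAGE_2 = "cc_stage_2"
    PR_STAGE_1 = "pr_stage_1"
    PR_STAGE_2 = "pr_stage_2"


@dataclass(frozen=True)
class ConfigContext:
    conf0: Port      # source of the input data (selected through MUX0)
    conf1: Datapath  # internal datapath to construct
    conf2: Port      # destination of the output data (selected through MUX9)


def select_input(context, inputs):
    """Model of MUX0: pick the input from the port named by conf0."""
    return inputs[context.conf0]


def route_output(context, value):
    """Model of MUX9: send the value to the port named by conf2."""
    return {context.conf2: value}


context = ConfigContext(Port.WEST, Datapath.PR_STAGE_2, Port.EAST)
operand = select_input(context, {Port.WEST: 7, Port.EXTERNAL: 3})
routed = route_output(context, operand + 1)
```

Switching the context changes which neighbour a PE listens to and where it sends its result, without changing the PE itself.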

4. Datapath Configuration

4.1. Datapath of BFS Algorithm

In Figure 3, the black portion depicts the datapath used during the execution of the BFS algorithm, and the gray portion is the inactive circuitry. The process begins with selecting input data based on the configuration context conf0. Subsequently, following the instructions of conf1, the input layer data and high-level signals are distributed through DeMUX2 to the comparators. The purpose of this comparison is to determine whether a vertex has been traversed. If the comparison reveals that the vertex has been traversed, its vertex layer attribute remains unchanged. If the vertex has not been traversed, the result generated by the comparator controls MUX5 to output a valid signal. This valid signal then triggers the controller to raise the signal ch_y to change the vertex layer attribute. This change allows MUX7 to increment the vertex attribute by 1 in the adder add0, marking the vertex as visited. The modified vertex attribute is then stored in the reg for output. If the layer does not change, MUX7 selects the number 0 as the input signal for the adder. Finally, the processed data are output through MUX8 and directed as needed by DeMUX3.

Figure 3. Datapath configuration of BFS algorithm.

4.2. Datapath of CC Algorithm

In Figure 4, the black portion represents the datapath for executing the CC algorithm, and the gray portion is the inactive circuitry. The core of this algorithm is to traverse connected components, which is accomplished using the BFS algorithm. Therefore, the datapath for executing the CC algorithm is similar to that used for the BFS algorithm, with subtle differences. During the traversal process, when the algorithm identifies a new connected component, the controller drives a signal called ch_c, which is passed to the output of MUX6. The output of MUX6 then controls MUX7 so that the connected component identifier is incremented by 1.

Figure 4. Datapath configuration of CC algorithm stage I.

When executing the CC algorithm to find the next connected component in the graph, the algorithm needs to determine where to start the new traversal, i.e., to traverse all vertex attributes, look for vertices that have not yet been visited, and use them as the starting vertex for the next round of traversal. The datapath for this process is shown in Figure 5. The comparator is used to determine whether a vertex has been traversed. When an untraversed vertex is found, the comparison result outputs an enable signal, allowing the vertex attribute to be output as the starting vertex for the next round of traversal.

Figure 5. Datapath configuration of CC algorithm stage II.

4.3. Datapath of PR Algorithm

In Figure 6, the black portion represents the datapath used during the execution of the PR algorithm, and the gray portion is the inactive circuitry. When the PR algorithm starts executing, the input data are selected based on the configuration of conf1, determining the number of rows and a set of vector data. These data are then fed into mul0 for multiplication. The resulting product is processed by an accumulator and stored temporarily in reg. After processing a set of data, 16 intermediate results are generated and stored in the 16 PEs. Subsequently, these intermediate results undergo an accumulation process, and the datapath for this process is illustrated in Figure 7. During the accumulation stage, MUX0 selects the intermediate results sent from adjacent PEs as the input data for the next round. These intermediate results are then passed to the adder add1, where they are summed with the current intermediate result calculated by the PE. Through this process, data flow and accumulate between PEs, ultimately completing the accumulation of intermediate results for the entire PR algorithm.

Figure 6. Datapath configuration of PR algorithm stage I.

Figure 7. Datapath configuration of PR algorithm stage II.

There are two output paths following this stage. One path directly sends the result to the adjacent PE, which further performs summation through MUX8. The other path involves multiplying the result by coefficients and adding a fixed value based on the PR algorithm formula. This operation only occurs in the last PE during the summation process. After this operation is completed, the computation for the set of vectors is fully finished.
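The two PR stages can be summarised with a behavioural sketch of one row update. It is a software illustration under assumptions and not a description of the circuit timing: the elements of a matrix row are assumed to be interleaved across 16 PEs, stage I multiplies and accumulates locally, stage II sums the 16 intermediate results, and the last PE applies the coefficient d and the fixed value (1 - d)/n from (2). The name `pr_row_update` is hypothetical.

```python
def pr_row_update(row_values, rank_values, d, n, num_pes=16):
    """Compute one element of R_t for a single row of the transfer matrix."""
    # Stage I: each PE multiplies its operands (mul0) and accumulates the
    # products into its own intermediate result (reg).
    partial = [0.0] * num_pes
    for i, (m, r) in enumerate(zip(row_values, rank_values)):
        partial[i % num_pes] += m * r  # assumption: round-robin distribution

    # Stage II: intermediate results are passed between PEs and summed (add1).
    total = 0.0
    for value in partial:
        total += value

    # Last PE only: multiply by the coefficient and add the fixed value.
    return d * total + (1.0 - d) / n


row = [0.0, 0.5, 1.0, 0.0]
rank = [0.25, 0.25, 0.25, 0.25]
new_value = pr_row_update(row, rank, d=0.85, n=4)
```

With the example inputs, the row-vector product is 0.375, so the updated value is 0.85 × 0.375 + 0.15 / 4.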

5. Results

In this section, a prototype system is designed for Grapher based on the AMD Virtex UltraScale+ FPGA VCU118 evaluation kit. Different datasets from the Stanford Network Analysis Project (SNAP) [27] database were selected to test Grapher. The run time for six datasets was compared with the selected Ligra, Gemini, and GraphBIG frameworks and an analysis was conducted on the resource utilization, power consumption, throughput and energy efficiency in comparison to other graph frameworks or accelerators.

5.1. Grapher Testbench

In order to validate the performance of the Grapher accelerator designed in this paper, six test datasets were selected. Before testing, the original graph data are first converted to the CSCI data format required by the different algorithms using C code; then, the CSCI data are organized according to the data organization in Section 2.2; finally, the graph data are stored in on-chip RAM through Vivado synthesis, and the synthesized bitstream is programmed to the development board. The hardware test setup comprises the on-chip accelerator and an external control interface, which controls the enable and reset signals of the accelerator and reports the end flag, as shown in Figure 8.

Figure 8. Grapher testbench. (PC (Personal Computer), FPGA (Field-Programmable Gate Array)).

Users first convert the graph data into the CSCI compression format supported by the accelerator. Subsequently, they perform data extraction to separate index data and adjacent vertex data from the CSCI compression format. Then, based on the number of vertices in the graph data, users generate corresponding initial vertex attributes. At the same time, users store these index data, adjacent vertex data, and vertex attribute data according to the data storage method described in Section 2.2. Through this series of operations, users successfully convert the graph data into a format that the accelerator can efficiently process, preparing it for subsequent computation and analysis.

5.2. Graph Data Selection

Details of the datasets, including the number of vertices and the number of edges, are shown in Table 1. The performance of the Grapher accelerator is tested with these six datasets.

Table 1. Datasets selected for testing.

The graph datasets are characterized mainly by whether they are directed or undirected, whether they are weighted or unweighted, and by their connectivity. In this paper, SE is a directed unweighted graph, FB is an undirected unweighted graph, WT is a directed unweighted graph, EE is a directed single-connected unweighted graph, PA is an undirected unweighted graph, and TX is an undirected unweighted graph.

5.3. Comparison of Runtime

This paper compares the runtime of the three algorithms on Grapher with the runtime of the existing typical graph frameworks Ligra [2], Gemini [3], and GraphBIG [4] when processing the same datasets. The overall runtime of each algorithm is averaged over the number of edges. For the three graph frameworks, only the algorithm runtime is counted; the graph data loading time is not included in the overall execution time. Before processing, the original graph data need to be converted into the formats required by the three graph frameworks and by the accelerator designed in this paper. The time taken for each format conversion is approximately the same, and this time is also not included in the algorithm’s execution time.
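The per-edge normalization and the percentage reductions reported in this section can be written out explicitly. The notation below is introduced here for illustration and is an assumption: T_alg denotes the measured algorithm runtime, |E| the number of edges in the dataset, and the reduction is taken relative to the framework being compared.

```latex
t_{\text{edge}} = \frac{T_{\text{alg}}}{|E|},
\qquad
\text{reduction} =
\frac{t_{\text{edge}}^{\text{framework}} - t_{\text{edge}}^{\text{Grapher}}}
     {t_{\text{edge}}^{\text{framework}}} \times 100\%
```

Under this reading, a reduction of 39.31% relative to Ligra would mean that Grapher's per-edge time is 60.69% of Ligra's on that dataset.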

The comparison of BFS execution time between the three graph frameworks and Grapher is shown in Figure 9. Across the six datasets, Grapher reduces the execution time by 24.23% to 39.31% relative to Ligra, 14.15% to 23.01% relative to Gemini, and 17.44% to 30.93% relative to GraphBIG.

Figure 9. Comparison of execution time for the BFS algorithm.

The comparison of CC execution time between the three graph frameworks and Grapher is shown in Figure 10. Grapher reduces the execution time by 20.24% to 35.43% relative to Ligra, 8.30% to 18.03% relative to Gemini, and 13.74% to 26.97% relative to GraphBIG.

Figure 10. Comparison of execution time for the CC algorithm.

The comparison of PR execution time between the three graph frameworks and Grapher is shown in Figure 11. Grapher reduces the execution time by 17.07% to 27.67% relative to Ligra, 5.95% to 22.26% relative to Gemini, and 11.56% to 24.79% relative to GraphBIG.

Figure 11. Comparison of execution time for the PR algorithm.

5.4. Comparison of Resource Utilization and Performance

This section analyzes the hardware resource usage of the Grapher accelerator on the VU9P FPGA. The resource usage of Grapher after synthesis of the accelerator circuits is shown in Table 2. Grapher consumes 42,191 LUTs, 21,316 FFs, and 1546 BRAMs, corresponding to 3.57%, 0.90%, and 71.6% of the on-chip resources of the VU9P, respectively. Overall, the accelerator occupies only a small share of the LUT and FF resources. Table 3 shows the FPGA platform, clock frequency, power consumption, throughput, and energy efficiency of Grapher, the accelerator in [17], HitGraph [18], ThunderGP [28], and ForeGraph [29].

Table 2. Resource utilization of Grapher.
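The utilization percentages quoted in this section can be checked with simple arithmetic. The device totals below are not given in the paper; they are assumed from the publicly listed XCVU9P resource counts (1,182,240 LUTs, 2,364,480 flip-flops, and 2,160 block RAMs) and should be verified against the device data sheet.

```python
# Assumption: total resources of the VU9P device, taken from public
# device tables rather than from the paper.
VU9P_TOTALS = {"LUT": 1_182_240, "FF": 2_364_480, "BRAM": 2_160}

# Resource consumption of Grapher as reported in the text.
GRAPHER_USED = {"LUT": 42_191, "FF": 21_316, "BRAM": 1_546}

utilization = {
    name: 100.0 * GRAPHER_USED[name] / VU9P_TOTALS[name]
    for name in GRAPHER_USED
}

for name, percent in utilization.items():
    print(f"{name}: {percent:.2f}%")
```

With these assumed totals, the computed values round to the 3.57%, 0.90%, and 71.6% reported above.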

Table 3. Comparison of hardware resource consumption and performance.

Power consumption is an important metric for evaluating the performance of graph accelerators. Increased power consumption can lead to higher heat generation, which can affect device stability and lifespan. When Grapher is compared with existing accelerators or graph frameworks such as ThunderGP, HitGraph, and the accelerator in [17], ThunderGP and the accelerator in [17] exhibit relatively high power consumption, while Grapher and HitGraph have lower power consumption. Moreover, Grapher’s power consumption is significantly lower than that of the other two accelerators.

Graph computing typically involves large-scale datasets and complex computational operations, making high throughput crucial for accelerators. Meanwhile, high energy efficiency enables graph computing accelerators to consume less energy for the same workload. Therefore, both throughput and energy efficiency are essential for graph computing accelerators. Compared to existing accelerators or frameworks such as the accelerator in [17], HitGraph, ThunderGP, and ForeGraph, Grapher’s throughput falls within the range spanned by the four, but its energy efficiency far surpasses theirs. The energy efficiency is improved by 1.8× compared to HitGraph, 4.7× compared to ThunderGP, and 7.2× compared to the accelerator in [17]. Furthermore, compared to systems integrating an FPGA with a CPU, such as GraphScale and that in [30], Grapher still provides superior performance.
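The relationship between the metrics compared here can be stated compactly. The definition below is the usual one for accelerators and is an assumption, since the paper does not spell it out; the units of throughput and power are those listed in Table 3.

```latex
\text{energy efficiency} = \frac{\text{throughput}}{\text{power}},
\qquad
\text{improvement} =
\frac{\text{energy efficiency}_{\text{Grapher}}}
     {\text{energy efficiency}_{\text{baseline}}}
```

Under this definition, an accelerator with moderate throughput can still achieve the highest energy efficiency if its power consumption is sufficiently low, which is consistent with the comparison above.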

6. Conclusions

This paper proposes a graph computing accelerator based on a reconfigurable PE array that supports multiple algorithms, which greatly improves hardware utilization and provides a flexible hardware platform. Through extensive experiments with different algorithms and datasets, the superiority of Grapher has been verified. The work also lays the foundation for future research, in which several challenging aspects remain. With regard to array size, the performance of Grapher can be improved by expanding the PE array, which can largely avoid performance bottlenecks caused by high-dimensional graph features and improve architectural parallelism.

Author Contributions

Conceptualization, J.D. and S.L.; methodology, S.L.; software, B.Z. and Y.J.; validation, S.L.; formal analysis, S.L.; investigation, S.L.; resources, J.D.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, J.D. and S.L.; visualization, S.L.; supervision, J.D.; project administration, J.D.; funding acquisition, J.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Major Project (No. 2022ZD0119001), National Natural Science Foundation of China (No. 61834005), Shaanxi Key Research and Development Project (No. 2022GY-027), and the Key Scientific Research Project of Shaanxi Department of Education (No. 22JY060).

Data Availability Statement

Publicly available datasets were analyzed in this study. The SNAP datasets can be found here: https://snap.stanford.edu/ (accessed on 5 March 2022).

Acknowledgments

We would like to thank all reviewers for their helpful comments and suggestions regarding this paper.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Sakr, S.; Bonifati, A.; Voigt, H. The future is big graphs: A community view on graph processing systems. Commun. ACM 2021, 64, 62–71.
2. Shun, J.; Blelloch, G.E. Ligra: A lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Shenzhen, China, 23 February 2013; pp. 135–146.
3. Zhu, X.; Chen, W.; Zheng, W. Gemini: A Computation-Centric distributed graph processing system. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2–4 November 2016; pp. 301–316.
4. Nai, L.; Xia, Y.; Tanase, I.G.; Kim, H.; Lin, C. GraphBIG: Understanding graph computing in the context of industrial solutions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Austin, TX, USA, 15 November 2015; pp. 1–12.
5. Ching, A.; Edunov, S.; Kabiljo, M.; Logothetis, D.; Munthuktishnan, S. One trillion edges: Graph processing at facebook-scale. PVLDB 2015, 8, 1804–1815.
6. Jo, Y.Y.; Jang, M.H.; Kim, S.W.; Park, S. Realgraph: A graph engine leveraging the power-law distribution of real-world graphs. In Proceedings of the World Wide Web Conference, New York, NY, USA, 13 May 2019; pp. 807–817.
7. Segura, A.; Arnau, J.M.; González, A. SCU: A GPU stream compaction unit for graph processing. In Proceedings of the 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), Phoenix, AZ, USA, 22 June 2019; pp. 424–435.
8. Deng, J.; Wu, Q.; Wu, X.; Song, S.; Dean, J.; John, L.K. Demystifying graph processing frameworks and benchmarks. Sci. China Inf. Sci. 2022, 63, 229101.
9. Brahmakshatriya, A.; Zhang, Y.; Hong, C.; Kamil, S.; Shun, J.; Amarasinghe, S. Compiling graph applications for GPUs with GraphIt. In Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Seoul, Republic of Korea, 27 February 2021; pp. 248–261.
10. He, L.; Liu, C.; Wang, Y.; Liang, S.; Li, H.; Li, X. Gcim: A near-data processing accelerator for graph construction. In Proceedings of the 2021 58th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 5–9 December 2021; pp. 205–210.
11. Rahman, S.; Abu-Ghazaleh, N.; Gupta, R. Graphpulse: An event-driven hardware accelerator for asynchronous graph processing. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, 17–21 October 2020; pp. 908–921.
12. Dadu, V.; Liu, S.; Nowatzki, T. Polygraph: Exposing the value of flexibility for graph processing accelerators. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 595–608.
13. Dann, J.; Ritter, D.; Fröning, H. GraphScale: Scalable processing on FPGAs for HBM and large graphs. Proc. ACM Trans. Reconfigurable Technol. Syst. 2024, 17, 1–23.
14. Hu, Y.; Du, Y.; Ustun, E.; Zhang, Z. GraphLily: Accelerating graph linear algebra on HBM-Equipped FPGAs. In Proceedings of the 2021 IEEE/ACM International Conference on Computer Aided Design (ICCAD), Munich, Germany, 1–4 November 2021; pp. 1–9.
15. Zhou, J.; Liu, S.; Guo, Q.; Zhou, X.; Zhi, T.; Liu, D.; Wang, C.; Zhou, X.; Chen, Y.; Chen, T. Tunao: A high-performance and energy-efficient reconfigurable accelerator for graph processing. In Proceedings of the 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Madrid, Spain, 14–17 May 2017; pp. 731–734.
16. Yang, C.; Huo, K.B.; Geng, L.F. DRGN: A dynamically reconfigurable accelerator for graph neural networks. J. AMB Intel. Hum. Comp. 2023, 14, 8985–9000.
17. Asiatici, M.; Ienne, P. Large-scale graph processing on FPGAs with caches for thousands of simultaneous misses. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 609–622.
18. Zhou, S.; Kannan, R.; Prasanna, V.K.; Seetharaman, G.; Wu, Q. Hitgraph: High-throughput graph processing framework on fpga. IEEE TPDS 2019, 30, 2249–2264.
19. Liang, S.; Wang, Y.; Liu, C.; He, L.; Li, H.; Xu, D.; Li, X. Engn: A high-throughput and energy-efficient accelerator for large graph neural networks. IEEE Trans. Comput. 2020, 70, 1511–1525.
20. Gepner, P.; Kocot, B.; Paprzycki, M.; Ganzha, M.; Moroz, L.; Olas, T. Performance Evaluation of Parallel Graphs Algorithms Utilizing Graphcore IPU. Electronics 2024, 13, 2011.
21. Bundy, A.; Wallen, L. Breadth-first search. In Catalogue of Artificial Intelligence Tools; Springer: Berlin/Heidelberg, Germany, 1984; p. 13.
22. Ma, N.; Guan, J.; Zhao, Y. Bringing PageRank to the citation analysis. Inf. Process. Manag. 2008, 44, 800–810.
23. Di Stefano, L.; Bulgarelli, A. A simple and efficient connected components labeling algorithm. In Proceedings of the 10th International Conference on Image Analysis and Processing, Venice, Italy, 27–29 September 1999; pp. 322–327.
24. Deng, J.; John, L.K.; Song, S. A Graph Data Compression Method for Graph Computing Accelerator and Graph Computing. Chinese Patent CN201910107925.9, 21 June 2019. (In Chinese)
25. Deng, J.; John, L.K.; Song, S. A Parallel Graph Computing Accelerator Structure. Chinese Patent CN201910107937.1, 28 June 2019. (In Chinese)
26. Ren, H.; Deng, J. Characterization analysis of the impact of graph data compression format on breadth-first search algorithm. J. Zhengzhou Univ. (Nat. Sci. Ed.) 2021, 53, 26–33. (In Chinese)
27. Leskovec, J.; Sosič, R. Stanford network analysis platform. ACM TIST 2016, 8, 1–20.
28. Chen, X.; Cheng, F.; Tan, H.; Chen, Y.; He, B.; Wong, W.; Chen, D. ThunderGP: Resource-efficient graph processing framework on FPGAs with hls. ACM Trans. Reconfig. Technol. Syst. 2022, 15, 1–31.
Dai, G.; Huang, T.; Chi, Y.; Xu, N.; Wang, Y.; Yang, H. ForeGraph: Exploring large-scale graph processing on multi-FPGA architecture. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22 February 2017; pp. 217–226. [Google Scholar]
O’Brien, F.; Agostini, M.; Abdelrahman, T.S. A streaming accelerator for heterogeneous CPU-FPGA processing of graph applications. In Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Portland, OR, USA, 11 February 2021; pp. 26–35. [Google Scholar]

Figure 1. Architecture of a graph computing accelerator.

Figure 2. The micro-architecture of PE.

Figure 3. Datapath configuration of BFS algorithm.

Figure 4. Datapath configuration of CC algorithm stage I.

Figure 5. Datapath configuration of CC algorithm stage II.

Figure 6. Datapath configuration of PR algorithm stage I.

Figure 7. Datapath configuration of PR algorithm stage II.

Figure 8. Grapher testbench (PC: personal computer; FPGA: field-programmable gate array).

Figure 9. Comparison of execution time for the BFS algorithm.

Figure 10. Comparison of execution time for the CC algorithm.

Figure 11. Comparison of execution time for the PR algorithm.

Table 1. Datasets selected for testing.

Graph Data Type	Datasets	Number of Vertices	Number of Edges
Social networks	Soc-Epinions1 (SE)	75,879	508,837
Social networks	ego-Facebook (FB)	4039	88,234
Communication networks	wiki-Talk (WT)	2,394,385	5,021,410
Communication networks	email-Enron (EE)	36,692	420,045
Road networks	roadNet-PA (PA)	1,088,092	1,541,898
Road networks	roadNet-TX (TX)	1,379,917	1,921,660

Table 2. Resource utilization of Grapher.

Logic Device	Resource Consumption	Resources Available	Resource Utilization
LUT	42,191	1,182,240	3.57%
FF	21,316	2,364,480	0.90%
BRAM	1546	2160	71.6%
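
The utilization column of Table 2 is the ratio of consumed to available resources. A minimal Python sketch of that arithmetic follows, using only the figures from the table; the names `RESOURCES`, `utilization_percent`, and `utilization_table` are hypothetical and are not part of the Grapher design flow.

```python
# Resource counts copied from Table 2: (consumed, available).
# All identifiers here are illustrative assumptions, not the authors' tooling.
RESOURCES = {
    "LUT": (42_191, 1_182_240),
    "FF": (21_316, 2_364_480),
    "BRAM": (1_546, 2_160),
}


def utilization_percent(used: int, available: int) -> float:
    """Return resource utilization as a percentage of the available total."""
    if available <= 0:
        raise ValueError("available must be positive")
    return 100.0 * used / available


def utilization_table(resources: dict) -> dict:
    """Map each logic device to its utilization percentage."""
    return {
        name: utilization_percent(used, available)
        for name, (used, available) in resources.items()
    }


if __name__ == "__main__":
    for name, pct in utilization_table(RESOURCES).items():
        print(f"{name}: {pct:.2f}%")
```

Rounded to the precision used in the table, the three ratios agree with the reported 3.57%, 0.90%, and 71.6%.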

Table 3. Comparison of hardware resource consumption and performance.

Accelerator	Logic Device	FPGA	Resource Consumption	Clock Frequency (MHz)	Power (W)	Throughput (MTEPS)	Energy Efficiency (MTEPS/W)
Grapher	LUT	XCVU9P	3.57%	150	3.988	2265	568.0
	FF		0.90%
	BRAM		71.6%
Hitgraph	LUT	XCVU5P	68.1%	200	10.7	3410	318.7
	FF		26.1%
	BRAM		9.2%
ThunderGP	LUT	XCVU9P	84.0%	241	46	5510	119.8
	FF		-
	BRAM		66.0%
ForeGraph	LUT	XCVU9P	31.2%	200	-	1846	-
	FF		-
	BRAM		89.4%
[17]	LUT	UltraScale+	75.0%	227	23	1800	78.3
	FF		42.0%
	BRAM		39.0%
GraphScale	LUT	PAC D5005	19.0%	192	-	1510	-
	FF		-
	BRAM		40.0%
[30]	LUT	Arria 10 GX1150	-	291	-	1492	-
	FF		-
	BRAM		-
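
The energy-efficiency column of Table 3 is consistent with throughput divided by power. The Python sketch below reproduces that column and the improvement ratios quoted in the abstract, under the assumption that energy efficiency is defined as MTEPS per watt. The names `DESIGNS`, `energy_efficiency`, and `improvement_over` are hypothetical, and entries reported as "-" in the table are represented as `None`.

```python
from typing import Optional

# (throughput in MTEPS, power in W) copied from Table 3.
# None marks a value reported as "-" in the table.
DESIGNS = {
    "Grapher": (2265.0, 3.988),
    "Hitgraph": (3410.0, 10.7),
    "ThunderGP": (5510.0, 46.0),
    "ForeGraph": (1846.0, None),
    "[17]": (1800.0, 23.0),
}


def energy_efficiency(throughput: float, power: Optional[float]) -> Optional[float]:
    """Return MTEPS/W, or None when the power figure is unavailable."""
    if power is None:
        return None
    return throughput / power


def improvement_over(baseline: str, target: str = "Grapher") -> Optional[float]:
    """Return the ratio of the target's energy efficiency to the baseline's."""
    target_eff = energy_efficiency(*DESIGNS[target])
    baseline_eff = energy_efficiency(*DESIGNS[baseline])
    if target_eff is None or baseline_eff is None:
        return None
    return target_eff / baseline_eff


if __name__ == "__main__":
    for name, (mteps, watts) in DESIGNS.items():
        eff = energy_efficiency(mteps, watts)
        print(name, "-" if eff is None else f"{eff:.1f} MTEPS/W")
    print(f"vs Hitgraph: {improvement_over('Hitgraph'):.1f}x")
    print(f"vs ThunderGP: {improvement_over('ThunderGP'):.1f}x")
```

Under this definition, the ratios round to the 1.8× (Hitgraph) and 4.7× (ThunderGP) improvements stated in the abstract. Note that Grapher reaches this efficiency at a lower absolute throughput than both designs, so the gain comes from its much lower power.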

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
