Degree-Aware Graph Neural Network Quantization

In this paper, we investigate the problem of graph neural network quantization. Despite its great success on convolutional neural networks, directly applying current network quantization approaches to graph neural networks faces two challenges. First, the fixed scale parameter in current methods cannot flexibly fit diverse tasks and network architectures. Second, the variation of node degrees in a graph leads to uneven responses, limiting the accuracy of the quantizer. To address these two challenges, we introduce learnable scale parameters that can be optimized jointly with the graph networks. In addition, we propose degree-aware normalization to process nodes with different degrees. Experiments on different tasks, baselines, and datasets demonstrate the superiority of our method against previous state-of-the-art ones.


Introduction
Different from regular data like images and videos, graph data are a special type of non-Euclidean irregular data, which cannot be directly processed by convolutional neural networks (CNNs). To remedy this, graph neural networks (GNNs) are developed to handle such irregularly structured data and are widely applied in areas like social networks [1], natural science [2,3], knowledge graphs [4], data mining [5], and recommendation systems [6,7]. Although GNNs are commonly shallower than CNNs and have fewer parameters, their computational cost is tightly related to the input graph size. Considering that graph sizes range from hundreds of nodes to billions of nodes, the high computational cost of GNNs becomes one of their major obstacles in real-world scenarios and hinders potential applications on resource-limited devices.
To improve the efficiency of GNNs, numerous network compression techniques successfully applied in CNNs have attracted increasing interest, including low-rank factorization, network pruning, network quantization, and knowledge distillation. Among these techniques, network quantization aims to represent the full-bit values in the network with low-bit ones and to employ efficient integer arithmetic instead of expensive floating-point arithmetic. As a result, the memory consumption and the computational cost of the quantized networks are significantly reduced without changing the architecture. Considering these advantages, this technique shows great potential for GNN acceleration.
Despite being successfully applied in CNNs, directly extending network quantization to GNNs may cause a severe performance drop [8]. Due to the high variance of node degrees in a graph (ranging from one to hundreds or thousands), the magnitudes of the responses for different nodes vary significantly in a GNN. This large range poses a great challenge to existing network quantization methods and limits their accuracy. To address this issue, Tailor et al. [8] proposed Degree-Quant, which first considers node degree during network quantization and produces promising improvements. However, a heuristic scale parameter is used in Degree-Quant, which cannot flexibly adapt to diverse graph data. In addition, Degree-Quant relies on explicitly calculating the node degree to adaptively process nodes of varying degrees, which is quite time-consuming.
To address the aforementioned issues, in this paper, we propose a degree-aware quantization method for GNNs. Specifically, we introduce learnable scale parameters and optimize them together with the network parameters during training, allowing the quantizer to adapt to diverse graph data. In addition, we propose a simple yet effective degree-aware normalization method to normalize the responses of nodes with different degrees to a common range. Our method is generic and compatible with different GNN architectures and tasks. The contributions of this paper can be summarized as follows:

•
We develop a degree-aware quantization method for GNNs. Our method leverages learnable scale parameters to achieve flexible adaptation to various graph data and employs degree-aware normalization to avoid the adverse effect of varying node degrees on network quantization.

•
We successfully apply our quantization method to typical GNN architectures and representative tasks (including node classification, graph classification, and graph regression). Extensive experiments validate the effectiveness of our designs and demonstrate the superiority of our method against previous approaches.

Related Work
In this section, we first briefly review several major works on graph neural network acceleration techniques. Then, we discuss recent network quantization approaches that are closely related to our work.

Acceleration of Graph Neural Networks
Due to their capability of processing non-Euclidean data, GNNs have gained broad application and extensive research attention in various fields, such as complex systems [9] and social networks [10]. However, due to their high computational complexity, these methods do not scale to large-scale data. In order to handle large-scale graph data on resource-limited devices, numerous efforts have been made to accelerate graph neural networks, which can be roughly divided into graph-based and model-based approaches.
Graph-based methods aim at improving the speed of graph neural networks by accelerating their graph operations. Early graph neural networks process full graphs and suffer redundant computational cost. To remedy this, techniques like graph sampling [11], graph sparsification [12,13], and graph partition [14] are widely studied to sample partial graphs to reduce the graph size, remove unimportant edges to increase the sparsity of the graph, and divide the full graph into smaller sub-graphs, respectively. By reducing the computational consumption of graph operations, these methods achieve promising speedups.
Different from graph-based methods, model-based ones aim at accelerating graph neural networks by improving the efficiency of model operations. Specifically, Wu et al. and He et al. developed the lightweight graph neural networks SGC [15] and LightGCN [16], which leverage lightweight network architectures and operation flows to achieve efficient training and inference. Meanwhile, several works employ generic network acceleration techniques like network pruning [17], knowledge distillation [18], and network quantization [19] for speedup. These methods do not require novel network designs and can improve the inference efficiency of existing graph neural networks, thereby attracting increasing interest. Knowledge distillation involves pre-training a complex teacher model and then utilizing a distillation loss to transfer the knowledge from the teacher model to a compact student model, which allows the student model to retain the knowledge of the teacher model. For example, KDGCN [18] proposed a knowledge distillation technique termed the Local Structure Preservation (LSP) module to transfer knowledge for GCNs. Additionally, the KD-framework [20] presents an effective knowledge distillation framework that aims to achieve more efficient and interpretable predictions.

Network Quantization
Network quantization aims at representing full-bit floating-point numbers in the neural network with low-bit ones to reduce the memory and computational consumption.
For example, by quantizing 32-bit weight and activation values to 4-bit ones, the model size is reduced to 1/8 of the original, and the inference speed is significantly improved with the support of integer arithmetic, enabling efficient deployment of neural networks on FPGA platforms [21] or edge devices [22].
The feasibility and advantages of model quantization in traditional convolutional neural networks have been widely discussed. Networks such as BNN [23], TWN [24], and XNOR-Net [25] have been designed to quantize the weights to 1 or 2 bits, improving the inference speed at the cost of a moderate performance drop. Inspired by the great success of network quantization in the area of convolutional neural networks [26], some studies extended this technique to graph convolutional neural networks. Specifically, Wang et al. [19] proposed Bi-GCN, which binarizes the input values and network parameters for speedup. In addition, to address the vanishing gradient issue during backpropagation caused by binarization, a new backpropagation method was designed for training. Tailor et al. [8] designed a quantization method tailored for GNNs, termed Degree-Quant. Specifically, they introduced a mask parameter to encourage nodes with higher degrees to retain their original accuracy, thereby mitigating the problem of large degree variations between different nodes. Despite promising performance, this approach introduces considerable memory and computational cost during training, which largely increases its training burden.

Self-Supervised Graph Representation Learning
In addition to high computational complexity, the annotation cost of graph data is also expensive and imposes challenges on GNNs. Motivated by the success of self-supervised learning in the fields of natural language processing and computer vision, numerous efforts have been made to extend self-supervised learning to GNNs [27]. By contrasting similar nodes against dissimilar ones, discriminative representations can be learned from unlabeled data in an unsupervised manner, which can be transferred to downstream graph-based tasks to speed up the training phase [28,29]. Currently, self-supervised graph representation learning has drawn increasing interest.

Methodology
In this section, we first introduce the preliminaries. Then, we present our degree-aware quantization in detail.

Preliminaries
Network quantization aims at converting full-bit floating-point values in the network to low-bit ones to reduce the memory consumption and computational cost. Assuming that x represents a floating-point number and x_q represents the quantized value, the quantization function can be defined as:

x_q = clamp(round(x / S + Z), Q_min, Q_max).    (1)

Here, the clamp(·) function truncates the input number within the specified range, and Q_max and Q_min are the maximum and minimum quantized values (for N-bit signed quantization, Q_max = 2^(N-1) - 1 and Q_min = -2^(N-1)). S and Z are the scale parameter and the zero point, respectively. These two parameters can be calculated as:

S = (q_max - q_min) / (Q_max - Q_min),    Z = round(Q_min - q_min / S),

where q_max and q_min are the maximum and minimum values of the floating-point numbers.
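As a concrete illustration, the quantization function above can be sketched in a few lines of Python (the function names are ours; the signed 4-bit range Q_min = -8, Q_max = 7 is the standard choice):

```python
def scale_and_zero_point(q_min, q_max, Q_min, Q_max):
    """Compute S and Z from the float range [q_min, q_max]."""
    S = (q_max - q_min) / (Q_max - Q_min)
    Z = round(Q_min - q_min / S)
    return S, Z

def quantize(x, S, Z, Q_min, Q_max):
    """x_q = clamp(round(x/S + Z), Q_min, Q_max)."""
    x_q = round(x / S + Z)
    return max(Q_min, min(Q_max, x_q))  # clamp(.) truncates to the quantized range

def dequantize(x_q, S, Z):
    """Approximate reconstruction of the original float value."""
    return S * (x_q - Z)

# Example: 4-bit signed quantization of values in [-1, 1].
S, Z = scale_and_zero_point(-1.0, 1.0, Q_min=-8, Q_max=7)
x_q = quantize(0.5, S, Z, Q_min=-8, Q_max=7)
x_hat = dequantize(x_q, S, Z)  # reconstruction error is bounded by the scale S
```

The reconstruction error of any in-range value is at most one quantization step S, which is why a scale fitted to the actual value distribution matters.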

Degree-Aware Quantization
Our quantization framework with learnable scale parameters and degree-aware normalization is illustrated in Figure 1. Taking a graph convolution as an example, the response x from the previous layer is first fed to the degree-aware normalization to normalize the values to a small range. Then, the normalized values and the convolutional kernel values are passed through the clip, scaling, and quantization steps, resulting in x_q and w_q. Note that the scale parameters in the scaling step are differentiable and optimized by the task loss. With the quantized values, graph convolution is conducted, and the result is then rescaled using the scale parameters. In this section, we first introduce the motivation. Then, we detail the degree-aware normalization and the learnable scale parameter.

Motivation
We first conduct experiments to study the values to be quantized in GNNs. As illustrated in [8], the responses of nodes with different degrees vary a lot. As shown in Figure 2, the response x after the aggregation layer grows as the node degree increases. Therefore, the values to be quantized span a wide range, which is difficult for the quantizer to cover well. Meanwhile, it can be observed that the variance σ of the response values shares a similar trend with the response values x. This observation inspires us to use the variance values to normalize the response values. In this way, the responses are constrained to a small range (the red line in Figure 2), which allows the quantizer to quantize these values with low quantization errors for higher accuracy.

Degree-Aware Normalization
In graph data, each node is connected to adjacent nodes, and the number of connected nodes is called its degree. Generally, the degree is highly uneven in a graph, ranging from one to hundreds or thousands. As discussed in Degree-Quant [8], the quantization error in graph convolutional neural networks mainly comes from the aggregation phase. In this phase, each node combines the feature information from its adjacent nodes in a permutation-agnostic manner. Consequently, nodes with higher degrees collect more information from their adjacent nodes, resulting in higher responses after the aggregation phase. As a result, the ranges of the full-bit responses for different nodes vary widely, which introduces critical challenges for quantization.
As shown in Figure 2, the aggregation layer produces higher responses for nodes with higher degrees. Meanwhile, the variance of the responses also increases roughly linearly. Motivated by this observation, the aggregation responses are normalized by dividing them by their corresponding variances before being fed into the quantizer. This ensures that the values produced by nodes with different degrees are constrained within a small range, enabling the quantizer to reduce quantization errors for higher accuracy. Subsequently, we multiply the quantized results by the variance value again to restore the range of the results.
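A minimal toy sketch of this idea (our own illustration, not the paper's code; we use the standard deviation as the spread statistic): responses of high-degree nodes span a much wider range, and dividing each group by its own spread pulls both groups into a comparable range before quantization.

```python
import statistics

def normalize_by_spread(responses):
    """Divide responses by their spread; return the factor so it can be restored."""
    sigma = statistics.pstdev(responses) or 1.0  # guard against zero spread
    return [r / sigma for r in responses], sigma

low_degree = [0.1, -0.2, 0.15, -0.1]   # small responses, small spread
high_degree = [4.0, -6.0, 5.5, -3.5]   # large responses, large spread

low_n, s_low = normalize_by_spread(low_degree)
high_n, s_high = normalize_by_spread(high_degree)

# After normalization both groups fall within a similar small range, so a
# single quantizer covers them well; multiplying by sigma afterwards
# restores the original magnitude.
restored = [v * s_high for v in high_n]
```

After normalization, the maximum magnitude of both groups is below 2, whereas the raw responses differed by more than an order of magnitude.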

Learnable Scale Parameter
It is demonstrated in [30] that the scale parameter significantly affects the accuracy of the quantized networks. As discussed in the preliminary section, the scale parameters in current methods are usually pre-defined and fixed, which cannot flexibly fit various datasets, networks, and tasks. To remedy this, we make the scale parameters learnable so that they are trainable during backpropagation. According to the quantization function in Equation (1), the gradient of the dequantized value x̂ = S(x_q - Z) with respect to the scale parameter is derived (using the straight-through estimator for the rounding operation) as:

∂x̂/∂S = round(x/S + Z) - Z - x/S,  if Q_min < x/S + Z < Q_max;
∂x̂/∂S = Q_min - Z,                 if x/S + Z ≤ Q_min;
∂x̂/∂S = Q_max - Z,                 if x/S + Z ≥ Q_max.

In our experiments, the scale parameters are initialized to k times the standard deviation of the data in the first batch. By using trainable scale parameters, the quantizer can adaptively fit the distributions of full-bit values in diverse networks developed for various tasks and datasets.
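This gradient rule can be checked numerically. The sketch below (names are ours) assumes the symmetric case Z = 0 and applies the straight-through estimator, under which the rounding operation is treated as the identity inside the quantization range:

```python
def grad_scale(x, S, Q_min, Q_max):
    """d(x_hat)/dS for x_hat = S * clamp(round(x/S), Q_min, Q_max), with Z = 0."""
    v = x / S
    if v <= Q_min:
        return float(Q_min)   # clipped below: x_hat = S * Q_min
    if v >= Q_max:
        return float(Q_max)   # clipped above: x_hat = S * Q_max
    return round(v) - v       # in range: rounding residual under the STE

# v = 0.25/0.125 = 2.0 lands exactly on a grid point: zero gradient.
g_in = grad_scale(0.25, S=0.125, Q_min=-8, Q_max=7)
# v = 0.30/0.125 = 2.4: gradient is the rounding residual 2 - 2.4 = -0.4.
g_mid = grad_scale(0.30, S=0.125, Q_min=-8, Q_max=7)
# v = 2.00/0.125 = 16 exceeds Q_max: gradient saturates at Q_max = 7.
g_clip = grad_scale(2.00, S=0.125, Q_min=-8, Q_max=7)
```

Note the clipped branches give large constant gradients, so the scale grows when many values are clipped and shrinks when values sit far inside the range, which is exactly the adaptive behavior we want.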

Experimental Results
In this section, we first introduce the implementation details. Then, we compare our method with previous ones in terms of both accuracy and efficiency. Finally, we conduct experiments to validate the effectiveness of our method.

Datasets and Metrics
We conduct experiments on node classification, graph classification, and graph regression tasks. In our experiments, we employ the Cora dataset, the REDDIT-BINARY dataset, the MNIST-Superpixels dataset, and the ZINC dataset for both training and evaluation. The details of these datasets are described in Table 1. For the classification tasks, we evaluate the model performance based on its accuracy on the test set. For the regression task, we employ the mean absolute error (MAE) between the predicted values and the actual labels as the metric.

•
The Cora dataset [31] contains a single graph representing a citation network, where each node corresponds to a research paper. The edges between nodes represent the citation relationships among the papers. The node features are binary indicators of the presence or absence of specific words in the corresponding papers. The task on the Cora dataset is to classify the nodes into their respective labels.

•
The MNIST-Superpixels dataset is obtained from the MNIST dataset using the SLIC algorithm [32]. Each graph is derived from a respective image, and each node represents a superpixel, i.e., a set of pixels sharing perceptual similarities. This dataset is widely applied for the graph classification task.

•
The REDDIT-BINARY dataset [33] consists of 2000 graphs corresponding to online discussions on the Reddit website. Each graph represents an online discussion thread, where nodes represent users and an edge connects two nodes if there has been a message response between the corresponding users. A graph is labeled according to whether it belongs to a question/answer-based community or a discussion-based community. This dataset is employed for the graph classification task.

•
The ZINC dataset consists of molecular graphs, where each node represents an atom. The task involves graph regression [34], specifically predicting the constrained solubility from the graph representations of the molecules.

In our experiments, three popular graph neural networks are employed as baselines, including the graph convolutional network (GCN), the graph attention network (GAT), and the graph isomorphism network (GIN). The settings of the three baselines are presented in Table 2. In addition, as the message passing function [35] is the key process in GNNs, its formulation in the different networks is detailed as follows.

•

Graph convolutional network (GCN) [36]: The message passing function is defined as follows:

h'_v = Θ Σ_{w ∈ N(v) ∪ {v}} h_w / sqrt(d_v · d_w),

where d_v represents the degree of node v (with the self-loop included), N(v) denotes the neighbors of node v, h_w is the feature of node w, and Θ is a learnable weight matrix.

•
Graph attention network (GAT) [37]: In GAT, attention coefficients α are introduced and calculated based on task-specific query vectors and the input information, allowing higher weights to be placed on more valuable feature information. The message passing function is defined as follows:

h'_v = Σ_{w ∈ N(v) ∪ {v}} α_{v,w} · Θ h_w.

Here, a self-attention mechanism is utilized:

e_{v,w} = LeakyReLU(a^T [Θ h_v ∥ Θ h_w]),

where e_{v,w} represents the significance of node w to node v, a is a single-layer feedforward neural network, and LeakyReLU is a non-linear activation function. Finally, the attention coefficients are obtained through a normalization layer:

α_{v,w} = exp(e_{v,w}) / Σ_{k ∈ N(v) ∪ {v}} exp(e_{v,k}).

•

Graph isomorphism network (GIN) [38]: GIN leverages the isomorphism property of graphs to complete diverse graph-related tasks. Its message passing function is defined as follows:

h'_v = f((1 + ε) · h_v + Σ_{w ∈ N(v)} h_w),

where f is a learnable injective function, such as a multi-layer perceptron (MLP), and ε is a learnable parameter.
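To make the degree-dependent aggregation concrete, the following is a toy, pure-Python sketch of GCN-style message passing on an adjacency list (our own illustration; the feature transform Θ is taken as the identity, since only the aggregation step matters here):

```python
import math

def gcn_aggregate(features, neighbors):
    """h'_v = sum over w in N(v) ∪ {v} of h_w / sqrt(d_v * d_w),
    with d_v = 1 + |N(v)| to account for the self-loop."""
    deg = {v: 1 + len(ns) for v, ns in neighbors.items()}
    out = {}
    for v, ns in neighbors.items():
        acc = 0.0
        for w in list(ns) + [v]:  # include the self-loop
            acc += features[w] / math.sqrt(deg[v] * deg[w])
        out[v] = acc
    return out

# Tiny path graph 0 - 1 - 2 (undirected adjacency list).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = {0: 1.0, 1: 2.0, 2: 3.0}
h = gcn_aggregate(features, neighbors)
```

Even on this tiny graph, the higher-degree node 1 sums over more neighbors than nodes 0 and 2, which is exactly the degree-dependent response magnitude that motivates our normalization.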

Experimental Setup
In our experiments, we utilized the Adam optimizer with β1 = 0.9 and β2 = 0.999 for training. For GCN on the Cora and REDDIT-BINARY datasets, the batch size was set to 128. Meanwhile, for GCN on the MNIST and ZINC datasets, the batch size was set to 5. Other detailed hyperparameters, including the learning rate and the number of epochs, are presented in Table 3. For our method, an additional parameter k was employed to determine the initialization range of the quantized data. We set k = 3 for the Cora dataset and k = 1 for the REDDIT-BINARY dataset. We then conducted experiments to study the training efficiency of our method. Specifically, we measured the average training time on the REDDIT-BINARY dataset using GIN as the baseline, with DQ employed as the state-of-the-art method for comparison.
As shown in Table 5, our method achieves a significant efficiency improvement, with a 12.9× speedup as compared to DQ. As discussed in Section 3.2.2, DQ relies on explicitly calculating the node degree, which introduces considerable computational cost. In contrast, our method leverages the efficient degree-aware normalization with only negligible additional cost, thereby achieving a significant speedup. This further demonstrates the high efficiency of our method.

Model Analyses
In this subsection, we conduct ablation experiments to study different components in our method, including the learnable scale parameter and degree-aware normalization.

Learnable Scale Parameter
The learnable scale parameter enables the quantizer to flexibly adapt to various distributions of full-bit values in the network. To study its effectiveness, we replaced the learnable scale parameters with fixed ones for comparison and conducted experiments on the Cora dataset. As shown in Table 6, the accuracy achieved by different graph networks suffers a notable drop when fixed scale parameters are employed. This is because fixed scale parameters cannot fit the diverse data in the dataset well, which clearly validates the effectiveness of our learnable scale parameters. As discussed in Section 3.2.2, the initialization of the learnable scale parameters is critical to the final performance. Consequently, we conduct experiments to study the effect of different initializations. Specifically, we set k to 0.5, 1, 3, and 5 and conduct experiments on the Cora dataset and the REDDIT-BINARY dataset for comparison. From Table 7, we can see that our method produces high accuracy on both the Cora dataset and the REDDIT-BINARY dataset with k = 3. Consequently, k = 3 is used as the default setting of our method. During training, our learnable scale parameters are optimized jointly with the network parameters. Therefore, we further conduct experiments to investigate their convergence during training. Specifically, we visualize the convergence of the scale parameters in Figure 3. As we can see, the scale parameter is updated during training and gradually converges to fit the distributions of the float values. This validates the effectiveness and flexibility of our learnable scale parameters.

Degree-Aware Normalization
To address the issue that different nodes in a graph have different degrees, we employ degree-aware normalization to constrain the responses within a certain range. To validate its effectiveness, we remove this operation and compare the resulting performance to that of our original method. The results are presented in Table 8. As we can see, our degree-aware normalization consistently introduces notable accuracy gains for different GNNs. For example, it produces an accuracy improvement of 4.6% on GCN for INT4 quantization. This demonstrates its effectiveness in handling the variation of node degrees in a graph.

Discussion
Our method shares a similar goal with Degree-Quant [8], namely obtaining low-bit GNNs for efficient inference. However, a heuristic scale parameter is used in Degree-Quant, which cannot flexibly adapt to diverse graph data. In addition, Degree-Quant relies on explicitly calculating the node degree to adaptively process nodes of varying degrees, which is quite time-consuming. In contrast, our method leverages learnable scale parameters to achieve flexible adaptation to various graph data and employs degree-aware normalization to avoid the adverse effect of varying node degrees on network quantization. One limitation of our method is that its accuracy suffers notable drops for bit widths lower than 4. In the future, we will study binarized GNNs for further efficiency improvements.

Conclusions
In this paper, we propose a degree-aware network quantization method for graph neural networks. Specifically, we propose learnable scale parameters to fit the various distributions of full-bit values in the network. In addition, we develop degree-aware normalization to handle nodes with different degrees. Experiments demonstrate the effectiveness of our method against previous approaches on diverse tasks, datasets, and network architectures.

Figure 1 .
Figure 1. Quantization process with learnable scale parameters and degree-aware normalization.

Figure 2 .
Figure 2. Analysis of values collected after aggregation at the final layer of an FP32 GIN trained on Cora. x represents the response value, and σ represents the variance of x.

Table 1 .
Details of the datasets applied in this paper.

Table 2 .
Detailed parameters of the model architectures.

Table 4 .
Results produced by different methods on different datasets.

Table 5 .
Training time (s) for different methods.

Table 6 .
Accuracy (%) achieved by models with and without learnable scale parameters.

Table 7 .
Accuracy achieved by models with different initialization values of learnable scale parameters.

Table 8 .
Accuracy (%) achieved by models with and without degree-aware quantization.
Figure 3. Evolution of the learnable scale parameter during training.