Perceptual Hash of Neural Networks

In recent years, advances in deep learning have boosted the practical development, distribution and implementation of deep neural networks (DNNs). The concept of symmetry is often adopted in a deep neural network to construct an efficient network structure tailored for a specific task, such as the classic encoder-decoder structure. DNN models are massive in number and diverse in category and in the open-source frameworks used to implement them. Therefore, the retrieval of DNN models has become a problem worthy of attention. To this end, we propose a new idea of generating perceptual hashes of DNN models, named HNN-Net (Hash Neural Network), to index similar DNN models by similar hash codes. The proposed HNN-Net is based on graph neural networks and consists of two stages: the graph generator and the graph hashing. In the graph generator stage, the target DNN model is first converted and optimized into a graph, which is then assigned additional information extracted from the execution of the original model. In the graph hashing stage, the network learns to construct a compact binary hash code. The constructed hash function preserves the features of both the topology structure and the semantic information of a neural network model. Experimental results demonstrate that the proposed scheme effectively represents a neural network with a short hash code, and that it is generalizable and efficient on different models.


Introduction
Of late, deep learning [1] has attracted the greatest attention in both academia and industry. It is increasingly applied to a variety of fields [2], such as image processing [3,4], natural language processing [5], audio processing [6,7], biometrics [8], etc. The core task is designing deep neural network (DNN) models. Researchers are developing new DNN models and their variants for specific tasks using frameworks such as TensorFlow, PyTorch, MXNet, Keras, Caffe, etc.
Obtaining a good pre-trained model requires a large amount of training data, a delicate neural network design and expensive computing resources, so each pre-trained model represents a considerable overall cost. However, these models are at risk of being stolen in direct or indirect ways. For example, pirates can steal a model directly by illegally copying the DNN model files, or indirectly by using the APIs that manufacturers open to users together with technical means such as knowledge distillation [9]. With the widespread use of cloud computing, DNN models are usually deployed on cloud server platforms. Pirated models may also be deployed there, and the platform needs to provide users with an infringement discovery service. Due to the huge number of DNN models, we need a technique to discover whether an uploaded model is suspected of piracy; therefore, we propose a new method: perceptual hash for neural networks. The flow chart is shown in Figure 1. We create a hash code for each model. Once a new model is uploaded, we compare the hash code of the new model with those of the existing models to retrieve similar models.

Figure 1. Flowchart of preventing pirated deployment. We create a hash code for each model on the cloud server. When a DNN model is uploaded, the administrator will compare the new hash code with the existing hash codes. The uploaded model will be further checked if a similar hash code is retrieved.
To deal with the DNN retrieval task, we propose HNN-Net, a supervised two-stage scheme based on the graph neural network (GNN) [10]. In the first stage, we use a graph generator to convert the DNN model to an undirected graph that is assigned additional information extracted after a series of operations. In the second stage, we use the graph hashing network to generate compact hash codes. As Hamming distance computation is fast on a CPU, the proposed DNN model hashing can be applied to a large-scale database for retrieval. The proposed method is evaluated on a DNN model benchmark dataset we collected. The experimental results demonstrate that the proposed method is effective for neural network retrieval. The details of the two stages are shown in Figure 2, and we will further discuss the deep neural network hashing in Section 3.
Our contributions are three-fold, as follows:
• We propose a new idea of generating perceptual hashes for neural networks, which can be used in model protection;
• The proposed deep hashing scheme, based on graph neural networks, is compatible with all kinds of deep learning frameworks;
• The proposed method is effective and achieves good retrieval performance.
The rest of this article is organized as follows. In Section 2, we present the related works of the perceptual hash of neural networks. In Section 3, we define the problem and describe the details of the proposed DNN hash method. We evaluate the capacity of our scheme and set up some ablation experiments in Section 4. Section 5 concludes the article by recapitulating our work.

Figure 2. The details of our two-stage scheme. In the first stage, a raw DNN model is converted into a graph with an initial node feature after a series of operations. Afterward, the graph generated by Stage 1 gets its graph-level feature embedding. Then the embedding is used to generate Q-bit hash codes after binarization and to predict the classification of the input model.

Computational Graph
A neural network model can be regarded as a series of complex computations with trainable parameters organized in a specific structure. One of the tools we use to describe the DNN model is a computational graph [11,12], which is a directed graph used to express a set of structured computations whose nodes and edges represent operators and directions of dataflow, respectively. An operator stands for a calculation function applied to its input, and data flows along the graph's edges. When a data flow passes through a node, it is processed by the node's operator, which generates a new value. Two examples of computational graphs for some simple functions are shown in Figure 3. Note that computational graphs are widely used for back-propagation on neural networks. Any DNN model can be represented as a computational graph.

Figure 3. The computational graph for the function $f(x, y) = ax^2 + bxy + cy^2$. Each node stands for a specific operator such as addition and multiplication. Each edge stands for the direction of the dataflow. The graph is fed with x and y as input and produces an output as the final result.
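To make the notion concrete, the following minimal Python sketch builds and evaluates a computational graph for $f(x, y) = ax^2 + bxy + cy^2$. The `Node` class, the node names and the `evaluate` helper are illustrative assumptions, not part of any framework discussed in this article.

```python
from dataclasses import dataclass, field
import operator


@dataclass
class Node:
    name: str
    op: object = None                            # callable; None marks an input node
    inputs: list = field(default_factory=list)   # names of upstream nodes


def evaluate(nodes, feeds):
    """Evaluate nodes in the given (topological) order and return all values."""
    values = dict(feeds)
    for node in nodes:
        if node.op is None:
            continue  # input node: its value comes from `feeds`
        args = [values[name] for name in node.inputs]
        values[node.name] = node.op(*args)
    return values


def build_quadratic_graph():
    """Graph for f(x, y) = a*x^2 + b*x*y + c*y^2 using only mul and add nodes."""
    mul, add = operator.mul, operator.add
    return [
        Node("x"), Node("y"), Node("a"), Node("b"), Node("c"),
        Node("xx", mul, ["x", "x"]),
        Node("xy", mul, ["x", "y"]),
        Node("yy", mul, ["y", "y"]),
        Node("axx", mul, ["a", "xx"]),
        Node("bxy", mul, ["b", "xy"]),
        Node("cyy", mul, ["c", "yy"]),
        Node("s", add, ["axx", "bxy"]),
        Node("f", add, ["s", "cyy"]),
    ]


graph = build_quadratic_graph()
result = evaluate(graph, {"x": 2, "y": 3, "a": 1, "b": 4, "c": 5})["f"]
```

Each entry of `inputs` corresponds to one directed edge of the graph, so the same structure can be traversed backwards for back-propagation.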

Deep Image Hashing
Traditional cryptographic hashing algorithms have an "avalanche effect", which implies that a minute difference between two input data will cause a significant difference in the output hash codes. On the contrary, perceptual hashing generates similar hash codes for two similar images and diverse hash codes for different images, which has blazed a trail for researchers to measure image similarity.
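The avalanche effect can be observed with a short sketch, assuming SHA-256 from the Python standard library as the cryptographic hash; the two input strings are arbitrary placeholders. A one-character change flips roughly half of the 256 output bits, which is exactly the behavior a perceptual hash must avoid.

```python
import hashlib


def sha256_bits(data: bytes) -> int:
    """Return the SHA-256 digest of `data` as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bits that differ between the SHA-256 digests of a and b."""
    return bin(sha256_bits(a) ^ sha256_bits(b)).count("1")


# Two inputs that differ in a single character (placeholder strings).
changed = bit_difference(b"resnet18-weights", b"resnet18-weightz")
unchanged = bit_difference(b"resnet18-weights", b"resnet18-weights")
```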
Deep image hashing makes use of deep learning to generate perceptual hash codes for images. CNNH [13] is an early work to learn image representation and image hashing.
To unify representation learning and hashing learning, DPSH [14] and HashNet [15] both propose end-to-end hashing learning approaches based on pair-wise loss. Moreover, DNNH [16] utilizes the triplet ranking loss to guide the learning for the hash function. In addition, SDH [17] makes use of an objective function to minimize the intra-class variations and maximize the inter-class variations.
SSDH [18] assumes that the semantic labels are governed by several latent attributes, each of which is either on or off, and that classification relies on these attributes. Based on this assumption, SSDH employs a softmax classifier to guide deep hashing learning to better exploit the semantic information of images. This idea inspired us to adopt a classification layer to make better use of the semantic information of DNN models and to unify DNN model classification and retrieval in a single learning model, which is trained with pair-wise loss.

Graph Hashing
A graph is a data structure that models a set of objects (nodes) and the relationships (edges) between them, and it is widely used in fields such as knowledge graphs [19], natural sciences [20] and many others [21,22]. The graph neural network (GNN) [10] is a widely used technique to process graph-structured data based on deep neural networks. The graph convolution network (GCN) [22] defines a specific mechanism by which every node gathers information from its neighboring nodes and learns to generate its representation. Compared with early hand-crafted features, it performs much better in extracting effective features from graphs and representing nodes, edges or graphs as low-dimensional vectors. Additionally, attention mechanisms such as [23-27] have been proposed for graph-level or node-level tasks.
To solve the problem of graph similarity search, many GNN-based methods [22,26,28-30] have been proposed to generate a graph-level embedding to minimize the distance between two graphs [31]. However, they are not efficient enough to search large-scale databases in real time.
To achieve a fast graph similarity search, GHashing [31] is the latest attempt that uses a GNN to generate binary hash codes and graph-level embeddings. It uses graph attention pooling (GAP) [26] as its attention mechanism. However, GHashing cannot convert DNN models into hash codes directly, which makes it unsuitable for practical application to our task. We therefore take it as the foundation of our network for neural network hashing. Different from GHashing, our network can take DNN models directly as input to produce hash codes.

Problem Definition
We denote $\{M_i\}_{i=1}^{N}$ as a DNN model database associated with its classification labels, and $M_q$ as an arbitrary DNN model used as the query. DNN model retrieval is the task of retrieving from the database the models most similar to the query $M_q$. A DNN model refers to a concrete executable DNN model implemented in a specific deep learning framework. Take PyTorch [32] as an instance: a Python class inherited from "torch.nn.Module" with its parameters loaded constitutes a complete DNN model.
As we introduced in Section 1, the model can be converted into a graph. In this procedure, we trace the data flow until the model terminates to get a trace with a full record of the operations that occurred in the data flow. From the trace, we can obtain rich information in addition to the topology structure of the graph, such as the number of FLOPs (floating-point operations) and the number of trainable parameters of each layer.
To handle the task, our proposed scheme learns a hash function that maps a DNN model $M_q$ to a Q-bit binary hash code $h$.
Given $M_q$ as the query, we compute the Hamming distances between its hash code $h$ and the hash codes of the models in the database. The Hamming distance measures the similarity between two binary hash codes and is calculated as:

$$d_H(h_1, h_2) = \sum_{k=1}^{Q} (h_1 \oplus h_2)_k,$$

where ⊕ is the exclusive OR operation and $h_1$ and $h_2$ are two hash codes. The details of the two-stage scheme are shown in Figure 2, and the overall training procedure is shown in Figure 4. We first convert a DNN model $M_q$ into an optimized computational graph, which is assigned additional information captured from the execution of the model, and then feed it into a GNN-based neural network consisting of graph feature extraction layers and fully-connected layers. The output is a Q-bit binary hash code. $L_{topo}$ and $L_{class}$ are used for preserving the topology structure similarity and the semantic similarity between models, respectively. $L_{quant}$ reduces the quantization loss caused by the conversion from continuous activation values to a discrete binary hash code.
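As an illustration of the retrieval step, the sketch below ranks database entries by Hamming distance to a query code. The model names and the 8-bit codes are made-up placeholders, and storing each hash code as a Python integer is an assumption of this example.

```python
def hamming_distance(h1: int, h2: int) -> int:
    """Number of differing bits: the population count of h1 XOR h2."""
    return bin(h1 ^ h2).count("1")


def retrieve(query_code: int, database: dict, k: int = 3) -> list:
    """Return the names of the k entries closest to the query code.

    Ties are broken by name so that the ranking is deterministic.
    """
    ranked = sorted(
        database.items(),
        key=lambda item: (hamming_distance(query_code, item[1]), item[0]),
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical 8-bit hash codes for three stored models.
database = {
    "model_a": 0b10110010,
    "model_b": 0b10110011,
    "model_c": 0b01001101,
}
top = retrieve(0b10110110, database, k=2)
```

Because the distance reduces to one XOR and one bit count per entry, a linear scan over a large database stays cheap on a CPU.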

Stage 1: Graph Generator
A DNN model $M_q$ was first converted, with a trigger image $I$, to an undirected acyclic graph $G = (V, E, F)$, where $V$ is the set of nodes, $E \subseteq V \times V$ is the set of edges, and $F$ is a set of functions that map a vertex to multiple attributes. The computational graph would be too large to handle if we defined basic operators like addition or multiplication as nodes. Instead, we merged groups of basic operators into a single graph node, which corresponds to a network layer in deep learning.
To make our scheme compatible with all kinds of existing deep learning frameworks, these merges followed a standard open-source format called ONNX (Open Neural Network Exchange) [33].

ONNX Operation
We denoted by $f_{onnx}$ the function that maps a DNN model to the computational graph whose operators and data types obey the ONNX [33] standard. ONNX provides the definition of its built-in operators and standard data types, where each computation dataflow graph is structured as a list of nodes that form a graph [33]. The set of built-in operators is portable across frameworks, and every framework supporting ONNX provides implementations of these operators on the data types. Hence, our scheme is compatible with most existing frameworks, including TensorFlow, PyTorch, MXNet, Keras, Caffe, etc. [33].

Merge Operation
Many DNN structures contain duplicate sub-structure blocks that perform a similar role across different neural networks. Merging specific layers into one block, represented by a single node in the graph, can improve the hashing accuracy. Therefore, we applied hard-coded merge rules, i.e., Conv-ReLU, Conv-BatchNorm, Conv-BatchNorm-ReLU, and Conv-Conv-BatchNorm-ReLU, to the previous graph, and denoted the merge operation by $f_{merg}$. After the ONNX and merge operations, the model $M_q$ is converted to a graph by:

$$G = f_{merg}(f_{onnx}(M_q)).$$
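A minimal sketch of such a merge pass is given below. It is not the authors' implementation: the graph representation (a dict of operator names plus an edge list), the greedy longest-rule-first strategy, and the assumption that `ops` is given in topological order are all choices made for this example. A chain is merged only when each link has exactly one successor and one predecessor.

```python
# Merge rules from the text, ordered longest first so longer blocks win.
MERGE_RULES = [
    ("Conv", "Conv", "BatchNorm", "ReLU"),
    ("Conv", "BatchNorm", "ReLU"),
    ("Conv", "BatchNorm"),
    ("Conv", "ReLU"),
]


def merge_chains(ops, edges, rules=MERGE_RULES):
    """Collapse linear operator chains that match a rule into single nodes.

    ops:   dict node_id -> operator name, assumed to be in topological order
    edges: list of (src, dst) pairs
    Returns (labels, merged_edges) for the merged graph.
    """
    succ = {n: [] for n in ops}
    pred = {n: [] for n in ops}
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)

    owner = {}   # original node -> representative node of its block
    labels = {}  # representative node -> merged operator label

    def match(start, rule):
        """Return the chain of nodes matching `rule` from `start`, or None."""
        if ops[start] != rule[0]:
            return None
        chain = [start]
        for expected in rule[1:]:
            if len(succ[chain[-1]]) != 1:
                return None
            nxt = succ[chain[-1]][0]
            if nxt in owner or len(pred[nxt]) != 1 or ops[nxt] != expected:
                return None
            chain.append(nxt)
        return chain

    for node in ops:
        if node in owner:
            continue  # already absorbed into an earlier block
        chain = next((c for c in (match(node, r) for r in rules) if c), [node])
        for member in chain:
            owner[member] = node
        labels[node] = "-".join(ops[m] for m in chain)

    merged_edges = sorted(
        {(owner[s], owner[d]) for s, d in edges if owner[s] != owner[d]}
    )
    return labels, merged_edges


ops = {0: "Conv", 1: "BatchNorm", 2: "ReLU", 3: "MaxPool", 4: "Conv", 5: "ReLU"}
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
labels, merged_edges = merge_chains(ops, edges)
```

In this toy chain the six operators collapse into three nodes, which shortens the graph the hashing network has to process.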

Node Feature Embedding
To make full use of model characteristics for model hashing, we first extracted the number of FLOPs and parameters from the trace of the data flow. Each node $v_i \in V$ of the established graph $G$ has three additional information mapping functions $F = \{f_1, f_2, f_3\}$ which, respectively, map a vertex to its operator type, the number of FLOPs and the number of parameters of the corresponding layer. In the last encoding step, for any node $v_i$, $f_2$ and $f_3$ return continuous values while $f_1$ returns a discrete value. For the continuous values $f_2(v_i)$ and $f_3(v_i)$, we designed an encoder function $f_{enc}(x) = \min(\lfloor \log_{10}(x) \rfloor + 1, 10)$ to map the input to discrete values, where $\lfloor \log_{10}(x) \rfloor + 1$ is the number of digits in the integer part of $x$, e.g., 3 for 123.45 and 2 for 97.76. To normalize the feature dimensions, we used the min function to ensure that the encoded values fall within {0, 1, 2, ..., 10}. The three attributes were thus all discrete values. Then we encoded them into three one-hot vectors by $f_{oneh}$ and concatenated them into a unified one. The initial node representation $u_i$ for vertex $v_i \in V$ is

$$u_i = [f_{oneh}(f_1(v_i)), f_{oneh}(f_{enc}(f_2(v_i))), f_{oneh}(f_{enc}(f_3(v_i)))],$$

where $[x, y]$ means concatenating $x$, $y$ along the feature dimension.
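The encoding can be sketched as follows. The operator vocabulary `OP_TYPES` is hypothetical, the cap of 10 digits is applied as an upper bound so that codes stay in {0, ..., 10}, and mapping values below 1 (for example a layer with no parameters) to code 0 is an assumption of this sketch.

```python
OP_TYPES = ["Conv", "BatchNorm", "ReLU", "MaxPool", "Gemm"]  # hypothetical vocabulary
MAX_DIGITS = 10


def f_enc(x):
    """Number of digits in the integer part of x, capped at MAX_DIGITS.

    Values below 1 are mapped to 0 (an assumption of this sketch).
    """
    if x < 1:
        return 0
    return min(len(str(int(x))), MAX_DIGITS)


def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def node_feature(op_type, flops, params):
    """Concatenate one-hot codes of operator type, FLOPs digits and parameter digits."""
    return (
        one_hot(OP_TYPES.index(op_type), len(OP_TYPES))
        + one_hot(f_enc(flops), MAX_DIGITS + 1)
        + one_hot(f_enc(params), MAX_DIGITS + 1)
    )


# Example node: a convolution with 118,013,952 FLOPs and 9,408 parameters
# (illustrative numbers only).
u = node_feature("Conv", 118013952, 9408)
```

Counting digits is a coarse logarithmic binning: layers whose costs differ by less than an order of magnitude usually share a bin, which keeps the node feature short.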

Stage 2: Graph Hashing
The graph hashing network H can be divided into three sub-networks, focusing on how to extract graph-level features and how to generate a compact Q-bit binary code that preserves both the topology structure similarity and the semantic similarity among models. We used pair-wise loss between two different DNN models in each iteration for training.
The first sub-network, called graph feature extraction, contained three graph convolution network (GCN) layers [22] and one graph attention pooling (GAP) layer [26], and output a graph-level feature. We denoted $u_i$ as the representation of node $v_i$, and $u'_i$ as the next representation of node $v_i$ produced by a GCN layer.
GCN [22] is a neighbor aggregation method that defines how a node aggregates the embeddings of all its neighbors and learns to generate an output as its own subsequent embedding $u'_i$, as defined in [22]. The final output of the GCN layers, $U \in \mathbb{R}^{N \times D}$, is the feature matrix whose $i$-th row is the feature vector $u_i$. GAP [26] defines how a graph-level embedding is produced from a graph in which every node has its own representation, as defined in [26]. Then, the fully-connected part contains five layers denoted by $FC_1 \sim FC_5$. $FC_1$ receives the output features of GAP as its input, and $FC_1 \sim FC_3$ generate the graph features. The binarization layer generates a Q-bit binary code based on the output of $FC_5$. Finally, the network outputs an embedding in continuous values and a discrete binary hash code.
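For intuition, the sketch below implements one graph convolution step and one attention-based readout in plain Python. It assumes the commonly used symmetric-normalized GCN propagation rule with self-loops and a sigmoid-gated attention readout; these forms, the helper names and the toy inputs are assumptions of this example and may differ in detail from the layers used in HNN-Net.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gcn_layer(adj, U, W):
    """One GCN step: ReLU(D^-1/2 (A + I) D^-1/2 U W)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, U), W)
    return [[max(0.0, v) for v in row] for row in out]


def attention_pool(U, W):
    """Graph-level embedding: nodes weighted by similarity to a global context."""
    n, d = len(U), len(U[0])
    mean = [[sum(U[i][k] for i in range(n)) / n for k in range(d)]]
    context = [math.tanh(v) for v in matmul(mean, W)[0]]
    weights = [
        1.0 / (1.0 + math.exp(-sum(u * c for u, c in zip(row, context))))
        for row in U
    ]
    return [sum(weights[i] * U[i][k] for i in range(n)) for k in range(d)]


# Toy 3-node path graph with 2-dimensional node features and identity weights.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
U0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
U1 = gcn_layer(adj, U0, identity)
graph_embedding = attention_pool(U1, identity)
```

The readout returns a vector whose length is independent of the number of nodes, which is what lets graphs of different sizes share the same fully-connected layers.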

Objective Loss Function
The objective loss function between two models $M_1$ and $M_2$ consists of three parts, namely the topology loss, the classification loss and the quantization loss:

$$L = \alpha L_{topo} + \beta L_{class} + \gamma L_{quant},$$

where α, β and γ are hyper-parameters. $L_{topo}$ encourages our model to preserve more topology structure similarity and a part of the semantic similarity between the two models. It is defined in terms of GED [34], the graph edit distance, which measures the topology structure similarity between two graphs. Since a very large GED is not informative for our model, we used $R$ as an upper bound on it. Inspired by the observation in GHashing [31] that learning an embedding function is better than learning a hash function directly, we applied this constraint to the output of $FC_3$. $L_{class}$ is a classification loss that encourages the network to be discriminative among different model types; it is computed from the classification output of $H$ and the ground-truth model labels $t_1$ and $t_2$ of $M_1$ and $M_2$, respectively. $L_{quant}$ encourages the binary hash code to be more compact and to carry more information in each bit; its definition involves $\mathbf{1}$, a Q-dimensional all-ones vector. Quantization loss is a common problem for hashing based on deep learning: information is lost in the conversion from continuous activation values to a discrete binary hash code by binarization. A common technique to reduce this loss is to encourage each unit of the activation values to be close to either 0 or 1.
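The weighted combination can be sketched as below. Only the weighted sum and the default weights come from the text; the concrete form of each term (squared gap to the clipped GED, cross-entropy, squared distance to the nearer of 0 and 1) and the default value of `R` are assumptions chosen for illustration, not the exact definitions used in HNN-Net.

```python
import math


def topo_loss(e1, e2, ged, R):
    """Assumed form: squared gap between embedding distance and GED clipped at R."""
    dist = sum((a - b) ** 2 for a, b in zip(e1, e2))
    return (dist - min(ged, R)) ** 2


def class_loss(probs, label):
    """Assumed form: cross-entropy of one predicted class distribution."""
    return -math.log(probs[label])


def quant_loss(activations):
    """Assumed form: mean squared distance of each activation to the nearer of 0 and 1."""
    return sum(min(a, 1.0 - a) ** 2 for a in activations) / len(activations)


def total_loss(e1, e2, ged, p1, t1, p2, t2, a1, a2,
               alpha=10.0, beta=10.0, gamma=0.2, R=10):
    """Weighted sum alpha * L_topo + beta * L_class + gamma * L_quant for a model pair.

    R = 10 is a placeholder; the text does not state its value.
    """
    return (
        alpha * topo_loss(e1, e2, ged, R)
        + beta * (class_loss(p1, t1) + class_loss(p2, t2))
        + gamma * (quant_loss(a1) + quant_loss(a2))
    )


# A pair whose embedding distance equals its GED, with confident correct
# predictions and fully binary activations, incurs zero loss.
zero = total_loss([0.0, 0.0], [1.0, 1.0], 2,
                  [1.0, 0.0], 0, [0.0, 1.0], 1,
                  [0.0, 1.0], [1.0, 1.0])
```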

Training Parameters
The hyperparameters were defined as follows: α = 10, β = 10, γ = 0.2. The implementation was based on TensorFlow. For the optimization, we trained the network for 200 epochs using the Adam optimizer with a minibatch size of 10 and a learning rate of 0.001. The whole training process took about 16 h on an NVIDIA GTX 1080 Ti GPU and an Intel Xeon Silver 4210 CPU @ 2.20 GHz. Our dataset was generated with pre-trained neural network models built in PyTorch 1.8.1.

Dataset
Our new DNN model dataset consists of 22 models belonging to 10 categories extracted from the PyTorch package "torchvision.models", as shown in Table 1. For data augmentation, we randomly delete 10% of the nodes and their related edges in the computational graph of each model, repeating this 10 times. Finally, we pick the first 80% of the data as the training dataset and the remaining 20% as the test dataset.
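The augmentation and split can be sketched as follows. The graph representation (a node list plus an edge list), the fixed seed, the rounding of the number of dropped nodes and the helper names are assumptions of this example.

```python
import random


def drop_nodes(nodes, edges, ratio=0.1, rng=random):
    """Remove a random `ratio` of nodes together with every edge touching them."""
    n_drop = max(1, round(len(nodes) * ratio))
    dropped = set(rng.sample(sorted(nodes), n_drop))
    kept_nodes = [n for n in nodes if n not in dropped]
    kept_edges = [(s, d) for s, d in edges if s not in dropped and d not in dropped]
    return kept_nodes, kept_edges


def augment(nodes, edges, copies=10, seed=0):
    """Generate `copies` perturbed variants of one computational graph."""
    rng = random.Random(seed)
    return [drop_nodes(nodes, edges, 0.1, rng) for _ in range(copies)]


def split(samples, train_fraction=0.8):
    """Take the first 80% as training data and the rest as test data."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


# Toy chain graph with 20 nodes standing in for a model's computational graph.
nodes = list(range(20))
edges = [(i, i + 1) for i in range(19)]
variants = augment(nodes, edges)
train, test = split(variants)
```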

Evaluation Metrics
We conducted traditional retrieval experiments and used common performance metrics in the retrieval task. We picked the top-k models that have the smallest Hamming distance to the query as the retrieval result. Evaluation metrics included precision, recall and F1 score.
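For reference, the three metrics for a single query can be computed as in the sketch below. Treating a retrieved model as relevant when it shares the query's category label is an assumption of this example, as are the label names.

```python
def precision_recall_f1(retrieved_labels, query_label, n_relevant):
    """Precision, recall and F1 of one top-k retrieval result.

    retrieved_labels: category labels of the k retrieved models
    n_relevant:       number of database models sharing the query's label
    """
    hits = sum(1 for label in retrieved_labels if label == query_label)
    precision = hits / len(retrieved_labels) if retrieved_labels else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Top-5 result for a "resnet" query when the database holds 4 ResNet models.
p, r, f1 = precision_recall_f1(
    ["resnet", "resnet", "vgg", "resnet", "densenet"], "resnet", n_relevant=4
)
```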

Methods in Comparison
We compared our method with GHashing [31] and a simplified version of our method denoted as HNN-Net w/o $L_{class}$. The code of GHashing [31] is open-source without a license. Raw data stands for data that were not processed with the hard-coded merge rules, and merged data stands for data that were processed with the merge rules.
As introduced in Section 3.2, "merge" is a technique adopted by our graph generator to optimize the computational graphs we generated. For example, suppose our merge rules contained a single rule "CONV-RELU" and we had a graph with only two nodes (0: CONV, 1: RELU) and one edge "CONV -> RELU". After "merge", this graph would be converted to a graph with only one node (0: CONV-RELU) and no edges. "Raw data" means that all graphs were simply converted from the ONNX format and were not processed by the "merge" operation. Applying "merge" to the raw data gives the "merged data".
Figure 5 shows evaluation curves for varying hash code lengths Q. Figure 5a,b are evaluated on raw data and merged data, respectively. We adopt α = 10, β = 10, and γ = 0.2 for HNN-Net. HNN-Net performs much better than HNN-Net without $L_{class}$ and GHashing when Q is larger than 16, regardless of which data we use. Table 2 provides comparison results among the methods, with the best ones in bold. From the table, we can conclude that our method outperforms the baselines in most cases. The improvement comes from the better information extraction from the deep neural networks according to the execution of the DNN model and from the classification loss, which preserves the semantic similarity between DNN models. Additionally, the merge operation makes better use of prior knowledge in deep learning and helps us learn a better hashing function.
In Figure 5b, 32-bit HNN-Net slightly outperforms 64-bit HNN-Net. A 64-bit hash code carries much more information than a 32-bit one. However, if a single hash code carries too much information from the training dataset, overfitting can result, leading to a decrease in performance on the test dataset.

Table 2. The comparison of recall, precision and F1 score among GHashing [31], HNN-Net without $L_{class}$ and HNN-Net in different hash code bits evaluated on the raw data and the merged data.

Figure 6 shows the results with and without merged data. Figure 6a,b demonstrate clearly that using merged data brings large performance improvements, while for our method, shown in Figure 6c, the improvement brought by the merge operation is much smaller than for the others.

The essence of the merge operation is to utilize prior knowledge of neural networks, namely their semantic information. Thus, the merge operation brings more improvement to an approach that by itself extracts less semantic information. HNN-Net introduces a semantic loss into its loss function, aiming to extract more semantic information from the data and generate better hash codes, while GHashing and HNN-Net without $L_{class}$ do not use it. Accordingly, the experimental results in Figure 6c imply that the merge operation brings the least improvement to HNN-Net compared with the other two. This suggests that, as expected, the semantic loss in HNN-Net and the merge operation extract the same kind of information from the data. In other words, both of them extract semantic information from the data effectively.

Figure 7 shows the performance of HNN-Net evaluated on raw data with varying β when α = 10, γ = 0.2, Q = 32. As we can see, the three criteria all perform the best when β = 10. A natural explanation is that a much larger β prevents the model from paying sufficient attention to the topology structure similarity among models, and a much smaller β restrains the model from preserving semantic similarities. Figure 7b is evaluated on raw data with varying γ when α = 10 and β = 10; the optimal weight is γ = 0.2. Additionally, an appropriate $L_{quant}$ can reduce the information loss caused by the conversion from the feature embedding in continuous values to a discrete binary hash code.

Conclusions
We raise the new problem of DNN model retrieval in the face of deep learning security threats. We propose a two-stage hashing scheme based on GNNs that is compatible with models implemented in most existing deep learning frameworks. HNN-Net generates hash codes that preserve both the topology structure and the semantic similarity of models, and it also learns a classifier for discriminating DNN models. The results verify that our scheme is effective and outperforms the compared baselines in most cases. HNN-Net makes full use of the shallow information in a DNN, such as topology information, network node information, and the intermediate results of network operations. However, its use of deep information in a DNN, such as the functional information of the network itself, is very limited. In the future, we will combine the shallow and deep information of the DNN model to generate better-quality perceptual hash codes.