A Ranking-Based Hashing Algorithm Based on the Distributed Spark Platform

With the rapid development of modern society, the amount of generated data has increased exponentially, and finding required data in this huge data pool is an urgent problem. Hashing technology is widely used in similarity searches over large-scale data. Among hashing methods, ranking-based hashing algorithms have been widely studied because of the accuracy and speed of their search results. At present, most ranking-based hashing algorithms construct loss functions by comparing the rank consistency of data in the Euclidean and Hamming spaces. However, most of them have high time complexity and long training times, so they cannot meet practical requirements. To solve these problems, this paper introduces the distributed Spark framework and implements a ranking-based hashing algorithm in a parallel, multi-machine environment. The experimental results show that Spark-RLSH (Ranking Listwise Supervision Hashing) can greatly reduce the training time and improve training efficiency compared with other ranking-based hashing algorithms.


Introduction
With the continuous development of computing technology and digital media technology in recent years, more data is generated every day. This data exists in many forms, including text, images, audio, and video. Obtaining the information people need from this massive, high-dimensional data quickly and accurately is an important technical problem [1,2].
At present, there are mainly two ways to solve such problems. One is the tree-based spatial partitioning method, represented mainly by red-black trees, kd-trees, and R-trees [3,4]. However, the disadvantage of this method is that it is only applicable to low-dimensional data: when the dimension rises sharply, it suffers from the "curse of dimensionality", and its search efficiency degrades to that of a linear scan. The other is the hashing-based search method, which is divided into two categories: data-independent methods, represented by locality-sensitive hashing (LSH) [5,6], and data-dependent methods, i.e., learning to hash. The latter is a popular machine learning-based approach that encodes the relevant characteristics of the data, thereby improving retrieval speed and reducing storage cost [6]. However, some hashing algorithms currently take too long to train to meet the search requirements of today's big data environment.
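The speed advantage of hashing-based search comes from comparing short binary codes with bitwise operations instead of floating-point distances. A minimal sketch in Python (all identifiers here are illustrative, not from the paper):

```python
# Minimal illustration of Hamming-distance search over binary codes.
# Names and data are illustrative only.

def hamming(a: int, b: int) -> int:
    """Hamming distance between two codes stored as integers (XOR + popcount)."""
    return bin(a ^ b).count("1")

def nearest(query: int, database: list[int]) -> int:
    """Index of the database code closest to the query in Hamming space."""
    return min(range(len(database)), key=lambda i: hamming(query, database[i]))

codes = [0b1010, 0b1100, 0b0001]
print(nearest(0b1011, codes))  # 0b1010 differs in only 1 bit -> index 0
```

Because each comparison is a single XOR plus a popcount, scanning millions of codes is far cheaper than computing Euclidean distances over high-dimensional vectors.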
On the other hand, as the data scale keeps increasing, the storage and processing requirements of such data can no longer be met in a stand-alone environment. Therefore, distributed processing systems for big data have emerged, mainly including Hadoop, Storm, and Spark. These distributed systems can process data quickly and in parallel by storing it on multiple computing nodes. Combining the advantages of learning to hash and distributed systems, this paper designs and implements a distributed learning-to-hash method based on the Spark computing platform, which greatly reduces training time and improves training efficiency. Section 2 of this paper reviews the main learning-to-hash and ranking-based hashing methods. Section 3 introduces the ranking-based hashing algorithm and the resilient distributed dataset (RDD) model in Spark. Section 4 describes the distributed ranking-based hashing algorithm and its design and implementation on the Spark platform. Section 5 analyzes the experimental results of the distributed ranking-based hashing algorithm on several datasets and compares it with several other algorithms. Section 6 concludes the paper.

Related Work
In recent years, learning-to-hash methods have been researched extensively due to their fast retrieval speed and low storage cost. Their core idea is to convert high-dimensional data into compact binary codes using various machine learning algorithms. With a reasonable learning goal, the obtained Hamming codes preserve the similarity of the original data, and the cheap computation of the Hamming distance improves the retrieval efficiency on large-scale data.
Although the above learning-to-hash methods have achieved good results, they rarely consider the ranking information present in actual search tasks. In general, when searching through a search engine, the returned results are arranged from top to bottom according to their relevance to the query point.
Based on this, some ranking-based hashing methods have appeared in recent years [18-24]. They learn to generate hashing codes by preserving either triplet information (e.g., learning hash functions using column generation (CGH) [25] and Hamming distance metric learning (HDML) [26]) or listwise supervision information (e.g., listwise supervision hashing (RSH) [27], ranking preserving hashing (RPH) [28], deep semantic ranking-based hashing (DSRH) [29], and discrete semantic ranking hashing (DSeRH) [30]). The triplet supervision method establishes a triple (x_q, x_i, x_j), where x_q represents the query point, x_i represents a data point similar to the query point, and x_j represents a data point dissimilar to it; a loss function is constructed so that, in the learned codes, x_i remains closer to x_q than x_j does. The listwise supervision method ranks all the points in the database in both the Euclidean and Hamming spaces according to their similarity to the query point, and makes the ranking of each point as consistent as possible across the two spaces.

Ranking-Based Hashing Algorithm
The ranking-based hashing algorithm is one of the supervised learning-to-hash methods, and it better satisfies the search tasks people face in real life. This paper adopts a ranking-based hashing algorithm with listwise supervision. The basic idea is shown in Figure 1. Here, x_1, x_2, x_3, and x_4 represent four data points in the dataset, and q represents the query point. By calculating the distance between q and the four data points in the Euclidean space and then ranking the four points by distance, the ranking list R_1 = (r_1, r_3, r_2, r_4) is obtained, where r_1, r_3, r_2, r_4 represent the relevance ranking of x_1, x_3, x_2, x_4 with respect to the query point q. At the same time, by encoding all the data points and calculating the distance between q and the four data points in the Hamming space, the ranking list R_2 = (r_1, r_2, r_3, r_4) is also obtained. Finally, we compare R_1 and R_2 and construct the loss function L to keep R_1 and R_2 as consistent as possible.
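The Figure 1 setup can be reproduced on toy data: rank a handful of points by Euclidean distance to a query, rank them again by the Hamming distance of (here hand-picked) binary codes, and check how many positions agree. The data values and codes below are made up for illustration:

```python
import numpy as np

# Toy version of the Figure 1 idea: compare a Euclidean ranking with a
# Hamming ranking of the same four points. Values are invented.

q = np.array([0.0, 0.0])
X = np.array([[0.1, 0.1], [1.0, 1.0], [0.5, 0.4], [2.0, 2.0]])  # x1..x4

# Ranking list in the Euclidean space (indices from nearest to farthest).
eucl_order = np.argsort(np.linalg.norm(X - q, axis=1))

# Hand-picked binary codes for q and x1..x4, then the Hamming-space ranking.
q_code = np.array([1, 1, 1, 1])
codes = np.array([[1, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 0, 0]])
ham_order = np.argsort((codes != q_code).sum(axis=1))

print(eucl_order.tolist())  # [0, 2, 1, 3]
print(ham_order.tolist())   # [0, 2, 1, 3] -> rankings agree, so loss would be 0
```

When the two orders disagree, the loss function L penalizes the discordant positions, which is exactly what training minimizes.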

Overall Description of the Algorithm
At present, the complexity of most ranking-based hashing algorithms is too high to meet existing training requirements, while the distributed Spark platform can execute the algorithm flow in parallel and shorten the training time. Therefore, this paper proposes a ranking-based hashing algorithm based on the distributed Spark platform. The algorithm is described as follows:

1. In the distributed Spark environment, all the data in the dataset are evenly mapped to different working nodes. On each working node, the Euclidean distances between the query points in the query set and all the data points in the dataset are calculated and then ranked, which gives the actual ranking of each point.

2. Similarly, the query points and all the data points are converted into binary codes on each working node, and the Hamming distances are calculated to obtain the rankings in the Hamming space.

3. According to the loss function, the inconsistency of the data points' rankings in the two spaces is minimized. The gradient of the transformation matrix is calculated on each working node, then summed over all the nodes and averaged, and the matrix is updated by gradient descent until the algorithm converges or the maximum number of iterations is reached.
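The three steps above can be sketched on a single machine by treating list partitions as stand-ins for Spark worker nodes; real code would use RDD map/reduce operations, and the quadratic per-node loss below is only a placeholder for the paper's ranking loss:

```python
import numpy as np

# Single-machine sketch of the parallel scheme: each partition plays the
# role of a Spark worker node. Every "node" computes a gradient on its
# shard and the driver averages the gradients, as in steps 1-3.
# The gradient below is a placeholder, not the paper's ranking gradient.

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))     # 12 data points, 3-d features
W = rng.normal(size=(2, 3))      # 2-bit projection matrix to learn
shards = np.array_split(X, 4)    # 4 "worker nodes"

def node_gradient(W, shard):
    # Placeholder: pushes projections Wx toward their signs sgn(Wx),
    # standing in for the ranking-consistency gradient on one node.
    P = W @ shard.T
    return (P - np.sign(P)) @ shard / len(shard)

for step in range(50):
    grads = [node_gradient(W, s) for s in shards]  # "map" on each node
    W -= 0.1 * np.mean(grads, axis=0)              # driver averages ("reduce")
```

The key design point is that each node only touches its own shard; the driver synchronizes once per iteration by averaging the per-node gradients.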

The Details of the Algorithm
Suppose there are N data points in the dataset, expressed as χ = {x_1, x_2, . . . , x_N} ∈ R^(d×N), where d represents the feature dimension of the data, and there is also a query set Q = {q_1, q_2, . . . , q_M} ∈ R^(d×M). The number of nodes in the distributed Spark cluster is S. Each working node is assigned a subset χ_s (with χ_1 ∪ χ_2 ∪ . . . ∪ χ_S = χ and χ_i ∩ χ_j = ∅ for i ≠ j), along with the full query set Q. For any query point q_j in Q, we can calculate its Euclidean distance to each point of χ_s directly from the two-point distance formula, then rank the points by distance to obtain the relevance ranking list (we assume that the smaller the Euclidean distance to the query point, the greater the correlation, and thus the smaller the rank), recorded as r_i^j ∈ [1, N/S], which represents the similarity rank of the data sample x_i^j with respect to the query point q_j. If r_m^j < r_n^j, then x_m is more similar to the query point q_j than x_n is. Our goal is to obtain a hashing function f(·) that generates binary codes h : R^d → {−1, 1}^B, with the mapping h(x) = sgn(Wx), where W ∈ R^(B×d) and B represents the code length. The Hamming distance is then calculated between the encoded query points and the encoded training points. Based on this, we divide the codes into several subspaces of the same length, where the parameter σ represents the length of each subspace and B/σ is the number of subspaces; the Hamming distances then give the Hamming-space ranking of each point in the dataset. The ranking information of a point is denoted R_m^j. Finally, we compare r_m^j and R_m^j to construct the loss function, so that the two are as consistent as possible.
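The encoding and Hamming-ranking step can be sketched as follows. The linear form sgn(Wx) is an assumption consistent with W ∈ R^(B×d); the subspace split mirrors the parameter σ in the text, and all concrete values are invented:

```python
import numpy as np

# Sketch of encoding h(x) = sgn(Wx) and the Hamming-space ranking.
# The linear hash form and all sizes here are illustrative assumptions.

rng = np.random.default_rng(1)
d, B, N, sigma = 8, 16, 6, 4          # feature dim, code length, points, subspace length
W = rng.normal(size=(B, d))

def encode(W, x):
    return np.where(W @ x >= 0, 1, -1)  # h(x) = sgn(Wx) in {-1, +1}^B

X = rng.normal(size=(N, d))
q = rng.normal(size=d)
codes = np.array([encode(W, x) for x in X])
q_code = encode(W, q)

def hamming_by_subspace(a, b, sigma):
    # Accumulate bit differences over the B/sigma subspaces of length sigma
    # (the total equals the plain Hamming distance).
    diffs = (a != b).reshape(-1, sigma)
    return int(diffs.sum())

dists = [hamming_by_subspace(c, q_code, sigma) for c in codes]
ranking = np.argsort(dists)            # Hamming-space ranking R_m^j
print(ranking)
```

Training then adjusts W so that this Hamming-space ranking matches the Euclidean-space ranking r_m^j.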
For any worker node in the Spark cluster, we define a loss function in which each data point carries a ranking weight of 1/log_2(1 + r_i); clearly, the larger a point's true rank in the Euclidean space, the smaller this weight. For all nodes together, the total loss function additionally contains a regularization term (λ/2)·||W·W^T − I||_F^2 to prevent overfitting during training, where λ represents the balance factor. Finally, we differentiate the loss function. The gradient on each working node is applied through the update rule of the gradient descent method, where α represents the learning factor, and the total gradient is obtained by summing the per-node gradients and averaging them. The entire algorithm execution architecture is shown in Figure 2.
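The regularization term can be checked numerically. Under the reading (λ/2)·||W·W^T − I||_F^2 used above, its gradient with respect to W is 2λ(W·W^T − I)W, which a finite-difference test confirms (the ranking-loss part is paper-specific and omitted here):

```python
import numpy as np

# Numeric check of the regularizer gradient:
# d/dW (lambda/2) * ||W W^T - I||_F^2 = 2 * lambda * (W W^T - I) W.
# Sizes and lambda are arbitrary test values.

rng = np.random.default_rng(2)
lam = 0.5
W = rng.normal(size=(3, 5))

def reg(W):
    M = W @ W.T - np.eye(W.shape[0])
    return 0.5 * lam * np.sum(M * M)

def reg_grad(W):
    return 2.0 * lam * (W @ W.T - np.eye(W.shape[0])) @ W

# Central finite difference on one entry of W.
eps = 1e-6
E = np.zeros_like(W)
E[1, 2] = eps
numeric = (reg(W + E) - reg(W - E)) / (2 * eps)
assert abs(numeric - reg_grad(W)[1, 2]) < 1e-5
```

Keeping W·W^T close to the identity encourages the B projection directions to stay near-orthogonal, so the code bits carry less redundant information.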

Experimental Platform Construction
The cluster system of the experimental platform consists of ten hosts: one master node (name node) and nine computing nodes (data nodes). The CPU of every host is an Intel Core i5-3470, and the memory is 8 GB DDR3. In addition, the software environment is a vital prerequisite for building the Hadoop and Spark distributed cluster environments. The specific software configuration of each machine is shown in Table 1.

Experimental Datasets
This paper mainly experiments with two public datasets, the CIFAR-10 dataset and the MNIST dataset, and compares the results with those of other ranking-based hashing algorithms (RSH [27], RPH [28]). In the experiment, the data are first pre-processed (e.g., zero-meaned), and then the image datasets are converted into matrix form by feature extraction and transformation to facilitate the computation. CIFAR-10 dataset: this contains 60,000 1024-dimensional image data points across 10 categories. In this experiment, 320-dimensional GIST features were extracted from each image in the CIFAR-10 dataset; 2000 images were randomly selected as the test dataset, another 2000 images were used as the training dataset, and 200 images were used as the query dataset.
MNIST dataset: this is a handwritten digit (0-9) image dataset containing 60,000 examples. We also extracted 520-dimensional GIST features from each image and randomly selected 2000 images as the test dataset; 2000 images were used as the training dataset and 200 images as the query dataset.
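The zero-mean pre-processing mentioned above amounts to subtracting the per-feature mean; a minimal sketch (the convention of reusing the training-set mean for query data is a standard practice we assume here, not something the paper states):

```python
import numpy as np

# Zero-mean pre-processing: subtract the per-feature mean computed on the
# training set, and reuse that same mean for query/test data so all splits
# share one reference frame. Data values are toy examples.

train = np.array([[1.0, 2.0], [3.0, 6.0]])
mean = train.mean(axis=0)                        # per-feature mean: [2., 4.]
train_centered = train - mean
query_centered = np.array([[2.0, 4.0]]) - mean   # -> [[0., 0.]]
print(train_centered.mean(axis=0))               # [0. 0.]
```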

Experimental Results
The experiments compare the time required for each hashing algorithm to complete one iteration with code lengths of 16, 32, and 64 bits on the two datasets. As can be seen from Figures 3 and 4 and Tables 2 and 3, Spark-RLSH has the shortest training time and the fastest running speed on both datasets. RSH and RPH have the longest training times because they run in a stand-alone environment. At the same time, as the code length increases, the training time of Spark-RLSH grows linearly, while for the two ranking-based hashing algorithms RSH and RPH, the training time is essentially independent of the code length and remains almost unchanged. Figure 5a,b also compares the top-50 normalized discounted cumulative gain (NDCG) values returned by the three ranking-based hashing algorithms on the two datasets. Although the NDCG value of Spark-RLSH is slightly lower than those of RSH and RPH, this result is acceptable at short code lengths considering the savings in training time. These results show that the ranking-based hashing algorithm implemented on the Spark distributed platform can greatly shorten the training time of the algorithm and improve training efficiency.
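NDCG, the quality metric reported in Figure 5, can be computed as below. This uses the standard definition (gain discounted by log2 of the position, normalized by the ideal ordering); the paper's exact gain formulation may differ:

```python
import numpy as np

# Standard NDCG@k: DCG@k = sum_i rel_i / log2(i + 2) over the top-k
# returned items (i is 0-based), normalized by the DCG of the ideal
# ordering. Relevance values below are invented.

def dcg(rels):
    return sum(r / np.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(returned_rels, k):
    ideal = sorted(returned_rels, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(returned_rels[:k]) / denom if denom > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1], k=5))  # < 1.0: the list is not perfectly ordered
```

A value of 1.0 means the returned ranking matches the ideal relevance ordering exactly, which is why a slightly lower NDCG for Spark-RLSH indicates a mild ranking-quality trade-off.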

Conclusion
This paper introduces the running architecture and working principle of Spark in detail, and presents the basic principle and specific flow of the ranking-based hashing algorithm implemented on the distributed Spark platform. Here, we divide the large dataset into as many small subsets as there are working nodes, and on each working node we compare the rankings of the dataset points with respect to the query set in the Euclidean and Hamming spaces. Finally, we construct the loss function and run the gradient descent method until the function converges or the maximum number of iterations is reached, minimizing the total loss. Experiments show that the Spark distributed platform can effectively reduce the training time of the model and greatly improve training efficiency. In the future, the following points could improve the existing ranking-based hashing algorithm:

1. Improvements to the ranking formula. After the data points are converted into binary codes, all the data must be ranked by Hamming distance before the ranking list can be constructed. This currently requires a comparison between every pair of points, so the time complexity of the ranking step is too high and seriously affects training efficiency. In the future, we can consider redesigning the ranking formula so that the algorithm model runs at the lowest possible cost.

2. Implementing the gradient descent method globally on the distributed platform. Because of the complexity of the ranking step, in this paper each working node runs the algorithm model and performs gradient descent independently. Although this reduces the training time of the model effectively, no global gradient is computed, which introduces a certain training error. In the future, an overall comparison of the Hamming distances between the query set and the whole dataset could be considered, which could improve search accuracy while keeping the training time low.
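Regarding point 1, the pairwise O(N^2) rank construction can be replaced by sorting: an argsort over the Hamming distances costs O(N log N), and a second argsort recovers each point's rank directly. A sketch with invented distances:

```python
import numpy as np

# Sorting-based rank construction: one argsort orders the points, a
# second argsort (inverse permutation) gives each point's rank, avoiding
# the O(N^2) all-pairs comparison. Distances are toy values.

dists = np.array([5, 1, 3, 1, 4])            # Hamming distances to a query
order = np.argsort(dists, kind="stable")      # indices from nearest to farthest
ranks = np.empty_like(order)
ranks[order] = np.arange(len(dists))          # rank of each point (0-based)
print(ranks.tolist())  # [4, 0, 2, 1, 3]
```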