Parallel Particle Swarm Optimization Based on Spark for Academic Paper Co-Authorship Prediction

Abstract: The particle swarm optimization (PSO) algorithm has been widely used in various optimization problems. Although PSO has been successful in many fields, optimization problems in big data applications often require processing massive amounts of data, which traditional PSO on a single machine cannot handle. Several parallel PSO algorithms based on Spark have been proposed; however, almost all of them target numerical optimization problems, and few address big data optimization problems. In this paper, we propose a new Spark-based parallel PSO algorithm to predict the co-authorship of academic papers, which we formulate as an optimization problem over massive academic data. Experimental results show that the proposed parallel PSO can achieve good prediction accuracy.


Introduction
Solving an optimization problem means finding the solutions that optimize one or more performance indices. Existing optimization methods mainly comprise mathematical programming methods and stochastic search methods. Compared with mathematical programming methods, stochastic search algorithms are simple and adaptive, and can be applied to complex problems such as black-box or multimodal problems. Among the many stochastic search algorithms, Particle Swarm Optimization (PSO) [1] is one of the most popular; it has been widely used in various optimization problems and has been successful in many fields [2]. However, with the advent of the big data era, many emerging optimization problems involve the processing of massive data. Such problems are beyond the ability of traditional PSO and must instead be solved in parallel on distributed clusters. Traditional parallel approaches, such as those using the Message Passing Interface (MPI) [3], require significant manual effort to ensure load balancing and manage communication flows. Hadoop MapReduce [4] is simple, transparent, and scalable, and provides automatic load balancing and fault tolerance. The first MapReduce version of the PSO algorithm was proposed in 2007 [5], and since then a few MapReduce-based PSO variants have been proposed and have achieved relatively promising results [6,7].
Although MapReduce-based PSO is able to process big data, MapReduce requires frequent, time-consuming hard disk I/O, which makes it unsuitable for iterative procedures. Apache Spark [8], an in-memory computing framework, makes up for this deficiency of Hadoop MapReduce and has become the most popular distributed computing framework. In recent years, Spark has also gradually replaced MapReduce as the preferred platform for parallel stochastic search optimization algorithms. Among the Spark-based PSO algorithms, the majority target numerical optimization problems [9][10][11], a few perform clustering and classification on small datasets [12,13], and none address real big data applications.
To further explore the usability of Spark-based intelligent optimization algorithms in a big data environment, we aim at the problem of academic paper co-author prediction based on big data from the real world.
In this paper, we formulate the co-author prediction problem as an optimization problem, inspired by [14], in which the authors applied a covariance matrix adaptation evolution strategy (CMA-ES) to predict links on Twitter; the CMA-ES used there is not parallel. We parallelize PSO on Spark to optimize the linear combination weights of 12 topological similarity indices for co-authorship prediction, and we pay particular attention to the design and parallel computation of the fitness evaluation so that it is better suited to big data processing, which distinguishes this work from studies that simply use common benchmark functions. Experimental results illustrate the usability of the designed algorithm for prediction in a big data environment.
The rest of the paper is organized as follows: Section 2 briefly describes the background knowledge and related work; Section 3 provides a detailed description of the implementation method for link prediction; Section 4 is an analysis of the experimental evaluation methods and results; and Section 5 concludes the whole paper and provides an outlook on future work.

Particle Swarm Optimization Algorithm
In 1995, Kennedy and Eberhart proposed a particle swarm algorithm [1] inspired by the results of artificial life research, which is a global random search algorithm based on swarm intelligence generated by simulating the migration and swarming behavior of birds in the foraging process. The basic core of the algorithm is to use information sharing of the individuals in the group so that the movement of the whole group produces an evolutionary process from disorder to order in the problem-solving space, so as to obtain the optimal solution of the problem.
In particle swarm optimization, the particle swarm is initialized as a set of random solutions, each particle in the search space is a potential solution of the optimization problem, and the optimal solution is found through iteration. In the d-dimensional search space, each particle has a d-dimensional position vector and velocity vector, and the fitness value of its current position is calculated according to the objective function. In each iteration, the particle updates itself using the best solution it has found itself and the best solution found so far by the entire population. If the set maximum number of iterations is reached, or other specified termination conditions are met, the iteration stops. The velocity and position update formulas are as follows:

$$v_i = v_i + c_1 r_1 (pbest_i - x_i) + c_2 r_2 (gbest_i - x_i) \quad (1)$$

$$x_i = x_i + v_i \quad (2)$$

where $i = 1, 2, \ldots, N$, N is the number of particles in the population, $v_i$ is the velocity of particle i, $r_1$ and $r_2$ are random numbers between 0 and 1, $x_i$ is the current position of the particle, $c_1$ and $c_2$ are learning factors, usually set to $c_1 = c_2 = 2$, $pbest_i$ is the best position in the particle's personal history, and $gbest_i$ is the best position in the history of the entire population.
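The update rule can be illustrated with a minimal Python sketch. It assumes the basic form of Formulas (1) and (2) without an inertia weight, and it draws r_1 and r_2 once per particle per update; the function name `update_particle` and the list-based vectors are illustrative choices, not part of the implementation described in this paper.

```python
import random


def update_particle(x, v, pbest, gbest, c1=2.0, c2=2.0, rng=random):
    """Apply one velocity and position update to a single particle.

    x, v, pbest and gbest are equal-length lists of floats.
    Returns the new position and the new velocity.
    """
    # Assumption: one r1 and one r2 per particle per update
    # (some PSO variants draw them separately for every dimension).
    r1, r2 = rng.random(), rng.random()
    new_v = [
        vd + c1 * r1 * (pd - xd) + c2 * r2 * (gd - xd)
        for xd, vd, pd, gd in zip(x, v, pbest, gbest)
    ]
    # The new position uses the freshly updated velocity.
    new_x = [xd + vd for xd, vd in zip(x, new_v)]
    return new_x, new_v
```

When a particle already sits at both its personal best and the global best, the two attraction terms vanish and the particle simply keeps moving with its current velocity.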

Co-Authorship Prediction
Given a network snapshot $G = (V, E_t)$ at time t, where V represents the nodes over all time steps and $E_t$ represents the links at time t, link prediction is the task of predicting the most likely new links at the next time step t + 1. For co-authorship prediction, a node represents an author and a link represents co-authorship between two authors. Link prediction strategies are generally divided into three categories: similarity index-based strategies, maximum likelihood algorithms, and probabilistic models. For large-scale sparse networks, similarity-based strategies are generally used, and similarity indices are in turn divided into topological similarity indices and individual-characteristic similarity indices.
Similarity indices capture the commonalities between two nodes, and the closeness of the relationship between nodes in the network can be calculated from them. However, no single similarity index can completely identify the newly formed links, and there is no guarantee that any one index performs well in the general case. A predictor that combines the information of multiple similarity indices can improve link prediction, as confirmed in the link prediction work of Bliss et al. [14]. This paper uses the 12 common topological similarity indices selected in that work (see Table 1). By optimizing the weights of the linear combination of these similarity indices, link prediction becomes an optimization problem.
Table 1. Topological similarity indices selected in this paper. $G = (V, E)$ is a network consisting of vertices V and edges E. The set of neighbors of node u is $\Gamma(u) = \{v \in V \mid e_{u,v} \in E\}$, the degree of node u is $k_u$, A is the adjacency matrix, and $\phi_n(u, v)$ is a path of length n between $u, v \in V$.

Apache Spark
Spark [8] originated in the AMPLab at the University of California, Berkeley in 2009 and is implemented in Scala. It is a unified big data processing framework characterized by high speed, ease of use, generality, and the ability to run in many environments. Because it computes in memory, Spark is better suited to iterative and interactive applications, and it effectively compensates for the time-consuming overhead that MapReduce incurs when processing iterative operations.
The Resilient Distributed Dataset (RDD) is the cornerstone of Spark. It is the core distributed memory abstraction on which Spark data processing is built. RDDs are immutable, partitionable, and resilient, which makes it possible to perform in-memory computations on large clusters in a fault-tolerant manner. An RDD is processed with two kinds of operations: transformations, which convert one RDD into another and thereby build the lineage of the RDD; and actions, which trigger the computation of the RDD to obtain a result [15] or to save the RDD to a file system such as HDFS. Because RDDs are evaluated lazily, all preceding transformations are executed only when an action is performed.
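The lazy behavior described above can be mimicked with a plain-Python analogy. The sketch below does not use Spark; the class `LazyDataset` and its methods are hypothetical stand-ins that only imitate the idea that a transformation records lineage while an action replays it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = tuple(lineage)

    def map(self, fn):
        # Transformation: return a new dataset with a longer lineage.
        # No element is touched here.
        return LazyDataset(self._source, self._lineage + (fn,))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        out = []
        for item in self._source:
            for fn in self._lineage:
                item = fn(item)
            out.append(item)
        return out


calls = []


def traced_double(value):
    # Records every invocation so that laziness can be observed.
    calls.append(value)
    return 2 * value


dataset = LazyDataset([1, 2, 3]).map(traced_double).map(lambda value: value + 1)
calls_before_action = len(calls)  # still 0: nothing has been computed yet
result = dataset.collect()        # the action triggers both recorded maps
```

Each call to `map` also leaves the original object unchanged, which loosely mirrors the immutability of RDDs.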

Related Work
In recent years, in order to better adapt to large-scale data processing, stochastic search algorithms have been implemented repeatedly on the big data computing frameworks MapReduce and Spark. Taking particle swarm optimization as a typical example, since the first successful parallel implementation under MapReduce in 2007 [5], related work has appeared steadily and significant results have been achieved. Among these works are MapReduce implementations of PSO variants. For example, Sadasivam et al. proposed a hybrid PSO-GA (genetic algorithm) implementation under MapReduce to solve the task allocation problem [6], which reduces the time needed to analyze data-intensive and computation-intensive applications. Wang et al. proposed a parallel K-PSO based on MapReduce that combines PSO and the K-means algorithm [7]; it uses PSO to improve the global search capability of K-means and uses MapReduce parallelization to enhance its ability to process large amounts of data. Li et al. implemented under MapReduce a quantum-behaved PSO (QPSO) that can effectively prevent PSO from falling into local optima [16]. Their experiments show that parallel QPSO is superior to the serial version in terms of search capability and solution quality. Other works address specific applications, such as large-scale network intrusion detection systems [17,18], real-time clustering of Tweets [19], and minimizing thermal residual forces in ceramic matrix composites [20]. Parallel PSOs based on MapReduce perform well in reducing running time and coping with large amounts of data.
Since about 2016, Spark has largely replaced MapReduce, owing to its significant advantages, as the big data computing framework for parallel implementations of PSO. According to our investigation, most of the known literature on parallel PSO implementations on Spark involves solving numerical optimization problems. For example, Guo et al. used a PSO parallelized on Spark to deal with the optimization problem of composing Web services with similar functions but different quality in a cloud computing environment [9]. Duan et al. parallelized the three most frequently cited particle swarm optimizer versions on Spark to address their high computational cost [10]. Zhang et al. used Spark and a parallelized PSO algorithm to optimize reservoir dispatching rules [11]. A few works perform clustering or classification on small data sets. For example, Sherar et al. proposed a hybrid K-means PSO implemented on Apache Spark for large-scale clustering [12], and Al-Sawwa et al. proposed a scalable design and implementation of PSO based on Spark to extract useful information for decision support [13]. However, none of these works targets real big data applications.
There are precedents for applying intelligent optimization algorithms to link prediction: Sherkat et al. studied structural link prediction in social networks based on an ant colony approach [21], Barham et al. performed link prediction based on the whale optimization algorithm [22], and Shi et al. studied user relationship prediction based on matrix decomposition and hybrid PSO [23], among others. However, all of these algorithms are non-parallel, which makes it difficult for them to cope with massive data. Parallelizing intelligent optimization algorithms is an inevitable trend if they are to fit the big data environment better.

Data
The academic social network co-author data set used in this experiment is a real data set taken from AMiner, which integrates academic data from multiple sources and aims to provide comprehensive search and mining services for researcher social networks. The data come specifically from a paper that studied influence maximization in dynamic social networks [24], which builds a dynamic collaborator network from ArnetMiner with a time span of 27 years, from 1986 to 2012. The year is the timestamp. In each timestamp, an undirected edge is created between two authors who have collaborated on at least one paper in the last 3 years, and each undirected edge is treated as two symmetrical directed edges, which transforms the undirected co-author network into a directed network.
We select the first five years of data in the co-author data set, that is, the co-author directed network from 1986 to 1990, to carry out the experiment. Each year's sparse matrix data is a set of records, each consisting of a node-node pair and the corresponding similarity index values, covering all author nodes that appear in that year. Using the year as the time step, we predict the links in the next year from the links in the previous year, which gives four verification sets (1986->1987, 1987->1988, 1988->1989, 1989->1990). The basic information on the five years of data is shown in Table 2.

Experimental Environment
The platform and experiments run under the Spark framework. The network topology is shown in Figure 1. The platform includes one master node and multiple slave nodes. As a scalable platform, the number of workers can be changed as needed. We have four servers in total, so we use three slave nodes here, and all nodes have the same hardware and software configuration, as shown in Table 3. For hardware, the servers are configured with Intel(R) Xeon(R) Gold 5215 CPUs. Each node has 40 CPU cores at 2.50 GHz and 240 GB of physical memory. In aggregate, our four-node cluster has 160 CPU cores and 960 GB of RAM. As for software, each server runs Ubuntu 18.04.1. We first built a four-node Hadoop cluster with Hadoop 2.10.0, and on this basis built a Spark cluster with Spark 3.0.0, whose built-in Scala version is 2.12.10. Both MapReduce and Spark were deployed on JDK 1.8.0_131.

Figure 2 roughly describes the process of link prediction using particle swarm optimization. Before running the algorithm, we calculated all the topological similarity index values and stored them in the N × N sparse matrices $S_i$, $i = 1, 2, \ldots, 12$, where N is the number of author nodes in the academic social network. This step yields the similarity index values for every node pair. The similarity indices are linearly combined according to Formula (3), and the coefficients $w_i$ are updated by the evolution of the particle swarm algorithm, which means that w is simply the position vector of a particle. The initial value of each $w_i$ is randomly selected from 0 to 1.
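The scoring step can be sketched in Python as follows, assuming that Formula (3) is the plain weighted sum of the similarity values of a node pair. The dictionary layout (node pair mapped to a list of index values), the function names, and the three-index toy data are illustrative assumptions; the paper itself uses 12 indices stored in sparse matrices.

```python
def combined_score(weights, sims):
    """Weighted sum of the similarity index values of one node pair."""
    return sum(w * s for w, s in zip(weights, sims))


def score_pairs(weights, sparse_rows):
    """Score every stored node pair.

    sparse_rows maps (u, v) to the list of similarity index values of that
    pair; pairs that are absent from the mapping are never scored.
    """
    return {pair: combined_score(weights, sims) for pair, sims in sparse_rows.items()}


# Toy data with 3 indices instead of 12 (hypothetical values).
toy_rows = {
    ("a", "b"): [2.0, 4.0, 1.0],
    ("a", "c"): [0.0, 2.0, 3.0],
}
toy_scores = score_pairs([1.0, 0.5, -1.0], toy_rows)
```

Because the weight vector is the particle position, every particle induces its own ranking of the same node pairs.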

Fitness Evaluation
In each iteration, the node-node pairs with the topN highest scores in S are taken as the predicted links. The aim of the algorithm is to find, within a limited number of iterations, the w that maximizes the proportion of correctly predicted links among the predicted links. Comparing the predicted links with the links that actually appear in the next year yields the incorrect links. As shown in Formula (4), the fitness value is the ratio of the number of incorrect links to the number of predicted links. Since the number of predicted links is the constant topN, the fitness value is proportional to the number of incorrectly predicted links; equivalently, it decreases linearly as the number of correctly predicted links grows. A fitness value of 0 means that there are no incorrect links, that is, all predicted links are correct, which is the ideal result. In each iteration, the particle with the smallest fitness found so far is recorded as the historical best and carried into the next iteration. The iteration stops when the maximum number of iterations is reached, or earlier if the historically best particle reaches a fitness value of 0.
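A small serial Python sketch of this fitness evaluation is given below. It assumes that the scores are available as a dictionary from node pair to score and that the next-year links are given as a set; the function name and the toy values are hypothetical, and the global sort is used only for brevity.

```python
def fitness(scores, next_year_links, top_n):
    """Fraction of the top_n highest-scoring pairs that are not real links."""
    # Rank the pairs by score, highest first, and keep the top_n of them.
    predicted = sorted(scores, key=scores.get, reverse=True)[:top_n]
    incorrect = sum(1 for pair in predicted if pair not in next_year_links)
    # The denominator is the constant top_n, as described in the text.
    return incorrect / top_n


toy_scores = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.8,
    ("c", "d"): 0.7,
    ("b", "c"): 0.1,
}
toy_next_year = {("a", "b"), ("c", "d")}
```

With these toy values, `fitness(toy_scores, toy_next_year, 2)` counts one wrong link among the two predictions, while taking only the single best pair gives a fitness of 0.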

Spark Implementation of Particle Swarm Algorithm
The complete pseudocode of the Spark implementation is shown in Algorithms 1-3. Algorithm 1 is the main program running on the driver. Before starting, the required data files are loaded into RDDs through Spark's textFile method, and the RDDs are then transformed as needed. sparseMatrixRDD and testRDD are key-value pair RDDs. Because the program uses them every time a fitness value is calculated, we call Spark's persist method (line 2) to keep both RDDs in memory; before persisting, sparseMatrixRDD is hash-partitioned with the partitionBy method. Next, we randomly initialize a particle RDD without fitness values, particlesPreRDD, by calling Spark's parallelize and map methods, and then call Algorithm 2 to calculate the fitness value of each initialized particle and its personal best fitness value (line 5). This yields the particle RDD particlesRDD used in the iterative loop (line 6), and the best individual gbTemp of the initial swarm is temporarily taken as the global historical best individual gb (line 8). While the iteration limit has not been reached, each iteration broadcasts all particles and the best particle (lines 10 and 11) and then calls Algorithm 3 to update the velocity and position of every particle. The personal best position and personal best fitness value of each particle are adjusted where necessary, and the best individual gbTemp of the updated swarm is found again (line 13). If gbTemp is better than the recorded global historical best gb, gb is replaced by gbTemp (lines 14 and 15).
In our link prediction for academic paper collaborators, the dimension of w must equal the number of similarity indices, so the particle dimension is 12; for convenience, we also set the population size to this value in the experiment. The population is therefore small while the amount of data is huge, and the most time-consuming part of the particle swarm algorithm is the calculation of the fitness value. How to parallelize fitness evaluation in the Spark implementation thus becomes the key consideration. In Algorithm 2, the predicted links must be obtained before the fitness value can be calculated. This requires several operations on the partitioned key-value pair RDD sparseMatrixRDD (line 2). First, Spark's mapValues method is applied to the values: for each link, the sum of products of its similarity index values and the corresponding dimensions of w (that is, of the particle position vector) is computed, and this sum is used as the score for deciding whether the link is a predicted link. Because of the huge amount of data, we avoid sortByKey and other shuffle-related operators as far as possible. To extract the top N from hundreds of millions of records, we use mapPartitions to build a min-heap in each partition, use flatMap to flatten the per-partition heaps, and then apply a min-heap once more to the combined data. After flattening, the keys are taken to form the predicted link RDD predictRDD. The predicted links are then compared with the test links, and the fitness value is easily calculated.
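The two-stage top-N extraction can be illustrated without Spark. In the sketch below, a list of lists stands in for the partitions of an RDD, and Python's `heapq` module provides the min-heap; the function names and the toy records are assumptions made for illustration and are not taken from Algorithm 2.

```python
import heapq


def top_n_in_partition(partition, n):
    """Keep the n highest-scoring (pair, score) records using a size-n min-heap."""
    heap = []
    for pair, score in partition:
        if len(heap) < n:
            heapq.heappush(heap, (score, pair))
        elif score > heap[0][0]:
            # The heap root is the smallest kept score; replace it.
            heapq.heapreplace(heap, (score, pair))
    return [(pair, score) for score, pair in heap]


def top_n_pairs(partitions, n):
    """Global top-n pairs: per-partition heaps first, then one final heap."""
    # Stage 1: at most n candidates survive per partition (no global sort).
    candidates = [rec for part in partitions for rec in top_n_in_partition(part, n)]
    # Stage 2: one more heap pass over the much smaller candidate list.
    best = top_n_in_partition(candidates, n)
    ranked = sorted(best, key=lambda rec: rec[1], reverse=True)
    return [pair for pair, _ in ranked]


toy_partitions = [
    [(("a", "b"), 0.9), (("a", "c"), 0.2)],
    [(("b", "c"), 0.8), (("c", "d"), 0.95)],
    [],
]
```

The design point is that each partition forwards at most n records, so the final merge handles at most n times the number of partitions records rather than the full data set.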

Algorithm 3 updateParticle Algorithm
Input: particle, gbest, sparseMatrixRDD, testRDD, topNBC
Output: particle
1: Update velocity and position according to Formulas (1) and (2)
2: Evaluate fitness using Algorithm 2
3: if fitness < pbFitness then
4: set current particle to pbest
5: end if
6: return particle

Figure 3 shows the fitness20 (topN = 20) results for new links formed during the training period from 1986 to 1987. The black solid line depicting the "Topo12" predictor shows that the average optimal fitness value of 100 candidate solutions dropped sharply from 0.7505 to 0.0475 within 10 generations, after which the decline slowed, reaching 0.006 by the 27th generation and 0.0025 by the 50th generation. After that, the line moved ever closer to the X-axis. By the 87th generation, it coincided with the X-axis, and the average optimal fitness value reached the desired result of 0. The curves are consistent with the characteristic behavior of particle swarm optimization, which converges quickly at the beginning of the iteration and slowly at the end, with relatively fast convergence overall. In fact, the optimal fitness value converges to 0 much earlier in most runs, but the figure depicts the average over 100 runs. Because the particle swarm algorithm tends to fall into local optima late in the process, several of these runs take far more generations than normal to reach a fitness value of 0, which increases the number of generations required for the overall average.
To better understand the contribution of each topological similarity index to the link predictor, we plotted Figure 4. Figure 4 shows all 100 solutions obtained by evolving the particle swarm optimization algorithm in parallel on Spark for 250 generations, where each row corresponds to one solution w. The i-th column represents the coefficient $w_i$ used in the linear combination, and the color indicates the value of the coefficient. The figure shows that the 100 candidate solutions differ significantly from each other. Nevertheless, the coefficients of Average Path Weight and Katz are positive far more often than negative: more than 90% of the Average Path Weight coefficients are positive and Katz has no negative coefficients at all, while the Preferential Attachment coefficients are almost all negative, with only two sporadic positive values. This means that an author pair with high Average Path Weight and Katz values and a low Preferential Attachment value tends to score highly, so a link between the two authors is more likely to form in the future, that is, they are more likely to collaborate.
The ranking frequencies of each similarity index according to its coefficient are visualized in Figure 5. The coefficients are ordered from the most positive (first place) to the most negative (12th place). As can be seen from the figure, Average Path Weight and Katz frequently occupy the first to fourth positions in the ranking, while the Hub-promoted Index, Hub-depressed Index, Leicht-Holme-Newman Index, Salton Index, and Sørensen Index most frequently rank 8th to 12th. The other indices are relatively dispersed.
Since the positive class is much smaller than the negative class in large sparse networks, metrics such as accuracy and negative predictive value are very close to 1 even for random link predictors. Therefore, this paper focuses on recall and precision, which are shown in Equations (5) and (6), respectively, and Equation (7) is a combined metric $F_\beta$ that combines the two.

Experiment and Result Analysis
Experimental Evaluation Methods and Results
β is used to adjust the relative weight of recall and precision: when β = 1 both have the same weight, β is reduced if precision is considered more important, and β is increased if recall is considered more important.
Setting β to 0.5, 1, and 2 in turn and plotting Figure 6, it is easy to see that $F_1$ reaches its maximum at roughly $N \approx 10^3$. Since the number of author nodes in the selected academic network increases with the year, the number of node-node pairs formed by any two authors also increases with the year, by hundreds of millions or more. It is observed that for years with a larger number of links, the $F_\beta$ value is also clearly higher.
Figure 7 depicts the precision of the link prediction for the top N-scoring author-author pairs. Across the validation sets, the algorithm achieves essentially error-free precision when fewer than about $10^2$ top-scoring author-author pairs are selected, and extremely high precision between $10^2$ and $10^3$. After that, the curves drop sharply one after another, ordered by validation set year, mainly because essentially all correctly predictable links have been identified by the time N reaches about $10^3$, and increasing N further only adds false positives. Overall, our prediction of co-authorship works well.
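The evaluation metrics can be sketched in Python as follows. The sketch assumes the standard definitions of precision, recall, and F_β over sets of predicted and actual links, which may differ in notation from Equations (5)-(7); the function name and the toy link sets are hypothetical.

```python
def precision_recall_fbeta(predicted, actual, beta=1.0):
    """Return (precision, recall, F_beta) for predicted versus actual links."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    # Standard F_beta: beta > 1 weights recall more, beta < 1 weights precision more.
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    f_beta = (1 + beta ** 2) * precision * recall / denominator
    return precision, recall, f_beta


# Toy example: four predicted links, two of which are real.
toy_predicted = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
toy_actual = {("a", "b"), ("c", "d")}
```

In this toy case precision is 0.5 and recall is 1.0, so F_β rises as β grows from 0.5 to 2, in line with the role of β described above.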

Summary and Outlook
In big data research, link prediction in social networks has always been an important area. In this article, we use the Apache Spark framework to design and implement a parallel particle swarm optimization link prediction algorithm and apply it, for the first time, to the prediction of co-authorship of academic papers. By paying particular attention to the design and parallel computation of the fitness evaluation, our algorithm adapts well to the task of big data processing. We conducted experiments with the proposed algorithm on a real academic paper collaborator data set. We plotted the average convergence of the fitness values, indirectly observed the performance of each similarity index through the range and ranking of the dimension values of the best w, and used the evaluation indicators precision and $F_\beta$ on four validation sets to further examine the prediction effect. The experimental observations also indicate that the link prediction effect of our method is better than that of approaches that simply choose serial intelligent optimization algorithms, or that simply use a big data framework without adapting the fitness calculation to real big data applications. This illustrates the effectiveness of the particle swarm optimization algorithm and the high adaptability of Spark to iterative processing of large-scale data, even when the population is small. In future work, we suggest using improved variants of the particle swarm algorithm or other swarm intelligence optimization algorithms for link prediction or for other real big data applications, which may bring further gains.

Conflicts of Interest:
The authors declare no conflict of interest.