A Community Detection and Graph Neural Network Based Link Prediction Approach for Scientific Literature

This study presents a novel approach that synergizes community detection algorithms with various Graph Neural Network (GNN) models to bolster link prediction in scientific literature networks. By integrating the Louvain community detection algorithm into our GNN frameworks, we consistently enhance performance across all models tested. For example, integrating Louvain with the GAT model raised the AUC score from 0.777 to 0.823, exemplifying the typical improvements observed. Similar gains are noted when Louvain is paired with other GNN architectures, confirming the robustness and effectiveness of incorporating community-level insights. This consistent uplift in performance, reflected in our extensive experimentation on bipartite graphs of scientific collaborations and citations, highlights the synergistic potential of combining community detection with GNNs to overcome common link prediction challenges such as scalability and resolution limits. Our findings advocate for the integration of community structures as a significant step forward in the predictive accuracy of network science models, offering a comprehensive understanding of scientific collaboration patterns through the lens of advanced machine learning techniques.


Introduction
Graphs are intricate data structures composed of vertices and edges, and link prediction aims to foresee potential connections and their strengths within a graph's structure [1]. While traditional methods like Markov chains and machine learning have found utility in applications such as website browsing and citation prediction [2] [3], they often rely heavily on node attributes, which can be challenging to obtain and validate in real-world settings. Consequently, these methods may not always yield high accuracy in predictions. Graph embedding techniques, by contrast, can effectively encapsulate complex structures and relational dynamics, paving the way for more accurate applications of link prediction algorithms, now an active area of research.
In this paper, we introduce a pioneering approach that unifies community detection algorithms with Graph Neural Networks (GNNs) to refine link prediction within scientific literature networks. Employing the Louvain algorithm, we reveal latent community structures that GNN models can exploit to enhance link prediction precision. This methodology is particularly potent in scientific collaboration networks, where understanding and forecasting collaborative patterns is crucial. For instance, our approach could significantly improve paper recommendation systems, aiding researchers in finding relevant studies and potential citations when preparing their scientific manuscripts [4]. Such systems are invaluable given the recent exponential growth in scientific publications and the critical role of citing pertinent work.
Our integrated approach, succinctly depicted in Figure 1, marks a substantial advancement in predictive network analysis. It is especially pertinent to the scientific literature domain, where discerning and anticipating collaboration and citation dynamics can offer profound insights, potentially fostering novel research intersections and driving the progress of scientific inquiry forward. The structure of the paper is as follows: Section 2 discusses related work to frame the current advancements in link prediction and graph neural networks. Section 3 explains the datasets, data collection, and preprocessing steps. Section 4 delves into our methods, firstly examining heuristic-based link prediction techniques, then exploring machine learning methods, followed by a detailed look at our advanced graph neural network approaches with community detection. Section 5 discusses the broader implications, limitations, and avenues for future research arising from our study. The paper concludes in Section 6, summarizing our contributions and reflecting on the potential impact on network analysis in the realm of scientific literature.

Related Work

Graph Neural Networks
The graph structure is non-Euclidean: nodes are arranged irregularly, the neighbors of a given node are hard to define, and different nodes have different numbers of neighbors. Such data therefore cannot be directly processed by deep learning methods such as convolutional neural networks and recurrent neural networks. Graph embedding models address this by reducing the dimensionality of the graph structure, i.e., mapping a highly dense graph into a low-dimensional representation. Graph embedding algorithms are now widely used in node classification, link prediction, clustering, and visualization [5].
Because graph structures and the algorithmic principles applied in the embedding process differ, graph embedding models are mainly categorized as follows: graph embedding for graph signal processing, graph embedding for matrix decomposition, graph embedding for random walk, and graph embedding for deep learning [10]. Observing that existing research ignored the correlation between static and dynamic graphs, Yuan et al. classified static and dynamic graph models together into five categories based on the algorithmic principles applied in the embedding process, namely: graph embedding based on matrix decomposition, graph embedding based on random walk, graph embedding based on autoencoders, graph embedding based on graph neural networks, and graph embedding based on other methods [11].
The following summarizes the current mainstream graph embedding algorithms and briefly introduces their principles. Singular value decomposition (SVD) is a typical algorithm for decomposing adjacency matrices; it essentially applies orthogonal vector transformations to factorize a matrix. Building on generalized SVD (GSVD) and incorporating asymmetric transitivity, the HOPE algorithm was derived [12]. Node2vec [13], a classical random walk algorithm inspired by the skip-gram and DeepWalk algorithms from language modeling, introduces two parameters that balance breadth-first and depth-first traversal when generating sequences of neighboring nodes via random walks, thereby accounting for both the homophily and the structural equivalence of the graph. Because real-world networks are strongly time-varying, the CTDNE algorithm [14] adds temporal information to static-graph random walks, i.e., walks constrained to temporal paths, making it better suited to dynamic graph scenarios. Based on the eigendecomposition of the graph Laplacian, GCN [15] processes node features together with the connections between nodes to produce new node representations. Building on GCN, GraphSAGE [16] uses known nodes to predict unknown nodes through a fixed-size sampling mechanism, and GAT [17] introduces an attention mechanism that assigns weights to nodes so that the complete graph need not be provided as input, which better matches real-world scenarios.
In addition, since many real-world problems involve relations that are not pairwise, concepts such as hypergraphs and heterogeneous graphs have arisen. To better study these graph structures, researchers have proposed graph embedding algorithms for hypergraphs, e.g., DHNE [18] and HGNN [19], as well as algorithms for heterogeneous graphs, e.g., HGSL [20].
Building on the extensive survey of graph embedding techniques presented in Section 2, our study specifically focuses on advances in GNN methods, particularly GAT, GCN, and GraphSAGE. These methods represent the prior work most directly related to our study due to their state-of-the-art performance on various graph-based tasks and their relevance to our proposed approach. In the context of these foundational GNN models, our work introduces a novel enhancement by integrating the Louvain method for community detection. This integration significantly boosts the models' performance across multiple metrics, including precision, recall, F1 score, and AUC score. The Louvain method's ability to uncover latent community structures provides an additional layer of information, enriching the original node features and thereby enabling the GNN models to achieve a deeper understanding of the graph's topology. This enhancement aligns with recent trends in deep learning that prioritize the incorporation of structural and contextual information within the learning process, affirming our contribution to the evolving landscape of graph neural network research.

Link Prediction
The following key metrics are mainly used in link prediction algorithms: the Area Under the Curve (AUC) metric, Accuracy, the Common Neighbor (CN) metric, the Adamic-Adar (AA) metric, the Resource Allocation (RA) metric, the Preferential Attachment (PA) metric, the Random Walk with Restart (RWR) metric, and the Katz metric. Traditional link prediction algorithms can be divided into two categories, unsupervised and supervised, where unsupervised algorithms are mainly centered on generating functions, node clustering, likelihood analysis, and related models, while the models under supervised algorithms are more diverse. In recent years, many graph embedding algorithms have been applied to link prediction, e.g., the DeepWalk and LINE algorithms. In practical problems, since network structures tend to be complex and diverse, current research on link prediction models focuses mainly on temporal network link prediction, heterogeneous network link prediction, and directed graph link prediction.
Links in temporal networks change over time and thus have a more complex and dynamic structure. For link prediction in time-series networks, Rahman et al. transformed the link prediction problem into an optimal coding problem for graph embedding through feature representation extraction, minimized the reconstruction error through gradient descent, and proposed the Dylink2vec algorithm. This method considers both the topology of the network and the historical changes of the links; compared to traditional link prediction methods based on a single topological structure and to DeepWalk-based link prediction, Dylink2vec achieved higher CN, AA, and Katz metrics in experiments [21]. Lekui Zhou et al. proposed the DynamicTriad algorithm in conjunction with representation learning; it simultaneously preserves the network's structural information and its dynamic changes, and through the triadic closure process, in which disconnected nodes of a triad become connected to each other, it captures the dynamics at each moment and performs representation learning [22]. Wu et al. established the GETRWR algorithm, which draws on random walk graph embedding and improves the transition probability in the RWR metric by applying the node2vec model; by introducing a restart factor, i.e., a probability that a particle returns to its initial position during the random walk, and dynamically calculating the particles' transition probabilities, it achieves higher AUC and RWR scores [23].
Nodes and edges in heterogeneous networks have different types, so many link prediction methods designed for homogeneous networks are not directly applicable. For link prediction in heterogeneous networks, Dong et al. proposed the metapath2vec algorithm, which first uses a metapath-based random walk model to construct the neighborhoods of nodes in the network, then performs graph embedding on them using the skip-gram algorithm, and finally predicts a node's neighbors through negative sampling tailored to heterogeneous networks. This algorithm maximally preserves the structure and semantics of the heterogeneous network [24]. Fu et al., in turn, constructed a neural network model for heterogeneous network representation learning and used the Hin2vec algorithm to capture the node relationships and network structure within it, combining metapaths with vector feature learning in comparative experiments. Compared with traditional heterogeneous network link prediction methods, this algorithm better preserves contextual information, including the correlation and differentiation between nodes [25]. Jiang et al. improved the graph convolutional neural network: they replaced the traditional graph convolution transfer layer with a HeGCN layer to make it more suitable for heterogeneous networks, combined it with an adversarial learning layer to obtain node representations, and trained the model with the Adam optimizer to reduce the loss, improving the AUC and PA metrics in link prediction compared with the DeepWalk and Hin2vec algorithms [26].
Edges in directed networks carry directional information; edge directionality is very common in real-world problems, and ignoring it reduces a model's predictive power. For link prediction in directed networks, Lichtenwalter et al. proposed the PropFlow algorithm based on random walks, which assigns parameters to potential connections through a baseline; walkers choose routes according to these weights, with the walk restricted to the direction of the edges, and since the walkers need neither restarts nor convergence checks, computation time is reduced [27]. Ramesh Sarukkai et al. used Markov chain methods from machine learning for link prediction and path analysis; the model adapts dynamically to historical changes and, combined with eigenvector decomposition, yields link tours, providing an effective tool for Web path analysis and link prediction [2]. Pan et al. applied neural networks to link prediction in directed graphs and proposed the NNLP algorithm, optimized with a standard particle swarm algorithm and trained with error backpropagation and iterative BP learning; the training process fuses indicators such as CN, AA, PA, and Katz, thereby improving the model's complementarity and accuracy [28].
In addressing the evolving landscape of link prediction, our study employs the same rigorous evaluation metrics as previous research to ensure comparability and consistency. We have established a robust baseline encompassing heuristic-based methods and machine learning approaches, which sets the stage for our advanced GNN models.

Data Collection
In this study, we harnessed the Web of Science database to construct a dataset central to technological advances in zinc batteries, an area increasingly vital due to the surge in demand for green energy solutions and new energy electric vehicles. Recognizing zinc's relative abundance compared to lithium, and its potential for industrial application, our search targeted a corpus reflective of the most recent and impactful research.
We retrieved a comprehensive collection of 10,187 research articles using a refined search strategy based on relevant keywords that signify the core technologies and innovations in the field: TI=(Zinc) AND TI=(Anode) OR TI=(Zn) AND TI=(Anode) OR TI=(Zn) AND TI=(Batter*) OR TI=(Zinc) AND TI=(Batter*) [29]. This strategy was meticulously designed to capture the breadth and depth of contemporary zinc battery research, which is predominantly represented in publications post-2010, accounting for over 90% of the total corpus and indicating a burgeoning interest and development in this domain. The dataset includes articles through November 2023, with a significant concentration of papers from authoritative journals such as the Journal of Power Sources, Chemical Engineering Journal, ACS Applied Materials & Interfaces, Journal of the Electrochemical Society, and Journal of Materials Chemistry A. This approach ensures that our analysis is grounded in the most current and influential studies, providing a robust foundation for our subsequent predictive modeling and network analysis.

Train-Test Split
In our study, a network graph was generated from an edge dataset, representing the connections between nodes. This graph was then divided into training and test sets, adhering to a standard 80:20 split ratio, resulting in a training set comprising 80% of the data and a test set with the remaining 20%.
To counter the class imbalance typically present in network data, we generated synthetic non-links. This procedure continued until the number of non-links equaled the number of actual links within the training set, ensuring a balanced class distribution. The same method was mirrored in the preparation of the test set. As a result, the training set expanded to 296,574 samples, while the test set contained 74,144 samples. This deliberate balancing strategy was critical for developing a model that can accurately predict both the existence and absence of links, which is essential for robust network analysis.
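The balancing procedure described above can be sketched in pure Python; the function name, seed, and toy edge list below are illustrative and not part of our pipeline:

```python
import random

def sample_non_links(edges, num_nodes, seed=42):
    """Draw as many node pairs without an edge as there are observed
    links, yielding a balanced 50/50 class distribution."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    non_links = set()
    while len(non_links) < len(edges):
        u = rng.randrange(num_nodes)
        v = rng.randrange(num_nodes)
        # keep the pair only if it is not a self-loop or an existing link
        if u != v and frozenset((u, v)) not in edge_set:
            non_links.add((min(u, v), max(u, v)))
    return sorted(non_links)

links = [(0, 1), (1, 2), (2, 3), (3, 0)]
negatives = sample_non_links(links, num_nodes=5)
```

The sampled negatives and the observed links together form the balanced training (or test) set.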

Overview
Table 1 provides a summarized comparison of various heuristic-based methods for link prediction in terms of their performance metrics: Precision, Recall, F1 Score, and AUC Score. The table lists five methods, from a baseline 'Random Prediction' to more sophisticated approaches like 'Resource Allocation'. Each method's efficacy is quantified, with 'Adamic/Adar' showing the highest precision and 'Common Neighbor' exhibiting the best balance between precision and recall, as reflected by its F1 Score.

Random Prediction
A random prediction method is employed as a baseline model for evaluating the test set. This method involves randomly assigning binary labels (0 or 1) to each element of the test set, effectively simulating a naive approach where predictions are made without any data-driven learning or inference. This technique establishes a foundational benchmark for assessing the efficacy of more sophisticated models.
The outcomes of this random prediction strategy are quantified using standard performance metrics. The results show essentially equal values of approximately 0.5002 for precision, recall, F1 score, and AUC (Area Under the Curve) score. These metrics indicate that the model performs at a level akin to random guessing, as expected from this method. The consistency across all metrics underscores the non-discriminatory nature of the random prediction approach. These results set a baseline that more advanced, data-informed models must exceed to be considered valuable in predictive analytics within this domain.

Common Neighbors
In this segment of the analysis, the Common Neighbors Method is implemented as a technique for link prediction. This method operates on the principle that two nodes in a network are more likely to form a link if they share a significant number of common neighbors. The approach involves calculating the number of common neighbors for each pair of nodes in the test set; a link (label 1) is predicted if the common neighbors count is one or more, and no link (label 0) otherwise.
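A minimal sketch of this rule, on an illustrative toy graph stored as neighbor sets, is:

```python
def common_neighbor_predict(adj, u, v, threshold=1):
    """Predict a link (1) if u and v share at least `threshold` neighbors."""
    return 1 if len(adj[u] & adj[v]) >= threshold else 0

# toy undirected graph: node -> set of neighbors
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3},
}
pred_03 = common_neighbor_predict(adj, 0, 3)  # share neighbor 2
pred_04 = common_neighbor_predict(adj, 0, 4)  # no common neighbor
```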
The performance of the Common Neighbors Method is evaluated through several metrics, demonstrating notable effectiveness in predicting links. The precision, which measures the correctness of predicted links, is 0.8437, suggesting that a significant proportion of links predicted by the model are actual links. The recall of 0.8691 implies that the method is effective in identifying actual links within the test set. The F1 score, a balance between precision and recall, stands at 0.8562, further attesting to the method's robustness. Finally, the AUC score of 0.8540 indicates a strong ability to differentiate between the presence and absence of links. These results suggest that the Common Neighbors Method is a reliable tool in network analysis for link prediction, substantially surpassing the baseline established by the random prediction approach.

Jaccard Coefficient
The Jaccard Coefficient Method [30], another approach for link prediction, is based on the similarity between the sets of neighbors of two nodes. The Jaccard coefficient is calculated as the ratio of the size of the intersection to the size of the union of the neighbor sets of two nodes. Predictions are made based on a set threshold; in this case, a link is predicted if the Jaccard score is equal to or exceeds 0.01.
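The score and thresholding rule can be sketched as follows, on an illustrative toy graph stored as neighbor sets:

```python
def jaccard_score(adj, u, v):
    """Jaccard coefficient: |N(u) & N(v)| / |N(u) | N(v)|."""
    inter = len(adj[u] & adj[v])
    union = len(adj[u] | adj[v])
    return inter / union if union else 0.0

def jaccard_predict(adj, u, v, threshold=0.01):
    return 1 if jaccard_score(adj, u, v) >= threshold else 0

# toy undirected graph: node -> set of neighbors
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3},
}
score_03 = jaccard_score(adj, 0, 3)  # intersection {2}, union {1, 2, 4}
```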
The performance metrics for the Jaccard Coefficient Method show its effectiveness in link prediction. The precision is 0.8537, suggesting that a significant portion of the predicted links are true links. The recall of 0.7862 demonstrates the method's capability in identifying a majority of actual links. The F1 score, at 0.8186, reflects a balanced measure of precision and recall. Additionally, an AUC score of 0.8257 highlights the method's proficiency in distinguishing between link presence and absence. These results showcase the Jaccard Coefficient Method as a viable and efficient tool for predicting links in network analysis.

Adamic/Adar Index
The Adamic/Adar Index Method [31] is implemented for link prediction, focusing on the shared neighbors of two nodes. The method computes the Adamic/Adar index, which accounts for the number of common neighbors and inversely weights these neighbors by the logarithm of their degrees. A threshold (0.5 in this case) determines whether a link is predicted based on the index value.
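The index can be sketched as follows, on an illustrative toy graph stored as neighbor sets (common neighbors of degree 1 are skipped to avoid division by log 1 = 0):

```python
import math

def adamic_adar(adj, u, v):
    """Sum of 1 / log(deg(w)) over common neighbors w with degree > 1."""
    return sum(1.0 / math.log(len(adj[w]))
               for w in adj[u] & adj[v]
               if len(adj[w]) > 1)

# toy undirected graph: node -> set of neighbors
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3},
}
score_03 = adamic_adar(adj, 0, 3)  # common neighbor 2 has degree 3
pred_03 = 1 if score_03 >= 0.5 else 0
```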
The method's performance is summarized as follows: it achieves a precision of 0.9421, a recall of 0.6514, an F1 score of 0.7702, and an AUC score of 0.8057. These metrics indicate that while the method is highly precise in its predictions, its recall is comparatively lower, suggesting a more conservative prediction of links. The overall results demonstrate the method's efficiency in discerning link presence in the network.

Resource Allocation Index
The Resource Allocation Index Method [32], applied here for link prediction, is based on the idea of allocating resources through common neighbors between pairs of nodes. The Resource Allocation (RA) index is calculated for each pair in the test set, taking into account the degrees of the common neighbors. A predefined threshold (0.01 in this case) is used to determine whether a link is predicted based on the RA index score.
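The RA index differs from Adamic/Adar only in weighting common neighbors by 1/degree rather than 1/log(degree); a sketch on an illustrative toy graph:

```python
def resource_allocation(adj, u, v):
    """Sum of 1 / deg(w) over common neighbors w of u and v."""
    return sum(1.0 / len(adj[w]) for w in adj[u] & adj[v])

# toy undirected graph: node -> set of neighbors
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3},
}
score_03 = resource_allocation(adj, 0, 3)  # common neighbor 2 has degree 3
pred_03 = 1 if score_03 >= 0.01 else 0
```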
The performance metrics for the Resource Allocation Index Method are as follows: a precision of 0.9144, indicating a high degree of accuracy in the predicted links; a recall of 0.7662, showing the method's ability to identify a substantial portion of actual links; and an F1 score of 0.8337, reflecting a balanced trade-off between precision and recall. The AUC score is 0.8472, suggesting a strong discriminatory ability between link presence and absence.

Overview
Table 2 delineates the performance of three machine learning algorithms (Logistic Regression, Random Forest, and XGBoost) on link prediction tasks, evaluated by Precision, Recall, F1 Score, and AUC Score. It highlights that while XGBoost achieves the highest precision, Random Forest outperforms the others in terms of the F1 Score, suggesting a more balanced performance in precision and recall.

Feature Engineering
The feature engineering phase in our study is a pivotal step towards enabling effective network analysis. Initially, we standardize numerical data using a standard scaler to ensure mean normalization and variance scaling. For categorical variables, a one-hot encoding scheme [33] is applied to convert these variables into a binary vector representation, avoiding any implicit ordering that could be mistakenly inferred by the models. Textual data undergoes a transformation via TF-IDF (Term Frequency-Inverse Document Frequency) vectorization [34], a technique that reflects the importance of words relative to a document collection, hence allowing for the nuanced representation of text features. The culmination of these preprocessing techniques is a multi-dimensional feature space, structured as a sparse matrix to manage the high dimensionality efficiently. The final feature space encompasses a total of 10,187 samples, each characterized by a feature vector consisting of 2,026 distinct elements. This feature space's dimensionality was carefully curated to capture the rich, underlying structures of the node attributes while avoiding the pitfalls of overfitting that can occur with excessively high-dimensional data. The resulting dataset, formed through careful curation and transformation of node characteristics, provides a comprehensive and robust foundation for the subsequent phases of predictive modeling in our network analysis.
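As an illustration of the TF-IDF step (our pipeline uses a standard vectorizer; this toy implementation, with unsmoothed IDF and an invented function name, only shows the underlying computation):

```python
import math

def tfidf(docs):
    """Minimal TF-IDF: term frequency scaled by inverse document frequency."""
    vocab = sorted({w for d in docs for w in d.split()})
    n = len(docs)
    # document frequency of each term
    df = {w: sum(1 for d in docs if w in d.split()) for w in vocab}
    vectors = []
    for d in docs:
        words = d.split()
        vec = []
        for w in vocab:
            tf = words.count(w) / len(words)
            idf = math.log(n / df[w])  # 0 for terms present in every document
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors

vocab, vecs = tfidf(["zinc anode battery", "zinc cathode"])
```

A term occurring in every document (here "zinc") receives zero weight, while rarer terms are up-weighted.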

Logistic Regression
In this study, a Logistic Regression model [35] is employed as a machine learning method for link prediction in the network. Logistic Regression, a widely used statistical model for binary classification, is trained on the engineered feature set, which encapsulates the characteristics of node pairs in the network. The model is optimized to predict the likelihood of link existence between nodes based on the features derived from their attributes.
The performance of the Logistic Regression model is evaluated using standard metrics on both training and test datasets. In the training phase, the model achieves a precision of 0.7242, a recall of 0.7220, an F1 score of 0.7231, and an AUC score of 0.7235. For the test set, the model demonstrates a precision of 0.7799, a recall of 0.7090, an F1 score of 0.7428, and an AUC score of 0.7544. These metrics indicate the model's effectiveness in accurately predicting links, with a notable performance improvement on the test set compared to the training set. This suggests that the Logistic Regression model, trained on the comprehensive feature set, is a robust tool for link prediction in network analysis.

Random Forest
In this phase of the study, a RandomForestClassifier, a powerful ensemble learning method, is utilized for link prediction in the network. The RandomForest model [36] [37], known for its robustness and ability to handle complex datasets, is trained on the prepared feature set, which includes a diverse range of node attributes and their interactions.
The RandomForest model's performance is assessed using a variety of metrics. During training, the model demonstrates exceptionally high precision, recall, F1 score, and AUC score, all approximately 0.9986, indicating an almost perfect fit to the training data. However, when evaluated on the test set, the model exhibits a precision of 0.8379, a recall of 0.7926, an F1 score of 0.8146, and an AUC score of 0.8196. These results reflect the model's effectiveness in generalizing to new data, with a notable distinction between its training and testing performance. The high precision on the test set suggests the model's strong capability in accurately identifying true links, while the recall indicates its efficiency in capturing a substantial proportion of actual links. These metrics collectively demonstrate the RandomForest model's efficacy as a robust tool for predicting links in network analysis.

XGBoost
The study employs the XGBoost classifier, an advanced gradient boosting framework, for the task of link prediction in networks. XGBoost [38] is renowned for its high efficiency and effectiveness, particularly in dealing with complex datasets. It operates by sequentially building trees, where each new tree attempts to correct the errors made by the previous ones, resulting in a strong predictive model.
In terms of performance, the XGBoost model demonstrates a balanced and effective capability in both the training and testing phases. The training metrics reveal an accuracy of 0.7948, a precision of 0.8186, a recall of 0.7574, an F1 score of 0.7869, and an AUC score of 0.7948. The test metrics exhibit an accuracy of 0.8095, a precision of 0.8637, a recall of 0.7349, an F1 score of 0.7941, and an AUC score of 0.8095. These results indicate the model's strong performance in accurately predicting links, with particularly high precision on the test set, suggesting its ability to generalize well to new data. The XGBoost model thus proves to be a robust and reliable tool in the context of network analysis for link prediction.

Graph Neural Networks
Figure 2 illustrates the process by which the GNN models utilize subgraphs observed from the input graph to predict links. This visual representation clarifies the methodology underlying the application of various GNN variants and demonstrates the nuanced approach of our analysis.

Overview
Table 3 compares Graph Neural Network (GNN) models with and without the integration of Louvain community detection, revealing the impact of incorporating structural community information into link prediction tasks. The results indicate a consistent improvement in model performance when community detection is applied.
For the GAT model, the addition of Louvain community detection leads to increases in precision, recall, F1 score, and AUC score, suggesting that community context enhances the model's ability to predict links accurately. Similar trends are observed in the GATv2 model, where integrating community features results in a rise in all evaluated metrics, notably in the AUC score.
The GCN model, already exhibiting strong performance, further benefits from the inclusion of community detection, as evidenced by the increase in precision, recall, F1 score, and AUC score. This improvement is also mirrored in the GCNv2 model, although its precision slightly decreases while the other metrics show enhancements.
Lastly, the GraphSAGE model shows considerable improvements across all metrics with the inclusion of Louvain community detection. This suggests that adding community features is particularly effective for models like GraphSAGE, which rely heavily on neighborhood aggregation.
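One common way to inject Louvain community labels into the node features, and the scheme assumed in this illustrative sketch, is to concatenate a one-hot community indicator to each node's original feature vector before it is passed to the GNN:

```python
def add_community_features(node_features, communities, num_communities):
    """Append a one-hot community indicator to each node's feature vector."""
    enriched = {}
    for node, feats in node_features.items():
        one_hot = [0.0] * num_communities
        one_hot[communities[node]] = 1.0
        enriched[node] = feats + one_hot
    return enriched

features = {0: [0.5, 1.0], 1: [0.1, 0.2]}
communities = {0: 0, 1: 1}  # e.g., labels produced by the Louvain algorithm
enriched = add_community_features(features, communities, num_communities=2)
```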
The confusion matrices presented in Figure A1 correspond to the models listed in Table 3, in the same order. They illustrate the performance of each model on a consistent test set, with a fixed random seed of 42 to ensure reproducibility. The left side of each matrix pertains to models using only the original node features, while the right side shows the performance after the integration of community detection features via the Louvain method.

Graph Attention Networks
In this study, a Graph Attention Network (GAT) is employed for link prediction, leveraging the power of attention mechanisms in graph neural networks. The GAT model uses multiple attention heads to focus on different parts of the node's neighborhood, allowing for a more nuanced aggregation of neighbor information. This structure is particularly effective in capturing the complex dependencies within the graph data.
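The core of the attention mechanism, softmax-normalizing per-neighbor scores and computing a weighted aggregation, can be sketched for a single head as follows (raw attention scores are assumed precomputed; our actual models use multi-head layers from a GNN library):

```python
import math

def attention_weights(scores):
    """Softmax-normalize raw attention scores over a node's neighborhood,
    as a single GAT head does before aggregating neighbor features."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(neighbor_feats, scores):
    """Weighted sum of neighbor features using the attention coefficients."""
    alphas = attention_weights(scores)
    dim = len(neighbor_feats[0])
    return [sum(a * f[i] for a, f in zip(alphas, neighbor_feats))
            for i in range(dim)]

# two neighbors with equal scores receive equal attention
weights = attention_weights([0.0, 0.0])
out = aggregate([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```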
The GAT model achieves a precision of 0.6065 and a recall of 0.9231 in link prediction, with an F1 score of 0.7320 and an AUC score of 0.8225. These metrics underscore its effectiveness in identifying relevant links and its strong overall predictive performance.
The study employs GATv2 [39], an evolution of the Graph Attention Network (GAT), which incorporates enhanced attention mechanisms for link prediction. GATv2 is characterized by attention heads that enable more nuanced feature aggregation from a node's neighborhood, effectively capturing complex interactions within the graph.
In testing, the GATv2 model demonstrates a precision of 0.5824 and a recall of 0.9317, culminating in an F1 score of 0.7168. The AUC score stands at 0.8118, indicating its proficient predictive ability in distinguishing between the presence and absence of links.
Figure 3 presents the training loss trajectories for both the GAT and GATv2 models over a span of 300 epochs, providing a visual representation of the models' learning progress. The GAT model's loss, denoted by triangles, shows an initial rapid decline, stabilizing as the epochs increase, which is indicative of the model effectively learning from the training data. Similarly, the GATv2 model begins with a higher loss that quickly descends, albeit with some early fluctuations, before it too stabilizes. This pattern reflects the GATv2 model's advanced attention mechanisms adapting to the graph structure over time. The colors selected for the plot are consistent with those used in previous figures for ease of comparison. In our experimental validation, the hyperparameters for the GAT and GATv2 models were optimized through comprehensive testing. The final configurations, yielding the most efficacious results, are delineated in Tables A1 and A2.

Graph Convolutional Networks
In this segment of the research, a Graph Convolutional Network (GCN) is implemented for link prediction in network analysis. This GCN includes additional layers and dropout for regularization, enhancing its capability to capture complex patterns in graph data. The model, by operating directly on the graph structure, efficiently aggregates and transforms node features, leveraging the inherent topological information.
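A single propagation step of the kind described here can be sketched as follows (toy random weights; the actual model stacks several such layers with dropout): H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W), with self-loops added before symmetric normalization.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step with self-loops and symmetric normalization."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)                       # degrees of the augmented graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D^{-1/2}
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # ReLU

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)       # toy adjacency matrix
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                     # node features
W = rng.normal(size=(3, 2))                     # learnable weights (random here)
H_out = gcn_layer(A, H, W)                      # propagated node embeddings
```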
The GCN model demonstrates commendable performance in link prediction, achieving a precision of 0.6638, a recall of 0.9570, and an F1 score of 0.7839. Its AUC score of 0.8921 highlights its effective discrimination between link presence and absence, underscoring the model's robustness in network analysis.
The GCNv2 model [40], an advanced version of the Graph Convolutional Network (GCN), is implemented for link prediction. The GCNv2 model integrates additional convolution layers and a linear transformation layer, enhancing its capacity to process complex graph structures. Key parameters such as alpha and theta are adjusted to optimize the model's performance.
The GCNv2 model exhibits impressive results in link prediction, achieving a precision of 0.6796, a recall of 0.9697, and an F1 score of 0.7991. Its AUC score of 0.9199 further demonstrates its strong capability in differentiating between the presence and absence of links in the network.
Figure 4 displays the training loss trajectories for the GCN and GCNv2 models across 300 training epochs. The GCN model's losses are indicated with circles and exhibit a gradual descent with some variability, suggesting an incremental learning process. In contrast, the GCNv2 model, marked with squares, shows a steeper initial decline in loss, indicating a rapid adaptation to the data. This steep decline in the early epochs for GCNv2 can be attributed to the enhanced convolution layers and linear transformation layer that allow it to capture complex graph structures more effectively. The hyperparameter tuning undertaken for the GCN and GCNv2 models was systematic, aiming to ascertain an optimal set of configurations for enhancing model performance. The resulting parameters, determined to be most effective through empirical testing, are listed in Tables A3 and A4.

GraphSAGE
The study implements GraphSAGE, a Graph Neural Network model known for its effectiveness in learning from large graphs. GraphSAGE utilizes neighborhood sampling and aggregating functions to generate node embeddings, efficiently capturing local graph structures.
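The sampling-and-aggregation idea can be illustrated with a minimal mean aggregator (toy weights and neighbor lists; the real model samples fixed-size neighborhoods and stacks several layers): each node concatenates its own features with the mean of its neighbors' features, then applies a shared linear map and a nonlinearity.

```python
import numpy as np

def sage_mean_layer(H, neighbors, W):
    """One GraphSAGE step with a mean aggregator: [h_i || mean(h_nbrs)] W."""
    out = []
    for i in range(H.shape[0]):
        nbr_mean = H[neighbors[i]].mean(axis=0)          # aggregate neighbors
        out.append(np.concatenate([H[i], nbr_mean]) @ W)  # shared linear map
    return np.maximum(np.array(out), 0.0)                 # ReLU

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                       # 4 nodes, 3 features each
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
W = rng.normal(size=(6, 2))                       # maps [self || mean(nbrs)]
H_out = sage_mean_layer(H, neighbors, W)          # new node embeddings
```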
The GraphSAGE model demonstrates a precision of 0.6556 and a recall of 0.9392 in link prediction, resulting in an F1 score of 0.7722. Its AUC score of 0.8589 indicates a strong capability in distinguishing between the presence and absence of links.
Figure 5 presents the training loss trajectory for the GraphSAGE model over 300 epochs, providing insight into the model's learning performance throughout the training phase. The loss curve, depicted with a yellow line, begins with a sharp decline, indicating a rapid initial learning phase, and then transitions into a more gradual descent. This suggests that GraphSAGE quickly assimilates the local graph structures and then refines its understanding of the data over time. The relative stability of the loss after the initial epochs showcases the efficiency of GraphSAGE's neighborhood sampling and aggregation functions in generating informative node embeddings. The hyperparameter tuning process was executed with the goal of identifying the most effective model parameters. The selected hyperparameters for the Enhanced GraphSAGE model are detailed in Table A5.

Graph Neural Networks with Community Detection
In the enhanced methodology of our study, the preprocessing of graph data is elevated through the integration of the Louvain community detection algorithm [41]. The Louvain method operates on the principle of modularity optimization. Mathematically, it seeks to maximize the modularity Q, defined as

Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),

where A is the adjacency matrix of the network, m is the sum of all edge weights, k_i and k_j are the degrees of nodes i and j, c_i and c_j are their community assignments, and δ is the Kronecker delta function that equals 1 if nodes i and j are in the same community and 0 otherwise.
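The modularity Q can be computed directly from this definition. The following pure-Python sketch evaluates it on a toy unweighted graph (two disconnected groups) with a hand-chosen partition:

```python
# Toy unweighted graph: a triangle {0,1,2} plus an isolated edge {3,4}.
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
n = 5
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

k = [sum(row) for row in A]          # node degrees k_i
m = len(edges)                       # total edge weight (unweighted graph)
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}   # hand-chosen partition

# Q = (1/2m) * sum_ij [A_ij - k_i k_j / 2m] * delta(c_i, c_j)
Q = sum(
    A[i][j] - k[i] * k[j] / (2 * m)
    for i in range(n) for j in range(n)
    if community[i] == community[j]
) / (2 * m)
# Q == 0.375 for this graph and partition
```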
By applying this algorithm, each node is endowed with a community label that encapsulates its modular connectivity within the graph. These labels are then encoded into a one-hot vector, expanding the original node feature set (previously vectorized via TF-IDF and one-hot encoding) by appending this new community structure information. Consequently, the feature space dimension increases from 2,026 to 3,315, enriching the input data for the GNN models.
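The augmentation step amounts to a simple concatenation. A minimal sketch with toy dimensions (five nodes and four original features here, not the paper's 2,026-dimensional feature set):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(5, 4))   # original node features
labels = np.array([0, 0, 0, 1, 1])                 # Louvain community ids
n_comms = labels.max() + 1

one_hot = np.eye(n_comms)[labels]                  # (num_nodes, num_communities)
X_aug = np.concatenate([X, one_hot], axis=1)       # features + community info
```

Each row of `X_aug` now carries exactly one active community indicator alongside its original features, which is the augmented input the GNN models receive.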
The GNN models (GCN, GCNv2, GAT, GATv2, and GraphSAGE) are subsequently trained on this augmented dataset, harnessing not only the intrinsic node features but also the macro-structural properties encoded by the Louvain method. This dual representation fosters a deeper interpretative capacity within the models, leading to an improved learning trajectory as observed in Figure 6. The loss trajectories for models trained with community detection features indicate a more pronounced and rapid stabilization, highlighting the effectiveness of the Louvain method in enhancing the network analysis capabilities of GNNs.

Limitation
The Louvain community detection algorithm, when applied in Graph Neural Networks for link prediction, may encounter a resolution limit issue. This limitation arises because the algorithm is prone to merging smaller communities into larger ones, thereby potentially overlooking finer, yet significant, community structures. This oversight can lead to suboptimal performance in link prediction tasks where the identification of smaller, more nuanced communities is crucial for accurate predictions.

Future Work
As we look to the future, the refinement of integrating community detection with GNNs stands as a promising avenue for advancing link prediction techniques. Beyond simply enhancing the accuracy of link predictions, the amalgamation of these two methodologies has the potential to significantly influence the broader field of network analysis. The nuanced understanding of community structures within networks could revolutionize the way we approach complex networked systems, enabling a deeper comprehension of their evolution, resilience, and underlying dynamics.
In the realm of scientific literature, the implications of this research are manifold. By improving paper recommendation systems, our approach can streamline the research process, enabling scholars to discover relevant works with unprecedented precision. This is particularly vital in an era where the sheer volume of publications can overwhelm researchers. Moreover, by anticipating collaboration and citation patterns, we can not only track but also predict the progression of scientific discourse, potentially catalyzing new research intersections and collaborations that might have otherwise remained undiscovered.
The integration of community detection with GNNs could also be extended to other domains where network analysis plays a critical role, such as social network analysis, epidemiology, and bioinformatics. For instance, in social networks, this methodology could enhance the detection of community structures and their changes over time, providing insights into the formation of social groups and the spread of information or misinformation. In epidemiology, it can aid in understanding and predicting the spread of diseases within and between communities, potentially informing public health strategies and interventions. In bioinformatics, it could improve the prediction of protein-protein interactions, which is crucial for understanding cellular processes and the development of drugs.
Furthermore, exploring alternative community detection algorithms can address current limitations and open new research directions, such as the study of community evolution in dynamic networks. The potential to adapt GNN models to better incorporate temporal and structural changes within communities could lead to groundbreaking developments in dynamic network analysis.
In summary, the integration of community detection methods with GNN architectures represents not just a methodological enhancement in link prediction tasks but also a strategic shift towards a more sophisticated understanding of networks. This could lead to the development of more intelligent systems that can adapt to and predict changes within networks, offering significant benefits across various scientific and social domains. Hence, our future work aims not only to improve upon the technical aspects of our approach but also to explore its broader applications, thereby contributing to the progressive evolution of network science and its capacity to address complex real-world problems.

Conclusion
In conclusion, our study provides compelling quantitative evidence that the integration of the Louvain community detection method with various GNN models results in robust performance enhancements in link prediction tasks. For instance, when Louvain is applied to GNN models, we observe significant improvements in AUC scores, such as an increase from 0.777 to 0.823 for GAT and from 0.892 to 0.917 for GCNv2. These improvements are consistent across different GNN architectures, underlining the efficacy of incorporating community structures into predictive modeling.
Furthermore, our Louvain-enhanced GNNs not only consistently outperform traditional machine learning methods, as evidenced by the highest AUC score of 0.917 achieved by GCNv2 with Louvain, but also surpass heuristic-based methods, the strongest of which, the Resource Allocation index, achieves an AUC score of 0.847. Such findings advocate for a paradigm shift towards the adoption of community-aware GNN models for link prediction, highlighting the importance of a synergistic perspective that blends node-specific features with macroscopic network structures. This study, therefore, concludes that leveraging the latent community information within GNN architectures can substantially elevate the performance of link prediction models, setting a new benchmark for future research in this domain.

Appendix B. Analysis of Confusion Matrices for GNN Models
The inclusion of confusion matrices in this study serves to provide a detailed assessment of the classification performance of various GNN models, both with and without the enhancement of the Louvain method. Confusion matrices offer a comprehensive visualization of the model's predictive capabilities, displaying the true positives, false positives, true negatives, and false negatives. This information is crucial for understanding the specific types of errors made by the models and for evaluating their performance beyond aggregated metrics like precision, recall, and F1 score.
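For reference, the aggregated metrics follow directly from the four confusion-matrix counts. A small worked example with made-up counts (not taken from the paper's matrices):

```python
# Hypothetical confusion-matrix counts for a binary link-prediction task.
tp, fp, fn, tn = 60, 40, 5, 95

precision = tp / (tp + fp)                         # fraction of predicted links that are real
recall = tp / (tp + fn)                            # fraction of real links recovered
f1 = 2 * precision * recall / (precision + recall) # harmonic mean of the two
# precision = 0.6, recall ≈ 0.923 for these counts
```

A high-recall, lower-precision profile like this one mirrors the pattern seen across the GNN models in this study, which recover most true links at the cost of some false positives.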

Figure 1. General idea of integrating community detection algorithms with GNN models.
Cai et al. classified graph embedding models into five categories according to the different structures of the input graphs, i.e., matrix decomposition-based graph embedding, deep learning-based graph embedding, edge reconstruction-based graph embedding, graph kernel-based graph embedding, and generative model-based graph embedding [6]. Goyal et al. classified graph embedding into three categories according to the model utilized in the embedding process, i.e., factorization-based graph embedding models, random walk-based graph embedding models, and deep learning-based graph embedding models [7]. Chen et al. classified graph embedding models into seven categories by incorporating the type of the graph along with the model utilized in the embedding process as the basis for the classification, i.e., dimensionality reduction-based graph embedding, random walk-based graph embedding, matrix decomposition-based graph embedding, neural network-based graph embedding, graph embedding for large graphs, graph embedding for hypergraphs, and graph embedding based on attentional mechanisms [8]. Xie et al. classified dynamic networks into five categories based on whether they are of snapshot type or continuous-time type, and in accordance with their different graph embedding strategies, i.e., eigenvalue decomposition-based graph embedding, Skip-Gram model-based graph embedding, self-encoding-based graph embedding, and graph convolution-based graph embedding [9]. Xia et al. classified graph embedding algorithms into four categories based on graph learning techniques, i.e.

Figure 2. The general methodology of different GNN variants.

Figure 3. Training loss over epochs for GAT models.

Figure 4. Training loss over epochs for GCN models.

Figure 5. Training loss over epochs for GraphSAGE model.

Figure 6. Training loss over epochs for different models.

Figure A1. Confusion matrices of different models.

Table 1. Heuristic-Based Methods Overview

Table 2. Machine Learning Methods Overview

Table 3. Graph Neural Networks Overview

Table A5. Optimal Configuration for the Enhanced GraphSAGE Model