FONDUE: A Framework for Node Disambiguation and Deduplication Using Network Embeddings †

graph that represent the same real-life entity, and for detecting and optimally splitting nodes that represent multiple distinct real-life entities. FONDUE does this in an entirely unsupervised fashion, relying exclusively on the topology of the network.

Abstract: Data often have a relational nature that is most easily expressed in a network form, with its main components consisting of nodes that represent real objects and links that signify the relations between these objects. Modeling networks is useful for many purposes, but the efficacy of downstream tasks is often hampered by data quality issues related to their construction. In many constructed networks, ambiguity may arise when a node corresponds to multiple concepts. Similarly, a single entity can be mistakenly represented by several different nodes. In this paper, we formalize both the node disambiguation (NDA) and node deduplication (NDD) tasks to resolve these data quality issues. We then introduce FONDUE, a framework for utilizing network embedding methods for data-driven disambiguation and deduplication of nodes. Given an undirected and unweighted network, FONDUE-NDA identifies nodes that appear to correspond to multiple entities for subsequent splitting and suggests how to split them (node disambiguation), whereas FONDUE-NDD identifies nodes that appear to correspond to the same entity for merging (node deduplication), using only the network topology. From controlled experiments on benchmark networks, we find that FONDUE-NDA is substantially and consistently more accurate with lower computational cost in identifying ambiguous nodes, and that FONDUE-NDD is a competitive alternative for node deduplication, when compared to state-of-the-art alternatives.


Introduction
Increasingly, collected data naturally comes in the form of a network of interrelated entities. Examples include social networks describing social relations between people (e.g., Facebook), citation networks describing the citation relations between papers (e.g., PubMed [1]), biological networks, such as those describing interactions between proteins (e.g., DIP [2]), and knowledge graphs describing relations between concepts or objects (e.g., DBPedia [3]). Thus, new machine learning, data mining, and information retrieval methods are increasingly targeting data in their native network representation.
An important problem across all the fields of data science, broadly speaking, is data quality. For problems on networks, especially those that are successful in exploiting fine- as well as coarse-grained structure of networks, ensuring good data quality is perhaps even more important than in standard tabular data. For example, an incorrect edge can have a dramatic effect on the implicit representation of other nodes, by dramatically changing

The Node Disambiguation Problem
We address the problem of NDA in the most basic setting: given a network, unweighted, unlabeled, and undirected, the task considered is to identify nodes that correspond to multiple distinct real-life entities. We formulate this as an inverse problem, where we use the given ambiguous network (which contains ambiguous nodes) in order to retrieve the unambiguous network (in which all nodes are unambiguous). Clearly, this inverse problem is ill-posed, making it impossible to solve without additional information (which we do not want to assume) or an inductive bias.
The key insight in this paper is that such an inductive bias can be provided by the network embedding (NE) literature. This literature has produced embedding-based models that are capable of accurately modeling the connectivity of real-life networks down to the node-level, while being unable to accurately model random networks [4,5]. Inspired by this research, we propose to use as an inductive bias the fact that the unambiguous network must be easy to model using a NE. Thus, we introduce FONDUE-NDA, a method that identifies nodes as ambiguous if, after splitting, they maximally improve the quality of the resulting NE.
Example 1. Figure 1a illustrates the idea of FONDUE for NDA applied on a single node. In this example, node $i$ with embedding $\mathbf{x}_i$ corresponds to two real-life entities that belong to two separate communities, visualized by either full or dashed lines, to highlight the distinction. Because node $i$ is connected to two different communities, most NE methods would locate its embedding $\mathbf{x}_i$ between the embeddings of the nodes from both communities. Figure 1b shows a split of node $i$ into nodes $i'$ and $i''$, each with connections to only one of the two communities. The resulting network is easy to embed by most NE methods, with embeddings $\mathbf{x}_{i'}$ and $\mathbf{x}_{i''}$ close to their own respective communities.
In contrast, Figure 1c shows a split where the two resulting nodes are harder to embed. Most NE methods would embed them between both communities, but substantial tension would remain, resulting in a worse value of the NE objective function.

The Node Deduplication Problem
The same inductive bias can also be used for the NDD problem, which is stated as follows: given an unweighted, unlabeled, and undirected network, identify distinct nodes that correspond to the same real-life entity. To this end, FONDUE-NDD determines how much merging two given nodes into one would improve the embedding quality of NE models. The inductive bias considers one merge better than another if it results in a better value of the NE objective function.
The diagram in Figure 2 shows the suggested pipeline for tackling both problems.

Figure 2. FONDUE pipeline for both NDA and NDD. Data corruption can lead to two types of problems: node ambiguation (e.g., multiple authors sharing the same name represented with one node in the network) in the left part of the diagram, and node duplication (e.g., one author with name variation represented by more than one node in the network). We then define two tasks to resolve both problems separately using FONDUE.

Contributions
In this paper, we make a number of related contributions:
• We propose FONDUE, a framework exploiting the empirical observation that naturally occurring networks can be embedded well using state-of-the-art NE methods, to tackle two distinct tasks: node deduplication (FONDUE-NDD) and node disambiguation (FONDUE-NDA). The former identifies nodes as more likely to be duplicated if contracting them enhances the quality of an optimal NE. The latter identifies nodes as more likely to be ambiguous if splitting them enhances the quality of an optimal NE;
• In addition to this conceptual contribution, substantial challenges had to be overcome to implement this idea in a scalable manner. Specifically for the NDA problem, through a first-order analysis we derive a fast approximation of the expected NE quality improvement after splitting a node;
• We implemented this idea for CNE [6], a recent state-of-the-art NE method, although we demonstrate that the approach can be applied to a broad class of other NE methods as well;
• For the NDA problem, extensive experiments over a wide range of networks demonstrate the superiority of FONDUE over the state-of-the-art for the identification of ambiguous nodes, and this at comparable computational cost;
• We also empirically observe that, somewhat surprisingly, despite the increase in accuracy for identifying ambiguous nodes, no such improvement was observed for the ambiguous node splitting accuracy. Thus, for NDA, we recommend using FONDUE for the identification of ambiguous nodes, while using an existing state-of-the-art approach for optimally splitting them;
• Experiments on four datasets for NDD demonstrate the viability of FONDUE-NDD for the NDD problem based on only the topological features of a network.

Related Work
The problem of NDA differs from named-entity disambiguation (NED; also known as named entity linking), a natural language processing (NLP) task where the purpose is to identify which real-life entity from a given list a named entity in a text refers to. For example, in the ArnetMiner dataset [7] 'Bin Zhu' corresponds to more than 10 authors. The Open Researcher and Contributor ID (ORCID) [8] was introduced to solve the author name ambiguity problem, and most NED methods rely on ORCID for labeling datasets.
The authors of [7] exploit hidden Markov random fields in a unified probabilistic framework to model node and edge features. On the other hand, Zhang et al. [12] designed a comprehensive framework to tackle name disambiguation, using a complex feature engineering approach. They construct paper networks and use the information shared between two papers to build a supervised model for assigning the weights of the edges of the paper network. If two papers in the network are connected, they are more likely to be authored by the same person.
Recent approaches increasingly rely on more complex data. Ma et al. [13] used representation learning on heterogeneous bibliographic networks, employing relational and paper-related textual features to obtain the embeddings of multiple types of nodes, while using meta-path based proximity measures to evaluate the neighboring and structural similarities of node embeddings in the heterogeneous graphs.
The work of Zhang et al. [9], which focuses on preserving privacy by using solely the link information in a graph, employs network embedding as an intermediate step to perform NED. However, it relies on other networks (person-document and document-document) in addition to the person-person network to perform the task.
Although NDA could be used to assist in NED tasks, NED typically relies strongly on the text, e.g., by characterizing the context in which the named entity occurs (e.g., paper topic) [14]. Similarly, Ma et al. [15] propose a name disambiguation model based on representation learning that employs attributes and network connections. It first encodes the attributes of each paper using a variational graph auto-encoder, then computes a similarity metric from the relationship of these attributes, and finally uses graph embedding to leverage the author relationships, relying heavily on NLP.
In NDA, in contrast, no natural language is considered, and the goal is to rely on just the network's connectivity in order to identify which nodes may correspond to multiple distinct entities. Moreover, NDA does not assume the availability of a list of known unambiguous entity identifiers, such that an important part of the challenge is to identify which nodes are ambiguous in the first place. This makes NDA more privacy-friendly and extends its applicability to datasets where access to additional information is restricted or not possible.
The research by Saha et al. [16] and Hermansson et al. [17] is most closely related to ours. These papers also use only topological information of the network for NDA. Yet, [16] also requires timestamps for the edges, while [17] requires a training set of nodes labeled as ambiguous and non-ambiguous. Moreover, even though the method proposed by [16] is reportedly orders of magnitude faster than the one proposed by [17], it remains computationally substantially more demanding than FONDUE (e.g., [16] evaluate their method on networks with just 150 entities). Other recent work using NE for NED [9,18-20] is only indirectly related, as it relies on additional information besides the topology of the network.
The literature on NDD is scarce, as the problem is not well-defined. Conceptually, it is similar to the named entity linking (NEL) problem [11,21], which aims to link instances of named entities in a text, such as newspaper articles, to the corresponding entities, often in knowledge bases (KB). Consequently, NEL heavily relies on textual data to identify erroneous entities rather than on entity connections, which are the core of our method. KB approaches for NEL are dominant in the field [22,23], as they make use of knowledge base datasets, heavily relying on labeled and additional graph data to tackle the named entity linking task. This also poses a challenge when it comes to benchmarking our method for NDD. To our knowledge, no study in the current literature tackles NDD from a purely topological approach, i.e., without reliance on additional attributes and features.

Methods
Section 3.1 formally defines the NDA and NDD problems. Section 3.2 introduces the FONDUE framework in a maximally generic manner, independent of the specific NE method it is applied to, or the task (NDA or NDD) it is used for. A scalable approximation of FONDUE-NDA is described throughout Section 3.3, and applied to CNE as a specific NE method. Section 3.4 details the FONDUE-NDD method used for NDD.
Throughout this paper, a bold uppercase letter denotes a matrix (e.g., $\mathbf{A}$), a bold lower case letter denotes a column vector (e.g., $\mathbf{x}_i$), $(\cdot)^\top$ denotes matrix transpose (e.g., $\mathbf{A}^\top$), and $\|\cdot\|$ denotes the Frobenius norm of a matrix (e.g., $\|\mathbf{A}\|$).

Problem Definition
We denote an undirected, unweighted, unlabeled graph as $G = (V, E)$, with $V = \{1, 2, \ldots, n\}$ the set of $n$ nodes (or vertices), and $E \subseteq \binom{V}{2}$ the set of edges (or links) between these nodes. We also define the adjacency matrix of a graph $G$, denoted $\mathbf{A} \in \{0, 1\}^{n \times n}$, with $A_{ij} = 1$ if $\{i, j\} \in E$ and $A_{ij} = 0$ otherwise. We denote $\mathbf{a}_i \in \{0, 1\}^n$ as the adjacency vector for node $i$, i.e., the $i$th column of the adjacency matrix $\mathbf{A}$, and $\Gamma(i) = \{j \mid \{i, j\} \in E\}$ the set of neighbors of $i$.
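The notation above can be made concrete with a small sketch. It is only an illustration: the helper names `adjacency_matrix` and `neighbors` are hypothetical, and nodes are indexed from 0 rather than from 1 as in the text.

```python
import numpy as np

def adjacency_matrix(n, edges):
    """Build the symmetric 0/1 adjacency matrix A for nodes 0..n-1."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = 1
        A[j, i] = 1  # undirected graph: A is symmetric
    return A

def neighbors(A, i):
    """Return Gamma(i), the set of neighbors of node i."""
    return set(np.flatnonzero(A[:, i]).tolist())

# Toy graph with 4 nodes (0-based indices, an assumption of this sketch).
edges = [(0, 1), (0, 2), (2, 3)]
A = adjacency_matrix(4, edges)
a_0 = A[:, 0]            # adjacency vector of node 0 (a column of A)
gamma_0 = neighbors(A, 0)
```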

Formalizing the Node Disambiguation Problem
To formally define the NDA problem as an inverse problem, we first need to define the forward problem which maps an unambiguous graph onto an ambiguous one. This formalizes the 'corruption' process that creates ambiguity in the graph. In practice, this happens most often because identifiers of the entities represented by the nodes are not unique. For example, in a co-authorship network, the identifiers could be non-unique author names. To this end, we define a node contraction:

Definition 1 (Node Contraction). A node contraction $c$ for a graph $G = (V, E)$ with $V = \{1, 2, \ldots, n\}$ is a surjective function $c : V \to \hat{V}$ for some set $\hat{V} = \{1, 2, \ldots, \hat{n}\}$ with $\hat{n} \le n$. For convenience, we will define $c^{-1} : \hat{V} \to 2^V$ as $c^{-1}(i) = \{k \in V \mid c(k) = i\}$ for any $i \in \hat{V}$. Moreover, we will refer to the cardinality $|c^{-1}(i)|$ as the multiplicity of the node $i \in \hat{V}$.
A node contraction defines an equivalence relation $\sim_c$ over the set of nodes: $i \sim_c j$ if $c(i) = c(j)$, and the set $\hat{V}$ is the quotient set $V / \sim_c$. Continuing our example of a co-authorship network, a node contraction maps an author onto the node representing their name. Two authors $i$ and $j$ would be equivalent if their names $c(i)$ and $c(j)$ are equal, and the multiplicity of a node is the number of distinct authors with the corresponding name.
We can naturally define the concept of an ambiguous graph in terms of the contraction operation, as follows.
Definition 2 (Ambiguous graph). Given a graph $G = (V, E)$ and a node contraction $c$ for that graph, the graph $\hat{G} = (\hat{V}, \hat{E})$ defined as $\hat{E} = \{\{c(k), c(l)\} \mid \{k, l\} \in E\}$ is referred to as an ambiguous graph of $G$. Overloading notation, we may write $\hat{G} \triangleq c(G)$. To contrast $G$ with $\hat{G}$, we may refer to $G$ as the unambiguous graph.
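As a hedged illustration of Definitions 1 and 2, the sketch below applies a contraction to a toy edge set. The function names are hypothetical, and dropping pairs whose endpoints are mapped to the same node is an assumption of this sketch (it keeps the contracted graph free of self-loops).

```python
from collections import Counter

def contract_graph(edges, c):
    """Apply a node contraction c (dict: node -> contracted node) to an edge set."""
    contracted = set()
    for k, l in edges:
        i, j = c[k], c[l]
        if i != j:  # assumption: edges between equivalent nodes are dropped
            contracted.add(frozenset((i, j)))
    return contracted

def multiplicities(c):
    """Return |c^{-1}(i)| for every contracted node i."""
    return Counter(c.values())

# Unambiguous toy graph: authors 1..5; authors 2 and 4 share a name,
# so the contraction maps both of them to contracted node 2.
E = {(1, 2), (2, 3), (4, 5)}
c = {1: 1, 2: 2, 3: 3, 4: 2, 5: 4}
E_hat = contract_graph(E, c)
mult = multiplicities(c)
```

In this toy example the contracted node 2 has multiplicity 2 and inherits the neighbors of both original authors, which is exactly the situation NDA tries to detect and undo.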
Continuing the example of the co-authorship network, the contraction operation can be thought of as the operation that replaces author identities with their names, which may map distinct authors onto a same shared name. Note that the symbols for the ambiguous graph and its set of nodes and edges are denoted here using hats, to indicate that in the NDA problem we are interested in situations where the ambiguous graph is the empirically observed graph.
We can now formally define the NDA problem as inverting this contraction operation:

Definition 3 (The Node Disambiguation Problem). Given an ambiguous graph $\hat{G} = (\hat{V}, \hat{E})$, NDA aims to retrieve the unambiguous graph $G = (V, E)$ and associated node contraction $c$, i.e., a contraction $c$ for which $c(G) = \hat{G}$.
To be more precise, it suffices to identify G up to an isomorphism, as the actual identifiers of the nodes are irrelevant.

Formalizing the Node Deduplication Problem
The NDD problem can be formalized as the converse of the NDA problem, also relying on the concept of node contractions. First, a duplicate graph can be defined as follows:

Definition 4 (Duplicate graph). Given a graph $G = (V, E)$, a graph $\hat{G} = (\hat{V}, \hat{E})$ where $\{k, l\} \in \hat{E} \Rightarrow \{c(k), c(l)\} \in E$ for an appropriate contraction $c$, and where for each $\{i, j\} \in E$ there exists an edge $\{k, l\} \in \hat{E}$ for which $c(k) = i$ and $c(l) = j$, is referred to as a duplicate graph of $G$. Or more concisely, using the overloaded notation from Definition 2, a duplicate graph $\hat{G}$ is a graph for which $c(\hat{G}) = G$. To contrast $G$ with $\hat{G}$, we may refer to $G$ as the deduplicated graph.
Continuing the example of the co-authorship network, one author (a single node in the deduplicated graph) could appear under two versions of their name, such that they are assigned two different nodes in the duplicate graph. A contraction operation that maps duplicate names to their common identity would merge such nodes corresponding to the same author. Hats on top of the symbols of the duplicate graph indicate that in the NDD problem we are interested in the situation where the duplicate graph is the empirically observed one.
The NDD problem can, thus, be formally defined as follows:

Definition 5 (The Node Deduplication Problem). Given a duplicate graph $\hat{G} = (\hat{V}, \hat{E})$, NDD aims to retrieve the deduplicated graph $G = (V, E)$ and the node contraction $c$ associated with $\hat{G}$, i.e., for which $G = c(\hat{G})$.

Real Graphs Suffer from Both Issues
Of course, many real graphs require both deduplication and disambiguation. This is particularly true for the running example of the co-authorship network. Yet, while building on the common FONDUE framework, we define and study both problems separately, and propose an algorithm for each in Section 3.3 (for NDA) and Section 3.4 (for NDD). For networks suffering from both problems, both algorithms can be applied concurrently or sequentially without difficulties, thus solving both problems simultaneously.

FONDUE as a Generic Approach
To address both the NDA and NDD problems, FONDUE uses an inductive bias that the non-corrupted (unambiguous and deduplicated) network must be easy to model using NE. This allows us to approach both problems in the context of NE. Here we first formalize the inductive bias of FONDUE (Section 3.2). This will later allow us to present both the FONDUE-NDA (Section 3.3) and FONDUE-NDD (Section 3.4) algorithms, each tackling one of the data corruption tasks (NDA and NDD, respectively).

The FONDUE Induction Bias
Clearly, both the NDA and NDD problems are inverse problems, with NDA an ill-posed one. Thus, further assumptions, inductive bias, or priors are inevitable in order to solve them. The key hypothesis in FONDUE is that the unambiguous and deduplicated graph $G$, being a 'natural' graph, can be embedded well using state-of-the-art NE methods. This hypothesis is inspired by the empirical observation that NE methods embed 'natural' graphs well.
NE methods find a mapping $f : V \to \mathbb{R}^d$ from nodes to $d$-dimensional real vectors. An embedding is denoted as $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)^\top \in \mathbb{R}^{n \times d}$, where $\mathbf{x}_i \triangleq f(i)$ for $i \in V$ is the embedding of each node. Most well-known NE methods aim to find an optimal embedding $\mathbf{X}^*_G$ for a given graph $G$ that minimizes a continuous differentiable cost function $O(G, \mathbf{X})$. Thus, given an ambiguous graph $\hat{G}$, FONDUE-NDA will search for the graph $G$, such that $c(G) = \hat{G}$ for an appropriate contraction $c$, while optimizing the NE cost function on $G$:

Definition 6 (NE-based NDA problem). Given an ambiguous graph $\hat{G}$, NE-based NDA aims to retrieve the unambiguous graph $G$ and the associated contraction $c$:

$$\operatorname*{argmin}_{G,\,c} \; O(G, \mathbf{X}^*_G) \quad \text{s.t.} \quad c(G) = \hat{G}. \tag{1}$$

Ideally, this optimization problem can be solved by simultaneously finding optimal splits for all nodes (i.e., an inverse of the contraction $c$) that yield the smallest embedding cost after re-embedding. However, this strategy requires (a) searching for splits in an exponential search space that comprises the combinations of splits (with arbitrary cardinality) of all nodes, and (b) recomputing the embedding of the resulting network in order to evaluate each combination of splits. Thus, this ideal solution is computationally intractable and more scalable solutions are needed (see Section 3.3).
Similarly, for NDD, given a duplicate graph $\hat{G}$, FONDUE-NDD will search for a graph $G$, such that $c(\hat{G}) = G$ for an appropriate contraction $c$, again while optimizing the NE cost function on $G$:

Definition 7 (NE-based NDD problem). Given a duplicate graph $\hat{G}$, NE-based NDD aims to retrieve the deduplicated graph $G$ and the associated contraction $c$ of $\hat{G}$:

$$\operatorname*{argmin}_{G,\,c} \; O(G, \mathbf{X}^*_G) \quad \text{s.t.} \quad c(\hat{G}) = G. \tag{2}$$

Generally speaking, to solve this optimization problem, we would want to find the optimal merging for all the nodes that would reduce the cost of the embedding after re-embedding. Yet, a thorough optimization of this problem is beyond the scope of this paper, and as an approximation we rely on a ranking-based approach where we rank networks with randomly merged nodes depending on the value of the objective function after re-embedding. This may be suboptimal, but it highlights the viability of the concept when used for NDD, as shown in the results of the experiments.
Although the principle underlying both methods is thus very similar, we will see below that the corresponding methods differ considerably. In common to them is the need for a basic understanding of NE methods.

FONDUE-NDA
From the above section, it is clear that the NDA problem can be decomposed into two subproblems:
1. Estimating the multiplicities of all $i \in \hat{V}$, i.e., the number of unambiguous nodes from $G$ represented by each node from $\hat{G}$. This essentially amounts to estimating the contraction $c$. Note that the number of nodes $n$ in $V$ is then equal to the sum of these multiplicities, and arbitrarily assigning these $n$ nodes to the sets $c^{-1}(i)$ defines $c^{-1}$ and, thus, $c$;
2. Given $c$, estimating the edge set $E$. To ensure that $c(G) = \hat{G}$, for each $\{i, j\} \in \hat{E}$ there must exist at least one edge $\{k, l\} \in E$ with $k \in c^{-1}(i)$ and $l \in c^{-1}(j)$. However, this leaves the problem underdetermined (making this problem ill-posed), as there may also exist multiple such edges.
As an inductive bias for the second step, we will additionally assume that the graph $G$ is sparse. Thus, FONDUE-NDA estimates $G$ as the graph with the smallest set $E$ for which $c(G) = \hat{G}$. Practically, this means that an edge $\{i, j\} \in \hat{E}$ results in exactly one edge $\{k, l\} \in E$ with $k \in c^{-1}(i)$ and $l \in c^{-1}(j)$, and that equivalent nodes $k \sim_c l$ with $k, l \in V$ are never connected by an edge, i.e., $\{k, l\} \notin E$. This bias is motivated by the sparsity of most 'natural' graphs, and our experiments indicate it is justified.
We approach the NE-based NDA problem (Definition 6) in a greedy and iterative manner. In each iteration, FONDUE-NDA identifies the node that has a split which will result in the smallest value of the cost function among all nodes. To further reduce the computational complexity, FONDUE-NDA only splits one node into two nodes at a time (e.g., Figure 1b), i.e., it splits node $i$ into two nodes $i'$ and $i''$ with corresponding adjacency vectors $\mathbf{a}_{i'}, \mathbf{a}_{i''} \in \{0, 1\}^n$, $\mathbf{a}_{i'} + \mathbf{a}_{i''} = \mathbf{a}_i$. We refer to such a split as a binary split. Note that repeated binary splits can of course be used to achieve the same result as a single split into several nodes, so this assumption does not imply a loss of generality or applicability. Once the best binary split of the best node is identified, FONDUE-NDA splits that node and starts the next iteration. The evaluation of each split requires recomputing the embedding, and comparing the resulting optimal NE cost functions with each other.
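A binary split can be sketched on adjacency vectors as follows. This is a minimal illustration with a hypothetical helper name, using 0-based indices; it only shows the bookkeeping of the split, not how the best split is chosen.

```python
import numpy as np

def binary_split(a_i, to_first):
    """Split the adjacency vector a_i into two vectors whose sum is a_i.

    to_first: indices of the neighbors assigned to the first new node;
    all remaining neighbors go to the second new node.
    """
    a_i = np.asarray(a_i)
    a_first = np.zeros_like(a_i)
    idx = list(to_first)
    a_first[idx] = a_i[idx]      # keep only the selected existing edges
    a_second = a_i - a_first     # the rest of the edges
    return a_first, a_second

# Node with neighbors 1, 2 and 4; neighbors 1 and 2 go to the first new node.
a_i = np.array([0, 1, 1, 0, 1])
a_first, a_second = binary_split(a_i, [1, 2])
```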
Unfortunately, this naive strategy is computationally intractable: computing a single NE is already computationally demanding for most (if not all) NE methods. Thus, having to compute a re-embedding for all possible splits, even binary ones (there are $O(n 2^d)$ of them, with $n$ the number of nodes and $d$ the maximal degree), is entirely infeasible for practical networks.
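To give a rough sense of this growth, the sketch below counts candidate binary splits for a given degree sequence. It assumes that both parts of a split must be non-empty and that swapping the two parts yields the same split, so a node of degree d contributes 2^(d-1) - 1 candidates; the function name is hypothetical.

```python
def count_binary_splits(degrees):
    """Count binary splits over all nodes, given their degrees.

    Assumption: both parts are non-empty and the two parts are unordered,
    giving 2**(d - 1) - 1 splits for a node of degree d >= 2.
    """
    return sum(2 ** (d - 1) - 1 for d in degrees if d >= 2)

small = count_binary_splits([1, 2, 3])   # degree-1 nodes cannot be split
hub = count_binary_splits([30])          # a single node with 30 neighbors
```

Even a single node with 30 neighbors already yields more than half a billion candidate splits under these assumptions, each of which would require a re-embedding in the naive strategy.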

A First-Order Approximation for Computational Tractability
Thus, instead of recomputing the embedding, FONDUE-NDA performs a first-order analysis by investigating the effect of an infinitesimal split of a node $i$ around its embedding $\mathbf{x}_i$ on the cost $O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i})$ obtained after performing the split, with $\hat{G}_{s_i}$ and $\hat{\mathbf{X}}_{s_i}$ referring to the ambiguous graph and its embedding, respectively, after splitting node $i$.
Drawing intuition from Figure 1, when two distinct authors share the same name in a given collaboration network, their respective separate communities (ego-networks) are lumped into one big cluster. Yet, from a topological point of view, that ambiguous node (author name) is connected to two communities that are generally different, meaning they share very few, if any, links. This stems from the observation that it is highly unlikely that two authors with the exact same name would belong to the same community, i.e., collaborate together. Furthermore, splitting this ambiguous node into two different ones (distinguishing the two authors) would ideally separate these two communities. We therefore consider that each community, which is supposed to be embedded separately, is pulling the ambiguous node towards its own embedding region, and that once separated, the embeddings of each of the resolved nodes will be improved. Our main goal is thus to quantify the improvement in the embedding cost function obtained by separating the two nodes $i'$ and $i''$ by a unit distance in a certain direction. We propose to split the assignment of the edges of $i$ between $i'$ and $i''$, such that all the links from $i$ are distributed to either $i'$ or $i''$ in such a way as to maximize the improvement of the embedding cost function, which can be evaluated by computing the gradient with respect to the separation distance $\boldsymbol{\delta}_i$.
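Specifically, FONDUE-NDA seeks the split of node $i$ that results in embeddings $\mathbf{x}_{i'}$ and $\mathbf{x}_{i''}$ with an infinitesimal difference $\boldsymbol{\delta}_i = \mathbf{x}_{i'} - \mathbf{x}_{i''}$ (i.e., $\mathbf{x}_{i'} = \mathbf{x}_i + \boldsymbol{\delta}_i / 2$ and $\mathbf{x}_{i''} = \mathbf{x}_i - \boldsymbol{\delta}_i / 2$ with $\boldsymbol{\delta}_i \to \mathbf{0}$), such that the gradient $\nabla_{\boldsymbol{\delta}_i} O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i})$ is large in norm. This can be computed analytically. Indeed, applying the chain rule, we find:

$$\nabla_{\boldsymbol{\delta}_i} O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i}) = \frac{1}{2} \nabla_{\mathbf{x}_{i'}} O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i}) - \frac{1}{2} \nabla_{\mathbf{x}_{i''}} O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i}). \tag{3}$$

Many recent NE methods, like LINE [24] and CNE [6], aim to embed 'similar' nodes in the graph closer to each other, and 'dissimilar' nodes further away from each other (for a particular similarity notion depending on the NE method). For such methods, Equation (3) can be further simplified. Indeed, as such NE methods focus on modeling a property of pairs of nodes (their similarity), their objective functions can typically be decomposed as a summation of node-pair interaction losses over all node-pairs. For example, this can be seen in Section 3.3.3 of the current paper for CNE [6], and in Equations (3) and (6) of [24] for LINE. Each of these node-pair interaction losses quantifies the extent to which the proximity between nodes' embeddings reflects their 'similarity' in the network. For methods where this decomposition is possible, we can thus write the objective function as follows:

$$O(G, \mathbf{X}) = \sum_{\{i,j\}} O_p(A_{ij}, \mathbf{x}_i, \mathbf{x}_j) = \sum_{\{i,j\} : A_{ij} = 1} O_p(A_{ij} = 1, \mathbf{x}_i, \mathbf{x}_j) + \sum_{\{k,l\} : A_{kl} = 0} O_p(A_{kl} = 0, \mathbf{x}_k, \mathbf{x}_l),$$

where $O_p(A_{ij}, \mathbf{x}_i, \mathbf{x}_j)$ denotes the node-pair interaction loss for the nodes $i$ and $j$, $O_p(A_{ij} = 1, \mathbf{x}_i, \mathbf{x}_j)$ is the part of the objective function that corresponds to node $i$ and node $j$ with an edge between them ($A_{ij} = 1$), and $O_p(A_{kl} = 0, \mathbf{x}_k, \mathbf{x}_l)$ is the part of the objective function where node $k$ and node $l$ are disconnected.

Given that $\Gamma(i) = \Gamma(i') \cup \Gamma(i'')$ and $\Gamma(i') \cap \Gamma(i'') = \emptyset$, we can apply the same decomposition approach to $\nabla_{\mathbf{x}_{i'}} O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i})$:

$$\nabla_{\mathbf{x}_{i'}} O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i}) = \sum_{j \in \Gamma(i')} \nabla_{\mathbf{x}_{i'}} O_p(A_{i'j} = 1, \mathbf{x}_{i'}, \mathbf{x}_j) + \sum_{l \notin \Gamma(i')} \nabla_{\mathbf{x}_{i'}} O_p(A_{i'l} = 0, \mathbf{x}_{i'}, \mathbf{x}_l),$$

and analogously for $i''$. Additionally, as both nodes $i'$ and $i''$ share the same set of non-neighbors of node $i$, we can write the following for $\boldsymbol{\delta}_i \to \mathbf{0}$:

$$\sum_{l \notin \Gamma(i)} \nabla_{\mathbf{x}_{i'}} O_p(A_{i'l} = 0, \mathbf{x}_{i'}, \mathbf{x}_l) = \sum_{l \notin \Gamma(i)} \nabla_{\mathbf{x}_{i''}} O_p(A_{i''l} = 0, \mathbf{x}_{i''}, \mathbf{x}_l).$$

Furthermore, incorporating the previous two decompositions, we can rewrite Equation (3) as follows, with all gradients evaluated at $\mathbf{x}_{i'} = \mathbf{x}_{i''} = \mathbf{x}_i$:

$$\nabla_{\boldsymbol{\delta}_i} O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i}) = \frac{1}{2} \sum_{j \in \Gamma(i')} \left[ \nabla_{\mathbf{x}_i} O_p(A_{ij} = 1, \mathbf{x}_i, \mathbf{x}_j) - \nabla_{\mathbf{x}_i} O_p(A_{ij} = 0, \mathbf{x}_i, \mathbf{x}_j) \right] - \frac{1}{2} \sum_{j \in \Gamma(i'')} \left[ \nabla_{\mathbf{x}_i} O_p(A_{ij} = 1, \mathbf{x}_i, \mathbf{x}_j) - \nabla_{\mathbf{x}_i} O_p(A_{ij} = 0, \mathbf{x}_i, \mathbf{x}_j) \right].$$

Denoting $\mathbf{f}_{ij} = \nabla_{\mathbf{x}_i} O_p(A_{ij} = 1, \mathbf{x}_i, \mathbf{x}_j) - \nabla_{\mathbf{x}_i} O_p(A_{ij} = 0, \mathbf{x}_i, \mathbf{x}_j)$ for each neighbor $j \in \Gamma(i)$, the above equation could be simplified to:

$$\nabla_{\boldsymbol{\delta}_i} O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i}) = \frac{1}{2} \sum_{j \in \Gamma(i')} \mathbf{f}_{ij} - \frac{1}{2} \sum_{j \in \Gamma(i'')} \mathbf{f}_{ij}. \tag{4}$$

Let $\mathbf{F}_i \in \mathbb{R}^{d \times |\Gamma(i)|}$ be the matrix whose columns are the vectors $\mathbf{f}_{ij}$, and let $\mathbf{b}_i \in \{1, -1\}^{|\Gamma(i)|}$ be a vector with a dimension corresponding to each of the neighbors $j \in \Gamma(i)$ of $i$, with value equal to $1$ if that neighbor is a neighbor of $i'$ and equal to $-1$ if it is a neighbor of $i''$ after splitting $i$. Then the gradient Equation (4) can be written more concisely and transparently as follows:

$$\nabla_{\boldsymbol{\delta}_i} O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i}) = \frac{1}{2} \mathbf{F}_i \mathbf{b}_i.$$

The aim of FONDUE-NDA is to identify node splits for which the embedding quality improvement is maximal. As argued above, we propose to approximately quantify this by means of a first-order approximation, by considering the two-norm squared of this gradient, namely by maximizing $\| \nabla_{\boldsymbol{\delta}_i} O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i}) \|^2 = \frac{1}{4} \mathbf{b}_i^\top \mathbf{F}_i^\top \mathbf{F}_i \mathbf{b}_i$. Denoting $\mathbf{M}_i = \mathbf{F}_i^\top \mathbf{F}_i$ and dropping the constant factor, FONDUE-NDA can thus be formalized in the following compact form:

$$\operatorname*{argmax}_{i,\; \mathbf{b}_i \in \{1, -1\}^{|\Gamma(i)|}} \; \mathbf{b}_i^\top \mathbf{M}_i \mathbf{b}_i. \tag{5}$$

Note that $\mathbf{M}_i \succeq 0$ for all nodes and all splits, such that this is an instance of a Boolean quadratic maximization problem [25,26]. This problem is NP-hard, thus it requires further approximations to ensure tractability in practice.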
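For intuition, the quadratic objective for a single node can be evaluated directly on a toy example. The sketch below assumes a hypothetical matrix `F` with one column per neighbor of the node (the per-neighbor gradient contribution) and a sign vector `b` encoding the split; the exhaustive search is only feasible for very small neighborhoods and is shown purely for illustration, not as part of the method.

```python
from itertools import product
import numpy as np

def split_score(F, b):
    """Return b^T M b with M = F^T F, i.e., the squared norm of F @ b."""
    M = F.T @ F
    return float(b @ M @ b)

def brute_force_split(F):
    """Exhaustively search sign vectors b; the first entry is fixed to +1
    because b and -b describe the same split and score equally."""
    deg = F.shape[1]
    best_b, best_val = None, -np.inf
    for tail in product([1.0, -1.0], repeat=deg - 1):
        b = np.array((1.0,) + tail)
        val = split_score(F, b)
        if val > best_val:
            best_b, best_val = b, val
    return best_b, best_val

# Hypothetical gradient columns (d = 2) for four neighbors in two groups
# that pull the node in roughly opposite directions.
F = np.array([[1.0, 1.0, -1.0, -1.0],
              [0.0, 0.1,  0.0, -0.1]])
best_b, best_val = brute_force_split(F)
```

In this toy case the best sign vector separates the first two neighbors from the last two, matching the intuition that neighbors pulling in opposite directions belong to different entities.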

Additional Heuristics for Enhanced Scalability
In order to efficiently search for best split on a given node, we developed two approximation heuristics.
First, we randomly split the neighborhood $\Gamma(i)$ into two parts and evaluate the objective (Equation (5)). We repeat this randomization procedure a fixed number of times and pick the split that gives the best objective value as output.
Second, we find the eigenvector $\mathbf{v}$ that corresponds to the largest absolute eigenvalue of the matrix $\mathbf{M}_i$. We sort the elements of $\mathbf{v}$, assign the nodes corresponding to the top $k$ elements to $\Gamma(i')$ and the rest to $\Gamma(i'')$, evaluate the objective value for $k = 1, \ldots, |\Gamma(i)|$, and pick the best split.
Finally, we combine these two heuristics and use the split that gives the best objective value (Equation (5)) as the final split of the node $i$.
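The two heuristics and their combination can be sketched as follows for a single node. This is a minimal illustration under stated assumptions rather than the reference implementation: the function names, the number of random trials, and the toy matrix `M` are all hypothetical, and `M` is assumed to be symmetric positive semi-definite.

```python
import numpy as np

def objective(M, b):
    """Quadratic objective b^T M b for a sign vector b."""
    return float(b @ M @ b)

def random_split(M, n_trials, rng):
    """Heuristic 1: best of several uniformly random sign vectors."""
    best_b, best_val = None, -np.inf
    for _ in range(n_trials):
        b = rng.choice([-1.0, 1.0], size=M.shape[0])
        val = objective(M, b)
        if val > best_val:
            best_b, best_val = b, val
    return best_b, best_val

def eigen_split(M):
    """Heuristic 2: sweep over thresholds of the leading eigenvector."""
    w, V = np.linalg.eigh(M)              # M is symmetric
    v = V[:, np.argmax(np.abs(w))]        # eigenvector of largest |eigenvalue|
    order = np.argsort(-v)                # neighbor indices, largest entry first
    best_b, best_val = None, -np.inf
    for k in range(1, len(v) + 1):
        b = -np.ones(len(v))
        b[order[:k]] = 1.0                # top-k neighbors go to the first node
        val = objective(M, b)
        if val > best_val:
            best_b, best_val = b, val
    return best_b, best_val

def best_split(M, n_trials=20, seed=0):
    """Combine both heuristics and keep the split with the larger objective."""
    rng = np.random.default_rng(seed)
    candidates = [random_split(M, n_trials, rng), eigen_split(M)]
    return max(candidates, key=lambda t: t[1])

# Toy example: M built from hypothetical gradient columns of four neighbors.
F = np.array([[1.0, 1.0, -1.0, -1.0],
              [0.0, 0.1,  0.0, -0.1]])
M = F.T @ F
b_eig, val_eig = eigen_split(M)
b_best, val_best = best_split(M)
```

The eigenvector sweep costs one eigendecomposition plus a linear number of objective evaluations per node, whereas the random heuristic trades accuracy for a fixed evaluation budget.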

FONDUE-NDA Using CNE
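We now apply FONDUE-NDA to conditional network embedding (CNE). CNE proposes a probability distribution for network embedding and finds a locally optimal embedding by maximum likelihood estimation. Written as a cost to be minimized, CNE has the objective function:

$$O(G, \mathbf{X}) = -\log P(G \mid \mathbf{X}) = -\sum_{\{i,j\} : A_{ij} = 1} \log P_{ij}(A_{ij} = 1 \mid \mathbf{X}) - \sum_{\{i,j\} : A_{ij} = 0} \log P_{ij}(A_{ij} = 0 \mid \mathbf{X}).$$

Here, the link probabilities $P_{ij}$ conditioned on the embedding are defined as follows:

$$P_{ij}(A_{ij} = 1 \mid \mathbf{X}) = \frac{P_{\hat{A},ij} \, \mathcal{N}_{+,\sigma_1}(\|\mathbf{x}_i - \mathbf{x}_j\|)}{P_{\hat{A},ij} \, \mathcal{N}_{+,\sigma_1}(\|\mathbf{x}_i - \mathbf{x}_j\|) + (1 - P_{\hat{A},ij}) \, \mathcal{N}_{+,\sigma_2}(\|\mathbf{x}_i - \mathbf{x}_j\|)},$$

where $\mathcal{N}_{+,\sigma}$ denotes a half-normal distribution [27] with spread parameter $\sigma$, $\sigma_2 > \sigma_1 = 1$, and where $P_{\hat{A},ij}$ is a prior probability for a link to exist between nodes $i$ and $j$ as inferred from the degrees of the nodes (or based on other information about the structure of the network [28]). First, we derive the gradient:

$$\nabla_{\mathbf{x}_i} O(\hat{G}, \hat{\mathbf{X}}) = \gamma \sum_{j \neq i} (\mathbf{x}_i - \mathbf{x}_j) \left( A_{ij} - P_{ij}(A_{ij} = 1 \mid \hat{\mathbf{X}}) \right), \quad \text{with } \gamma = \frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}.$$

This allows us to further compute the gradient with respect to the separation $\boldsymbol{\delta}_i$, where $b_{ij}$ denotes the entry of $\mathbf{b}_i$ for neighbor $j$:

$$\nabla_{\boldsymbol{\delta}_i} O(\hat{G}_{s_i}, \hat{\mathbf{X}}_{s_i}) = \frac{\gamma}{2} \sum_{j \in \Gamma(i)} b_{ij} (\mathbf{x}_i - \mathbf{x}_j).$$

Thus, dropping the constant factor $\gamma^2 / 4$, the Boolean quadratic maximization problem has the form:

$$\operatorname*{argmax}_{i,\; \mathbf{b}_i \in \{1, -1\}^{|\Gamma(i)|}} \; \mathbf{b}_i^\top \mathbf{M}_i \mathbf{b}_i, \quad \text{with } (\mathbf{M}_i)_{kl} = (\mathbf{x}_i - \mathbf{x}_k)^\top (\mathbf{x}_i - \mathbf{x}_l) \text{ for } k, l \in \Gamma(i).$$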
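As a hedged sketch of the CNE quantities used here, the code below computes a link probability from two half-normal densities and a prior, and builds the per-node matrix of the quadratic problem from differences between a node's embedding and those of its neighbors. The function names and the default value of the second spread parameter (2.0) are assumptions of this sketch; the text only requires that it exceeds the first, which is fixed to 1.

```python
import numpy as np

def half_normal_pdf(d, sigma):
    """Density of a half-normal distribution with spread sigma at distance d >= 0."""
    return np.sqrt(2.0 / np.pi) / sigma * np.exp(-d ** 2 / (2.0 * sigma ** 2))

def link_probability(x_i, x_j, prior, sigma1=1.0, sigma2=2.0):
    """Posterior probability of a link given the embedding distance and a prior."""
    d = np.linalg.norm(x_i - x_j)
    num = prior * half_normal_pdf(d, sigma1)
    den = num + (1.0 - prior) * half_normal_pdf(d, sigma2)
    return num / den

def cne_split_matrix(X, i, nbrs):
    """Matrix with entries (x_i - x_k)^T (x_i - x_l) for neighbors k, l of node i."""
    D = X[i] - X[nbrs]       # one row per neighbor: x_i - x_j
    return D @ D.T

# Toy 2-d embedding: node 0 sits between two neighbors on each side.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 0.1], [-1.0, 0.0], [-1.0, -0.1]])
M_0 = cne_split_matrix(X, 0, [1, 2, 3, 4])
p_close = link_probability(X[1], X[2], prior=0.5)
p_far = link_probability(X[1], X[3], prior=0.5)
```

Because the matrix depends only on embedding differences, it can be computed for every node from a single embedding of the observed graph, without re-embedding.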

FONDUE-NDD
Using the inductive bias for the NDD problem, the goal is to minimize the embedding cost after merging the duplicate nodes in the graph (Equation (2)). This is motivated by the observation that natural networks tend to be modeled better by NE methods than corrupted (duplicate) networks, so their embedding cost should be lower. Merging (or contracting) duplicate nodes (nodes that refer to the same entity) in a duplicate graph $\hat{G}$ would therefore result in a contracted graph $\hat{G}_c$ that is less corrupt (resembling more a 'natural' graph), and thus has a lower embedding cost.
In contrast to NDA, NDD is more straightforward, as it does not deal with the problem of reassigning the edges of a node after splitting, but simply with determining the duplicate nodes in a duplicate graph. FONDUE-NDD, applied to $\hat{G}$, aims to find duplicate node-pairs in the graph and to combine each pair into one node that receives the union of their edges, which results in the contracted graph $\hat{G}_c$.
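Using NE methods, FONDUE-NDD aims to iteratively identify a node-pair $\{i, j\} \in \hat{V}_{cand}$, where $\hat{V}_{cand}$ is the set of all possible candidate node-pairs, that if merged together to form one node $i_m$, would result in the smallest cost function value among all the other node-pairs. Thus, the NE-based NDD problem (Definition 7) can be further rewritten as:

$$\operatorname*{argmin}_{\{i,j\} \in \hat{V}_{cand}} \; O(\hat{G}_{c_{ij}}, \hat{\mathbf{X}}_{c_{ij}}), \tag{8}$$

where $\hat{G}_{c_{ij}}$ is a contracted graph obtained from $\hat{G}$ after merging the node-pair $\{i, j\}$, and $\hat{\mathbf{X}}_{c_{ij}}$ its respective embedding.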
Trying this for all possible node-pairs in the graph is an intractable solution. It is not obvious what information could be used to approximate Equation (8), thus we approach the problem simply by randomly selecting node-pairs, merging them, observing the values of the cost function, and then ranking the result. The lower the cost score, the more likely that those merged nodes are duplicates.
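The ranking procedure described above can be sketched as follows. The `embedding_cost` argument is a hypothetical callable standing in for "re-embed the contracted graph and return the optimal NE cost"; the edge-count cost used in the toy call is only a placeholder to make the sketch runnable and is not an NE objective.

```python
import numpy as np

def merge_nodes(A, i, j):
    """Contract nodes i and j: node i receives the union of both neighborhoods
    and node j is removed from the adjacency matrix."""
    A = A.copy()
    A[i, :] = np.maximum(A[i, :], A[j, :])
    A[:, i] = A[i, :]                     # keep the matrix symmetric
    A[i, i] = 0                           # no self-loop if i and j were adjacent
    keep = [k for k in range(A.shape[0]) if k != j]
    return A[np.ix_(keep, keep)]

def rank_candidate_merges(A, candidates, embedding_cost):
    """Score each candidate pair by the cost of the contracted graph and sort
    ascending: a lower cost suggests a more likely duplicate pair."""
    scored = [(embedding_cost(merge_nodes(A, i, j)), (i, j)) for i, j in candidates]
    return sorted(scored)

# Toy duplicate graph: nodes 0 and 3 are both attached to node 1 only.
A = np.zeros((4, 4), dtype=int)
for k, l in [(0, 1), (1, 2), (3, 1)]:
    A[k, l] = A[l, k] = 1

merged = merge_nodes(A, 0, 3)
placeholder_cost = lambda A_c: float(A_c.sum()) / 2.0   # stand-in, NOT an NE cost
ranking = rank_candidate_merges(A, [(0, 3), (0, 2)], placeholder_cost)
```

In practice each call to the cost function involves a full re-embedding, which is why only a sample of candidate pairs is evaluated.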
Lacking a scalable bottom-up procedure to identify the best node pairs, in the experiments our focus will be on evaluating whether the introduced criterion for merging is indeed useful to identify whether node pairs appear to be duplicates.

FONDUE-NDD Using CNE
Similarly to the previous section, we proceed by applying CNE as the network embedding method. The objective function of FONDUE-NDD is thus that of CNE evaluated on the tentatively deduplicated graph after attempting a merge: with the link probabilities $P_{kl}$ conditioned on the embedding defined as follows: Similarly to Section 3.3.3, $\mathcal{N}_{+,\sigma}$ denotes a half-Normal distribution with spread parameter $\sigma$, $\sigma_2 > \sigma_1 = 1$, and where $P_{\hat{A}_{c_{ij}},kl}$ is a prior probability for a link to exist between nodes $k$ and $l$, as inferred from the network properties.

Experiments
In this section, we investigate quantitatively and qualitatively the performance of FONDUE on both semi-synthetic and real-world datasets, compared to state-of-the-art methods tackling the same problems. In Section 4.1, we introduce and discuss the different datasets used in our experiments, in Section 4.2 we discuss the performance of FONDUE-NDA, and FONDUE-NDD in Section 4.3. Finally, in Section 4.4, we summarize and discuss the results. All code used in this section is publicly available from the GitHub repository https://github.com/aida-ugent/fondue, accessed on 20 October 2021.

Datasets
One main challenge in evaluating disambiguation tasks is the scarcity of ambiguous (contracted) graph datasets with reliable ground truth. Furthermore, other studies that focus on ambiguous node identification often do not publish their heavily processed datasets (e.g., DBLP datasets [16]), which makes it harder to benchmark different methods. Thus, to simulate data corruption in real-world datasets, we opted to create a contracted graph from a source graph, and then use the source graph as ground truth to assess the accuracy of FONDUE compared to other baselines. To do so, we used a simple approach for node contraction, for both NDA (Section 4.2.1) and NDD (Section 4.3.1). In Table 1 below, we list the details of the different datasets used in our experiments after post-processing.
Additionally, we also use real-world networks containing ambiguous and duplicate nodes, mainly part of the PubMed collaboration network, analyzed in Appendix A. The PubMed data are released in independent issues, so to build a connected network from the PubMed data, we select issues that contain ambiguous and duplicate nodes. We then select the largest connected component of that network. One main limitation of this dataset is that not every author has an associated Orcid ID, which affects the false positive and false negative labels in the network (author names that might be ambiguous would be ignored). This is further highlighted in the subsequent sections.

Node Disambiguation
In this section, we investigate the following questions: ($Q_1$) Quantitatively, how does our method perform in identifying ambiguous nodes compared to the state-of-the-art and other heuristics? (Section 4.2.2); ($Q_2$) Qualitatively, how reliable is the quality of the detected ambiguous nodes compared to other methods when applied to real-world datasets? (Section 4.

Data Processing
Before conducting the experiments, the data had to be processed to generate semi-synthetic networks. This was done by contracting each of the thirteen datasets mentioned in Table 2. More specifically, for each network $G = (V, E)$, a graph contraction was performed to create a contracted (ambiguous) graph $\hat{G} = (\hat{V}, \hat{E})$ by randomly merging a fraction $r$ of the total number of nodes, to create a ground truth to test our proposed method. This is done by first specifying the fraction of the nodes in the graph to be contracted ($r \in \{0.001, 0.01, 0.1\}$), and then sampling two sets of vertices, $V_i \subset V$ and $V_j \subset V$, such that $|V_i| = |V_j| = r \cdot |V|$ and $V_i \cap V_j = \emptyset$. Then, every vertex $v_j \in V_j$ is merged with the corresponding vertex $v_i \in V_i$ by reassigning the links connected to $v_j$ to $v_i$ and removing $v_j$ from the network. The node-pairs $(v_i, v_j)$ later serve as ground truth. We have also tested the case where the candidate contracted vertices have no common neighbors (instead of a uniform selection at random). This mimics some types of social networks: when two authors share the same name, their ego-networks often do not intersect. Further analysis of the PubMed dataset (Table 1) revealed that none of the ambiguous nodes shared edges with the neighbors of another ambiguous node.
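The contraction procedure can be sketched in a few lines. The sketch below is an illustration rather than the code used in the experiments: the function name `contract_graph`, the dictionary-of-sets graph representation, and the rounding of $r \cdot |V|$ to at least one pair are assumptions made here for the sake of a runnable example. It implements only the uniformly random variant.

```python
import random


def contract_graph(adj, r, seed=0):
    """Contract a fraction r of the nodes and return (graph, merged pairs).

    Two disjoint vertex sets V_i and V_j of equal size are sampled, and each
    v_j is merged into its partner v_i. The returned pairs are the ground truth.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    k = max(1, round(r * len(nodes)))      # assumed rounding rule
    chosen = rng.sample(nodes, 2 * k)      # distinct nodes, so V_i and V_j are disjoint
    pairs = list(zip(chosen[:k], chosen[k:]))

    g = {u: set(nbrs) for u, nbrs in adj.items()}
    for vi, vj in pairs:
        # Remove v_j and reattach each of its links to v_i.
        for u in g.pop(vj):
            g[u].discard(vj)
            if u != vi:                    # avoid creating a self-loop on v_i
                g[u].add(vi)
                g[vi].add(u)
    return g, pairs


# Ring of 10 nodes, 20% of the nodes contracted (2 merged pairs).
ring = {v: {(v - 1) % 10, (v + 1) % 10} for v in range(10)}
contracted, ground_truth = contract_graph(ring, r=0.2, seed=1)
ambiguous_nodes = {vi for vi, _ in ground_truth}
```

The nodes in `ambiguous_nodes` receive the positive label in the identification experiments, and the original neighborhoods of each pair give the reference partition for the splitting experiments.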
We have tested the performance of FONDUE-NDA, as well as that of the competing methods listed in the following section, on fourteen different datasets listed in Table 1, with their properties shown in Table 2.

FB-SC
Facebook Social Circles network [29]. Consists of anonymized friend lists from Facebook.

FB-PP
Page-Page graph of verified Facebook pages [29]. Nodes represent official Facebook pages while the links are mutual likes between pages.

email
Anonymized network generated using email data from a large European research institution modeling the incoming and outgoing email exchange between its members [29].

STD
A database network of the Computer Science department of the University of Antwerp that represents the connections between students, professors, and courses [30].

PPI
A subnetwork of the BioGRID Interaction Database [31], consisting of the PPI network for Homo sapiens.

lesmis
A network depicting the coappearance of characters in the novel Les Miserables [32].

netscience
A coauthorship network of scientists working on network theory and experiments [29].

polbooks
Network of books about US politics, with edges between books representing frequent copurchasing of books by the same buyers. http://www-personal.umich.edu/~mejn/netdata/, accessed on 20 October 2021

CondMat
Collaboration network of Arxiv Condensed Matter Physics [33].

GrQc
Collaboration network of Arxiv General Relativity [33].

HepTh
Collaboration network of Arxiv Theoretical High Energy Physics [33].

CM03

Quantitative Evaluation of Node Identification
In this section, we focus on answering $Q_1$, namely, given a contracted graph, FONDUE-NDA aims to identify the list of contracted (ambiguous) nodes present in it. We first discuss the datasets used in the experiments in the following section.
Baselines. As mentioned earlier in Section 1, most entity disambiguation methods in the literature focus on the task of re-assigning the edges of an already predefined set of ambiguous nodes, and the process of identifying these nodes in a given non-attributed network is usually overlooked. Thus, there exist very few approaches that tackle the latter case. In this section, we compare FONDUE-NDA with three competing approaches that focus on the identification task: one existing method and two heuristics.
Normalized-Cut (NC) The work of [16] comes close to ours, as their method also aims to identify ambiguous nodes in a given graph, by using Markov clustering to cluster the ego network of a vertex $u$ with the vertex itself removed. NC favors the grouping that gives few cross-edges between different clusters of $u$'s neighbors. The result is a score reflecting the quality of the clustering, using normalized-cut (NC):
$$NC = \sum_{i=1}^{k} \frac{W(C_i, \bar{C}_i)}{W(C_i, C_i) + W(C_i, \bar{C}_i)}$$
with $W(C_i, C_i)$ as the sum of all the edges within cluster $C_i$, $W(C_i, \bar{C}_i)$ the sum of all the edges between cluster $C_i$ and the rest of the network $\bar{C}_i$, and $k$ being the number of clusters in the graph. Although [17] also worked on identifying nodes based on topological features, their method (which is not publicly available) performed worse in all cases when compared to [16], so we only chose the latter as a competing baseline.
Connected-Component Score (CC) We also include another baseline, the connected-component score (CC), relying on the same approach used in [16], with a slight modification. Instead of computing the normalized-cut score based on the clusters of the ego graph of a node, we count the number of connected components of a node's ego graph, with the node itself removed.
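As an illustration of the connected-component heuristic, the following sketch counts the components of a node's ego graph once the node itself is removed. It assumes a 1-hop ego network (the neighbors of the node and the edges among them) and a dictionary-of-sets graph representation; the function name `cc_score` is hypothetical.

```python
def cc_score(adj, u):
    """Number of connected components among u's neighbours, with u removed."""
    nbrs = set(adj[u])
    seen = set()
    n_components = 0
    for start in nbrs:
        if start in seen:
            continue
        # Depth-first search restricted to the neighbours of u.
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend((adj[v] & nbrs) - comp)
        seen |= comp
        n_components += 1
    return n_components


# Node 0 bridges two otherwise unconnected groups {1, 2} and {3, 4}.
graph = {
    0: {1, 2, 3, 4},
    1: {0, 2, 5},
    2: {0, 1},
    3: {0, 4},
    4: {0, 3},
    5: {1},
}
score = cc_score(graph, 0)
```

A node whose neighbors fall apart into several components once it is removed is a natural candidate for being a merge of several entities, which is the intuition this baseline captures.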
Degree Finally, we use node degree as a baseline. As contracted nodes usually tend to have a higher degree, by inheriting more edges from combined nodes, degree is a simple predictor for the node ambiguity.
Evaluation Metric. FONDUE-NDA ranks nodes according to their calculated ambiguity score (how likely a node is to be ambiguous). The same goes for NC and CC. At first glance, the evaluation can be approached from a binary classification perspective, by considering the top X ranked nodes as ambiguous (where X is the number of actual ambiguous nodes), so that we can use the usual metrics for binary classification, such as F1-score, precision, recall, and AUC. However, this requires knowing beforehand the number of actual ambiguous nodes (or setting a clear cutoff value), which is only possible in labeled datasets and controlled experiments. In real-world settings, where FONDUE-NDA is used to detect ambiguous nodes in unlabeled networks, it is more useful to have relevant (ambiguous) nodes ranked more highly than non-relevant nodes. Thus, it is necessary to extend the traditional evaluation methods, which are based on binary relevance judgments, to more flexible graded relevance judgments, such as cumulative gain, which is a form of graded precision, as it is identical to the precision when the rating scale is binary. However, as our datasets are highly imbalanced by nature, mainly because ambiguous nodes are by definition a small part of the network, a better take on the cumulative gain metric is needed. Hence, we employ the normalized discounted cumulative gain to evaluate our method, alongside the traditional binary classification metrics listed above. Below, we detail each metric.
Precision The fraction of the nodes classified as ambiguous that are truly ambiguous.
$$Precision = \frac{TP}{TP + FP}$$
Recall The fraction of the truly ambiguous nodes that are classified as ambiguous.
$$Recall = \frac{TP}{TP + FN}$$
F1-score It is the harmonic mean of the precision and recall, where the F1-score reaches its best value at 1 and its worst at 0.
$$F1 = 2 \cdot \frac{Recall \times Precision}{Recall + Precision}$$
Note that, because exactly X nodes are labeled as ambiguous in this binary classification setting, the number of false positives is equal to the number of false negatives, so the values of the recall, precision, and F1-score will be the same.
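The equality of the three metrics under a top-X cutoff can be checked on a small example. The sketch below assumes scores and binary labels stored in dictionaries keyed by node; the function name `top_x_metrics` and the toy values are illustrative only.

```python
def top_x_metrics(scores, labels):
    """Precision, recall and F1 when the top-X nodes are predicted ambiguous.

    X is the number of truly ambiguous nodes (labels equal to 1), assumed > 0.
    """
    x = sum(labels.values())
    ranked = sorted(scores, key=scores.get, reverse=True)
    predicted = set(ranked[:x])
    tp = sum(labels[v] for v in predicted)
    fp = x - tp            # predicted ambiguous but not ambiguous
    fn = x - tp            # ambiguous but not among the top X
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy ambiguity scores: "a" and "c" are the truly ambiguous nodes.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
labels = {"a": 1, "b": 0, "c": 1, "d": 0}
precision, recall, f1 = top_x_metrics(scores, labels)
```

Here the top two nodes are "a" and "b", so one of the two predictions is correct and all three metrics equal 0.5.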
Area Under the ROC curve (AUC) A ROC curve is a 2D depiction of a classifier's performance, which can be reduced to a single scalar value by calculating the area under the curve (AUC). Essentially, the AUC computes the probability that our measure would rank a randomly chosen ambiguous node (positive example) higher than a randomly chosen non-ambiguous node (negative example). Ideally, this probability is 1, which means our method ranks ambiguous nodes above non-ambiguous ones 100% of the time; the baseline value is 0.5, where the ambiguous and non-ambiguous nodes are indistinguishable. This accuracy measure has been used in other works in this field, including [16], which makes it easier to compare with their work.
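The probabilistic reading of the AUC can be made concrete with a direct pairwise computation. This is a minimal sketch for small examples (it compares every positive with every negative, counting ties as one half); the function name `pairwise_auc` and the toy values are assumptions made for illustration.

```python
def pairwise_auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [scores[v] for v in scores if labels[v] == 1]
    neg = [scores[v] for v in scores if labels[v] == 0]
    # Each (positive, negative) pair counts 1 if ordered correctly, 0.5 on ties.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
labels = {"a": 1, "b": 0, "c": 1, "d": 0}
auc = pairwise_auc(scores, labels)
```

Of the four positive-negative pairs in this example, three are ordered correctly (only "c" is ranked below "b"), so the AUC is 0.75.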
Discounted Cumulative Gain (DCG) The main limitation of the previous metrics, as discussed earlier, is their inability to account for graded scores, as they only handle binary classification. To account for this, we utilize different cumulative-gain-based metrics. Given a search result list, cumulative gain (CG) is the sum of the graded relevance values of all results.
On the other hand, DCG [34] takes position significance into account, and adds a penalty if a highly relevant document appears lower in a search result list, as the graded relevance value is reduced logarithmically in proportion to the position of the result. Practically, it is the sum of the true scores ranked in the order induced by the predicted scores, after applying a logarithmic discount. The higher the DCG, the better the ranking.
Normalized Discounted Cumulative Gain (NDCG) It is commonly used in the information retrieval field to measure the effectiveness of search algorithms, where highly relevant documents are more useful when they appear earlier in the search results, and more useful than marginally relevant documents, which are in turn better than non-relevant documents. It improves upon DCG by accounting for the variation of the relevance, and by providing proper upper and lower bounds so that scores can be averaged across all the relevance scores. Thus, it is computed by summing the true scores ranked in the order induced by the predicted scores, after applying a logarithmic discount, then dividing by the best possible score, the ideal DCG (IDCG, obtained for a perfect ranking), to obtain a score between 0 and 1.

$$NDCG = \frac{DCG}{IDCG}$$
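The computation can be sketched as follows. The sketch assumes the commonly used $\log_2(\text{position} + 1)$ discount, which may differ in detail from the variant used in the experiments; the function names and toy values are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance values listed in ranked order."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))


def ndcg(scores, labels):
    """DCG of the ranking induced by `scores`, divided by the ideal DCG."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    gains = [labels[v] for v in ranked]
    ideal = sorted(labels.values(), reverse=True)   # perfect ranking
    return dcg(gains) / dcg(ideal)


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
labels = {"a": 1, "b": 0, "c": 1, "d": 0}
value = ndcg(scores, labels)
```

In this example the ambiguous nodes sit at positions 1 and 3, giving a DCG of 1.5, whereas the ideal ranking places them at positions 1 and 2; the resulting NDCG is therefore below 1.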
Evaluation pipeline. We first perform network contraction on the original graph, fixing the ratio of ambiguous nodes to $r$. We then embed the network using CNE, and compute the disambiguation measure of FONDUE-NDA (Equation (7)), as well as the baseline measures, for each node. Then, the scores yielded by the measures are compared to the ground truth (i.e., binary labels indicating whether a node is a contracted node). This is done for 3 different values of $r \in \{0.001, 0.01, 0.1\}$. We repeat the process 10 times, using a different random seed to generate the contracted network each time, and average the scores. For the embedding configuration, we set the parameters of CNE to $\sigma_1 = 1$, $\sigma_2 = 2$, with dimensionality limited to $d = 8$.

Results
The results are illustrated in Figure 3 and shown in detail in Table 3, focusing mainly on NDCG, as it is a better measure for assessing the ranking performance of each method. FONDUE-NDA outperforms the state-of-the-art method, as well as non-trivial baselines, in terms of NDCG in most datasets. It is also more robust to variation in the size of the network and in the fraction of ambiguous nodes in the graph. NC seems to struggle to identify ambiguous nodes in smaller networks (Table 2). Additionally, although we tested multiple network settings, with uniformly random contraction (randomly selecting a node-pair and merging the two nodes) or conditional contraction (selecting a node-pair that does not share common neighbors, to realistically mimic collaboration networks), we did not observe any significant changes in the results.

Qualitative Evaluation of Nodes Identification
Now that we have demonstrated that FONDUE-NDA outperforms current state-of-the-art methods for identifying ambiguous nodes, in this section we investigate $Q_2$: the quality of the results produced by FONDUE-NDA and how they compare to other competing methods. Additionally, we investigate whether nodes with a higher ambiguity number (nodes that map to a larger number of entities) are more easily identified as ambiguous than nodes with a lower ambiguity number (nodes that map to fewer entities). For this purpose, semi-synthetic datasets are not valid sources, and we need to rely on a real-world dataset to assess the quality of the ground truth of the ambiguous nodes. We collected data from the National Center for Biotechnology Information, which provides the PubMed datasets. These comprise citations for biomedical literature from MEDLINE, life science journals, and online books, and may include links to full-text content from PubMed Central and publisher websites. Snippets of those data are released periodically and can be used to build author-author collaboration networks. Using the Orcid ID, a persistent digital identifier usually provided in the metadata of the datasets, to distinguish ambiguous nodes, we can build a reliable ground truth for our qualitative experiment and investigate the top-ranked nodes of FONDUE-NDA and the other baselines.
Datasets. PubMed datasets are publicly available and updated regularly. A few data snippets are selected to extract author information and build an author-author collaboration network based on the metadata included in the datasets. The largest connected component of the network is then selected. To assess the ambiguity of the network, we also extract the Orcid ID, if available, for each author in the network. If more than one Orcid ID is associated with one author name, the author name is labeled as ambiguous. Note that not all authors have their Orcid ID listed in the datasets, which may degrade the quality of the labeled datasets: some author names might be ambiguous, but because their Orcid IDs are not recorded, they will not be accounted for in the labels, thus affecting the final results.

Metrics.
Similarly to Section 4.2.2, we first take the binary classification approach, where we consider the top 31 ranked nodes to be classified as ambiguous (as the network contains 31 ambiguous nodes). The same goes for NC. We use the same metrics as in Section 4.2.2. Additionally, to assess the quality of the ranked nodes, we report the number of true positives (TP) (nodes that are correctly identified as ambiguous) and false positives (FP) (nodes that are incorrectly identified as ambiguous). We also consider NDCG for graded relevance.
Results. After building the PubMed collaboration network, which contains 2122 nodes with 31 ambiguous nodes (6 of which are names that refer to more than 2 authors), we embed it using CNE and compute the score measure of FONDUE-NDA (Equation (7)), as well as the baseline measures, for each node. As explained earlier in Section 4.2.2, FONDUE-NDA ranks the nodes by ambiguity score. We use the work of [16] as a baseline for the qualitative comparison. Table 4 shows the performance of FONDUE-NDA against NC in terms of AUC, TP, FP, and F1-score. FONDUE-NDA clearly outperforms NC for binary classification of ambiguous nodes. This is also apparent when inspecting the results in Table 5, which lists the top 10 nodes classified by each method, with FONDUE-NDA successfully classifying 90% of the author names. Note that the results are quite intuitive, and conform with our findings in the earlier analysis of the dataset in Appendix A, as authors with Asian names are more likely to share common names, due to shorter name length and simplified transcription (from Mandarin to English, for example).
We also investigated the ranks of nodes that map to more than 2 entities (highlighted with an asterisk in both tables). Again, FONDUE-NDA outperforms NC, and ranks 3 out of the 6 highly ambiguous names in the top 10.
This confirms the results obtained in Section 4.2.2, as FONDUE-NDA outperforms NC, the state-of-the-art method for identifying ambiguous nodes.
Table 4. Performance of FONDUE-NDA compared to NC for the PubMed dataset containing 31 ambiguous author names, of which 6 are associated with more than 2 Orcid IDs (highly ambiguous). NDCG * reflects the ranking score of those highly ambiguous nodes.

Quantitative Evaluation of Nodes Splitting
Following the identification of the ambiguous nodes, in this section we focus on the task of node splitting, and on answering $Q_3$: how well does FONDUE-NDA perform when it comes to partitioning the set of edges of an ambiguous node between two separate nodes? Simply put, given an ambiguous node $v_i$, we refer to node splitting as the process of replacing this particular node with two different nodes $v_i'$ and $v_i''$ and re-assigning the edges of $v_i$ between them.
Baselines. For the node splitting task, the three baselines previously discussed in Section 4.2.2 are not immediately applicable. However, we adopt the Markov clustering (MCL) approach used in the normalized-cut measure for splitting. Namely, a splitting is given by the MCL clustering of the ego network of an ambiguous node, with the node itself removed.
Evaluation Metric. Given a list of ambiguous nodes, we evaluate the splitting given by FONDUE-NDA and MCL against the ground truth (node splitting according to the original network). This is quantified by computing the adjusted Rand index (ARI) score between FONDUE-NDA and the ground truth, as well as between MCL and the ground truth. ARI score is a similarity measure between two clusterings. ARI ranges between −1 and 1, the higher the score the better the alignment between the two compared clusterings.
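For reference, the ARI between two partitions of the same set of edges can be computed from pair counts, as in the following minimal sketch. It uses the standard contingency-table formulation and assumes at least two items; the function name `adjusted_rand_index` and the toy label lists are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two clusterings given as equally long lists of labels."""
    n = len(labels_a)
    # Pairs of items grouped together in both clusterings.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together in each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:      # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Ground-truth assignment of four edges to the two original nodes,
# compared with two hypothetical predicted splits.
truth = [0, 0, 1, 1]
same_split = [1, 1, 0, 0]      # identical partition, labels swapped
crossed_split = [0, 1, 0, 1]   # every ground-truth group is broken up
ari_same = adjusted_rand_index(truth, same_split)
ari_crossed = adjusted_rand_index(truth, crossed_split)
```

Because the ARI only looks at which items are grouped together, swapping the two labels of a split does not change the score, which is the desired behavior when the two resulting nodes are interchangeable.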
Pipeline. First we compute the ground truth. Then, for each ambiguous node, we evaluate the quality (based on ARI) of the split from FONDUE-NDA and MCL compared to the original partition. We repeat the experiments for three different contraction ratios $r \in \{0.001, 0.01, 0.1\}$ for each dataset. For each ratio, the experiment is repeated 3 times with different seeds.
Results. Despite outperforming NC in ambiguous node identification, FONDUE-NDA underperforms compared to MCL on nearly every dataset (Table 6). This shows that FONDUE-NDA is not adequately optimized for the node splitting task. Several factors contribute to the poor performance, mainly the quality of the embeddings, which leads to poor performance of the objective function for edge assignment. Nonetheless, given the modular approach of our method, for a complete framework for node identification and node splitting we recommend using FONDUE-NDA for the former task and MCL for the latter.

Parameter Sensitivity
In this section, we study the robustness of FONDUE-NDA against different network settings, mainly how the percentage of ambiguous nodes in a graph affects node identification. In the previous experiment (Section 4.2.2), we fixed the ratio of ambiguous nodes to {0.001, 0.01, 0.10}. Here, we follow the same pipeline (generate, embed, evaluate for 10 different random seeds) for different ratios of ambiguous nodes. As listed in Table 3, FONDUE-NDA outperforms MCL and other baselines across nearly all networks with different contraction ratios. We also accounted for the different ways in which node contraction is conducted, as specified in Section 4.2.1. As previously mentioned, the node contraction process is assumed to be a form of corruption of the data, i.e., one that happens in real life (outside an evaluation setup). To simulate this corruption process (so as to generate semi-synthetic test data with known ground truth), merged nodes were selected uniformly at random. This does not guarantee that the network becomes unnatural in a way that suits the embedding objective function, which further strengthens our empirical results: FONDUE-NDA works even when the assumptions are not guaranteed to be satisfied. We also studied a different way to select nodes for contraction when generating semi-synthetic data: only selecting contracted nodes that have no common neighbors. FONDUE-NDA consistently outperformed the baselines with these different network contraction approaches as well in most datasets, as shown in Table 3 and Figure 3.

Execution Time Analysis
The runtime of FONDUE-NDA is linear, O(n), for each node, and the computation is trivially parallelizable. In Figure 4, we show the execution speed of FONDUE-NDA and the baselines in node identification and splitting. FONDUE-NDA is faster than CC, and faster than NC by nearly one order of magnitude in most datasets. Note that FONDUE-NDA approximates Equation (5) by aggregating two different approximation heuristics (i.e., randomized sampling and eigenvector thresholding, as listed in Section 3.3.2). However, the best results (in terms of NDCG and ARI) were obtained with the latter approximation, so the runtime results reflect only the execution time of the latter heuristic. This is listed in detail in Table 7 for each of the 3 measures (FONDUE-NDA, NC, and CC) and for different percentages of contracted nodes (0.1%, 1%, and 10%).

Node Deduplication
In this section we describe the details of the experimental setup used to tackle the NDD problem. Referring to the data listed in Table 1, we performed our experiments on three semi-synthetic datasets. We also used the PubMed dataset, whose metadata contain Orcid IDs, to build a duplicate network with ground truth derived from the author names sharing a common Orcid ID.
Evaluation Pipeline. As we indicated in Section 4.1, we perform post-processing on the graph data to conduct controlled experiments for comparison with other baselines. For the semi-synthetic datasets (Table 2), to simulate the data corruption process behind the NDD problem, we perform node splitting on different datasets. For a given graph $G = (V, E)$ with $n = |V|$, we compute the embedding cost function of $G$. We then randomly choose one node $i$ to split into two nodes $i'$ and $i''$, i.e., adding a new node to the graph, which results in a corrupted (duplicate) graph $\hat{G} = (\hat{V}, \hat{E})$ with $n_d = |\hat{V}| = n + 1$.
We then randomly choose one node-pair from $\hat{V} \times \hat{V}$, and perform node contraction by merging these 2 nodes such that we end up with 1 node containing all the edges of the previous 2, which results in a contracted graph $\hat{G}_c = (\hat{V}_c, \hat{E}_c)$ with $|\hat{V}_c| = n$. We then compute the embedding cost function of $\hat{G}_c$. We repeat the process 99 times, each time choosing a different random node-pair from $\hat{V}_c$ and computing the resulting embedding cost function. Lastly, we compute the embedding cost function after merging the duplicate node-pair $\{i', i''\}$. We compare the value of the objective cost function with that of graph $G$, and display the ranking in Table 8. Although the process of data corruption seems simple, many parameters can affect it. Thus, we introduce a few parameters to better model node splitting:
• Edge distribution: We employ 2 different ways to reassign the edges from 1 node to 2 different nodes, either by randomly distributing the edges with at least 1 edge per node, or by ensuring an equal distribution between the two;
• Minimum degree: We only choose to split nodes with a degree larger than a specified minimum;
• Overlap: We specify whether there is an overlap in the edge reassignment for the different nodes, i.e., the percentage of common edges for each node.
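The corruption step can be sketched for the simplest configuration: a random edge distribution with at least one edge per node and no overlap. The function name `split_node`, its arguments, and the dictionary-of-sets representation are assumptions made for this illustration; the equal-distribution and overlap variants are not covered.

```python
import random


def split_node(adj, i, new_label, seed=0):
    """Split node i into i and new_label, randomly dividing its edges.

    Each of the two resulting nodes keeps at least one edge, and no edge is
    shared between them (no overlap). Returns a new adjacency dict.
    """
    rng = random.Random(seed)
    nbrs = sorted(adj[i])
    if len(nbrs) < 2:
        raise ValueError("node must have degree >= 2 to be split")
    rng.shuffle(nbrs)
    cut = rng.randint(1, len(nbrs) - 1)    # both sides are non-empty
    keep, move = set(nbrs[:cut]), set(nbrs[cut:])

    g = {u: set(n) for u, n in adj.items()}
    g[i] = keep
    g[new_label] = move
    # Neighbours assigned to the new node now point to it instead of i.
    for u in move:
        g[u].discard(i)
        g[u].add(new_label)
    return g


# Star graph: node 0 is split into nodes 0 and "0b".
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
duplicate_graph = split_node(star, 0, "0b", seed=3)
```

Merging the pair {0, "0b"} in `duplicate_graph` recovers the original star, which is the reference against which the randomly chosen merges are ranked.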
Datasets. As indicated earlier, we used three semi-synthetic datasets (Table 2), lesmis, polbooks, and netscience, mainly for their relatively small size. Additionally, we tested FONDUE-NDD on part of the PubMed dataset (described earlier in Section 4.2.3), a network of 2122 nodes, including one duplicate node.

Baselines.
As all competing methods [22,23] require additional labels and attributes for NDD, we opted to use a simple baseline to assess the performance of FONDUE-NDD, namely the L2-norm of the embedding distance between the candidate duplicate node-pairs $\{i', i''\}$ using CNE (referred to as ED).
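The ED baseline amounts to a Euclidean distance between two rows of the embedding, as in the sketch below. The embedding values are made up for illustration and do not come from CNE; the function names are hypothetical.

```python
import math


def embedding_distance(embedding, i, j):
    """L2-norm of the difference between the embeddings of nodes i and j."""
    return math.dist(embedding[i], embedding[j])


def rank_pairs_by_distance(embedding, pairs):
    """Order candidate pairs from closest to farthest in embedding space."""
    return sorted(pairs, key=lambda p: embedding_distance(embedding, *p))


# Hypothetical 2-dimensional embeddings of three nodes.
embedding = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (0.0, 1.0)}
ranked = rank_pairs_by_distance(embedding, [("a", "b"), ("a", "c")])
```

Unlike FONDUE-NDD, this baseline needs no re-embedding: all distances are read directly from the embedding of the duplicate graph.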
Metric. As described in the previous section, we compute the value of the objective function after re-embedding for each of the merged node-pairs. We then rank the node-pairs by this value. We use that rank as a metric of whether our approach can successfully predict which node-pair is a duplicate. The best value would be 1, which means that 100% of the time, FONDUE-NDD is able to identify the duplicate node-pair, as the cost of its re-embedding is the lowest.
Results. The results in Table 8 represent the average ranking of the objective cost function over 100 different trials. We ran a two-sided Fisher test to check whether the averages for the two methods are significantly different (p < 0.05), and the averages are highlighted in bold when this is the case. The results show that for high-degree nodes (higher than the average), FONDUE-NDD outperforms ED, but its performance degrades for low-degree nodes. Additionally, the more connected a corrupted node is, the larger the improvement of the objective function of the recovered network compared to that of the corrupted network. This shows that some of the parameters identified in the previous section play a large role in the identification of the duplicate nodes using FONDUE-NDD. Overall, the intuition behind FONDUE-NDD is supported by the results of the experiments. For the PubMed dataset, we find that the average rank is equal to 4 out of 100, while ED ranked 6th. This also confirms the results on the semi-synthetic data, as the degree of the duplicate node was above the average of the graph.
Execution time. As we do not account for the time of embedding of the initial duplicate network as part of execution time for FONDUE-NDD, the baseline ED has an execution time of 0, as it is directly derived from the embedding of the duplicate graph. FONDUE-NDD performs additional repeated uniform random node contraction then embedding, as specified in the pipeline section, thus the execution time for FONDUE-NDD varies depending on the size of the network and the number of embeddings executed. Results are shown in Table 9.

Discussions
Despite its state-of-the-art performance in identifying ambiguous nodes (Section 4.2.2), FONDUE-NDA's node splitting functionality falls short compared to that of MCL (Section 4.2.4). Nonetheless, we argue that FONDUE-NDA's main feature is to facilitate the identification of ambiguous nodes, which is one of the main contributions of this paper, as its results are consistent across different datasets and contraction ratios, rendering it a versatile tool for network ambiguity detection in the challenging situation when, besides the network topology itself, no additional information (such as node attributes, descriptions, or labels) is available or may be used.
For node deduplication, FONDUE-NDD performed well in settings where the duplicate nodes have a higher-than-average degree compared to the rest of the network, which is arguably the typical case for NDD, as duplicate nodes tend to have a higher degree.
The main limitation of FONDUE is its reliance on the scalability of the embedding method. With CNE as the current backend NE method, scalability is limited to medium-sized networks with fewer than 100,000 nodes.
Implementing additional NE methods for FONDUE-NDA and FONDUE-NDD could be one future direction for exploring and improving the state-of-the-art of NDA and NDD.

Conclusions
In this paper, we formalized both the node deduplication problem and the node disambiguation problem as inverse problems. We presented FONDUE as a novel method that exploits the empirical fact that naturally occurring networks can be embedded well using state-of-the-art network embedding methods, such that the embedding quality of the network after node disambiguation or node deduplication can be used as an inductive bias.
For node deduplication, we showed that FONDUE-NDD, using only the topological properties of a graph, can help identify nodes that are duplicate, with experiments on four different datasets successfully demonstrating the viability of the method. Despite it not being an end-to-end solution, it can facilitate filtering out the best candidate nodes that are duplicates.
For tackling node disambiguation, FONDUE-NDA decomposes this task into two subtasks: identifying ambiguous nodes, and determining how to optimally split them. Using an extensive experimental pipeline, we empirically demonstrated that FONDUE-NDA outperforms the state-of-the-art when it comes to the accuracy of identifying ambiguous nodes, by a substantial margin and uniformly across a wide range of benchmark datasets of varying size, proportion of ambiguous nodes, and domain, while keeping the computational cost lower than that of the best baseline method, by nearly one order of magnitude.
On the other hand, the boost in ambiguous node identification accuracy was not observed for the node splitting task, where FONDUE-NDA underperformed compared to the competing baseline, Markov clustering. Thus, we suggested a combination of FONDUE for node identification, and Markov clustering on the ego-networks of ambiguous nodes for node splitting, as the most accurate approach to address the full node disambiguation problem.

Data Availability Statement:
All data used in this study are publicly available, from other sources. See Section 3.1 and Table 1 for sources.

Conflicts of Interest:
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used frequently in this manuscript:
NDA: Node Disambiguation
NDD: Node Deduplication
NE: Network Embeddings
CNE: Conditional Network Embeddings

Appendix A. Real-World Example: PubMed Dataset
To help illustrate the aforementioned problems in the real world, we processed the publicly available PubMed dataset, courtesy of the U.S. National Library of Medicine, which comprises more than 31 million citations for biomedical literature from various sources, with metadata ranging from author names to their Orcid IDs. An Orcid ID is a unique author identifier used to distinguish researchers. It is useful for uncovering ambiguous author names (an author name that has multiple Orcid IDs) and duplicate author names (distinct names that correspond to the same Orcid ID). A quick analysis of the 17 million author records present in the dataset reveals 154,247 duplicate author names (different names that refer to the same author) and 62,506 ambiguous author names (different authors with the same name). Figure A1 shows the frequency histogram of ambiguous author names with multiple Orcid IDs, which ranges from 2 up to more than 200 Orcid IDs per name. The most common cause of this problem is that distinct authors share the same name. This is very common for authors of Asian background, as the romanization of their names is often ambiguous. The most ambiguous names in the dataset are 'Wei Zhang' and 'Wei Wang', each associated with more than 200 different authors.

Figure A1. The frequency of ambiguous author names with multiple Orcid IDs. It represents the frequency of names that are shared by distinct authors. 'Wei Zhang' and 'Wei Wang' are the most common names, each shared by more than 200 distinct authors.
As for duplicate author names, shown in Figure A2, the main cause is multiple name variations for the same author. For example, the Orcid ID 0000000240600292 is associated with 4 variations of the name 'Robert Henry' (including Robert J. Henry, Robert James Henry, and R. Henry) that refer to the same author in the dataset. The second cause is erroneously parsed metadata in the PubMed baseline dataset, where coauthors are assigned the same Orcid ID as one of the other authors.

Figure A2. The frequency of different author names that share the same Orcid ID. It represents Orcid IDs that are present in more than one author name entry, due to author name variation.

Both problems arise from the processing and parsing of the data, and are very common for this type of data.
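The analysis described in this appendix can be illustrated with a minimal sketch, assuming the metadata has already been reduced to (author name, Orcid ID) pairs. The function name, the toy records, and the placeholder identifiers below are hypothetical and are not taken from the PubMed dataset or from the processing code used in this study.

```python
from collections import defaultdict


def find_ambiguous_and_duplicate(records):
    """Group (author_name, orcid_id) pairs in both directions.

    Returns two dicts:
      ambiguous:  name -> set of Orcid IDs, for names with more than one ID
      duplicates: Orcid ID -> set of names, for IDs with more than one name
    """
    ids_per_name = defaultdict(set)
    names_per_id = defaultdict(set)
    for name, orcid in records:
        ids_per_name[name].add(orcid)
        names_per_id[orcid].add(name)

    ambiguous = {n: ids for n, ids in ids_per_name.items() if len(ids) > 1}
    duplicates = {i: ns for i, ns in names_per_id.items() if len(ns) > 1}
    return ambiguous, duplicates


# Toy records with placeholder identifiers (assumption, for illustration only).
records = [
    ("Wei Zhang", "id-1"),
    ("Wei Zhang", "id-2"),
    ("Robert J. Henry", "id-3"),
    ("R. Henry", "id-3"),
    ("Jane Doe", "id-4"),
]
ambiguous, duplicates = find_ambiguous_and_duplicate(records)
```

In this sketch, the size of each set in `ambiguous` corresponds to the quantity plotted in Figure A1, and the size of each set in `duplicates` to the quantity plotted in Figure A2. Erroneously parsed metadata would also show up in `duplicates`, so the output is a candidate list rather than a ground truth.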

Appendix B. Tabulated Results for Ambiguous Node Identification
As argued in Section 4.2.2, the best metric for evaluating this task is NDCG. Nonetheless, as many approaches use the classical binary classification metrics, we present below detailed results of the experiments.

Table A1. Performance evaluation for AUC score, DCG and F1-score on multiple datasets for our FONDUE-NDA compared with other baselines. Note that for some datasets with a small number of nodes, we did not perform any contraction for 0.001, as the number of contracted nodes in this case is very small; the corresponding values are therefore replaced by "−".
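For readers unfamiliar with the ranking metric referred to above, the following is a minimal sketch of NDCG with binary relevance labels. It assumes the common logarithmic discount, 1/log2(rank + 1), and a label of 1 for truly ambiguous nodes. The exact variant used in the experiments is defined in Section 4.2.2 and may differ, and the function name `ndcg` is introduced here for illustration only.

```python
import math


def ndcg(scores, labels):
    """NDCG of a ranking induced by `scores` against binary `labels`.

    scores: ambiguity score per node (higher means more likely ambiguous)
    labels: 1 if the node is truly ambiguous, 0 otherwise
    Assumes the standard discount 1 / log2(rank + 1), with ranks starting at 1.
    """
    # Rank nodes from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(labels[i] / math.log2(pos + 2) for pos, i in enumerate(order))

    # The ideal ranking places all relevant nodes first.
    ideal = sorted(labels, reverse=True)
    idcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A ranking that places every ambiguous node ahead of every non-ambiguous node scores 1, and the score decreases as ambiguous nodes appear lower in the ranking, which is why the metric suits a task where only the top-ranked candidates are inspected.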