Coarse-to-Fine Entity Alignment for Chinese Heterogeneous Encyclopedia Knowledge Base

Abstract: Entity alignment (EA) aims to automatically determine whether an entity pair in different knowledge bases or knowledge graphs refers to the same real-world entity. Inspired by human cognitive mechanisms, we propose a coarse-to-fine entity alignment model (called CFEA) consisting of three stages: coarse-grained, middle-grained, and fine-grained. In the coarse-grained stage, a pruning strategy based on the restriction of entity types is adopted to reduce the number of candidate matching entities; the goal of this stage is to filter out pairs of entities that are clearly not the same entity. In the middle-grained stage, we calculate the similarity of entity pairs from key attribute values and matched attribute values; the goal is to identify entity pairs that are obviously the same entity or obviously not. After this step, the number of candidate entity pairs is further reduced. In the fine-grained stage, contextual information, such as abstract and description text, is considered, and topic modeling is carried out to achieve more accurate matching. The basic idea of this stage is to use richer information to judge entity pairs that are difficult to distinguish with the basic information of the first two stages. Experimental results on real-world datasets verify the effectiveness of our model compared with baselines.

There is an urgent need to effectively integrate knowledge data from multiple heterogeneous data sources, and to finally form a large encyclopedic knowledge base with a clear logical structure and complete contents. Therefore, it is extremely significant to be able to associate various entities from different data sources with the same entity in the real world, that is, the task of entity alignment [13,14].
Current entity alignment methods are mainly divided into two categories [13,14]: embedding-based methods and symbol-based methods. The advantage of embedding-based methods is that they are computationally efficient, because the similarity of two entities is calculated from the distance between their vectors. However, current embedding-based methods require a large amount of labeled data to train the model. The advantage of symbol-based methods is that they do not rely on labeled data and have high accuracy, but they have high complexity and are time consuming, because the similarity of every possible entity pair needs to be calculated. Moreover, since the same entity may have different names, judging two entities based only on symbolic similarity results in a low recall rate.
In this paper, we propose a three-stage model, inspired by the human brain's "large-scale-first" cognitive mechanism (from coarse-grained to fine-grained), for aligning heterogeneous Chinese encyclopedia knowledge bases. The model is named coarse-to-fine entity alignment (CFEA). The experimental results on real Chinese encyclopedia knowledge graph data show the effectiveness of our method. Our major contributions are as follows.
• We design a three-stage unsupervised entity alignment algorithm from the perspective of the human "large-scale-first" cognitive mechanism. It gradually prunes and matches candidate entity pairs from coarse-grained to fine-grained. We use simple methods in the first and second stages, which greatly reduces the number of entity pairs that need to be compared, thereby improving the efficiency and accuracy of the algorithm through pruning.
• We combine different types of information to improve performance. In the three stages of the model, three kinds of information are used: entity type information, attribute information, and text information. Ordered by the difficulty of identifying entity pairs, the amount of information used and the complexity of the model gradually increase from the first stage to the third.
• We present an experimental study on a real Chinese encyclopedia dataset containing entity attributes and contexts. The experimental results demonstrate that our algorithm outperforms baseline methods.
The rest of the paper is structured as follows. Section 2 summarizes the related work. Section 3 explains our proposed CFEA model. Section 4 presents the experimental result and analysis. Section 5 concludes this paper.

Related Work
In this section, we first analyze CFEA and related entity alignment models from two perspectives: informational scope and method of analysis, which are explained in Sections 2.1 and 2.2, respectively. After that, we analyze relevant EA models for Chinese encyclopedia knowledge bases in Section 2.3.

Collective Alignment
From the perspective of the scope of information to be considered, EA methods can be divided into pair-wise alignment, local collective alignment, and global collective alignment [15].
The pair-wise entity alignment method compares the attribute information of the current matching entities in pairs. It considers the similarity of entity attributes or specific attribute information, but does not consider the relationships between matching entities. Newcombe [16] and Fellegi [17] established a probabilistic model of the entity-matching problem, based on attribute similarity scores, by converting the problem into a classification problem with three classes: matching, possible matching, and non-matching. This model is an important method of entity alignment. An intuitive classification method is to add the similarity scores of all attributes to obtain a total similarity score sim_sum_attr, and then set two similarity thresholds to determine which interval the total score falls in:

    sim_sum_attr(e1, e2) ≥ u       ⇒ e1, e2 matched
    v ≤ sim_sum_attr(e1, e2) < u   ⇒ e1, e2 possibly matched
    sim_sum_attr(e1, e2) < v       ⇒ e1, e2 mismatched

Here, e1, e2 are the entity pair to be matched; u and v are the upper and lower thresholds; sim_sum_attr first calculates the matched attribute pairs of e1 and e2 and then sums their similarities to obtain a total similarity score. The main problem with this simple method is that it does not reflect the influence of different attributes on the final similarity. An important solution is to assign different weights to each matched attribute to reflect its importance to the alignment result. Herzog described this idea formally by establishing a probability-based entity link model based on the Fellegi-Sunter model [18]. The pair-wise entity alignment method is simple to operate and convenient to compute, but it has the disadvantage of considering little information and lacking semantic information.
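As a sketch, the three-way threshold decision described above might look like the following in Python (the function name and interface are illustrative, not from the paper's implementation):

```python
def classify_pair(attr_sims, u, v):
    """Three-way decision from summed attribute similarities.

    attr_sims: similarity scores of the matched attribute pairs of (e1, e2).
    u, v: upper and lower thresholds (u > v).
    """
    total = sum(attr_sims)
    if total >= u:
        return "matched"
    if total >= v:
        return "possible"
    return "mismatched"
```

Assigning a per-attribute weight, as in the Fellegi-Sunter line of work, would simply replace `sum(attr_sims)` with a weighted sum.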
The middle-grained part of the CFEA model proposed in this paper is also constructed based on the classification idea and the weight-assignment idea, and thus belongs to the pair-wise entity alignment methods.
Local collective entity alignment considers not only the attribute similarity of matching entity pairs but also the similarity of their relationships. One method is to assign different weights to the attributes of the entity itself and the attributes of its neighboring entities, and to calculate the overall similarity by weighted summation. This method can be formalized as

    sim(e1, e2) = α · sim_attr(e1, e2) + (1 − α) · sim_NB(e1, e2),

with

    sim_attr(e1, e2) = Σ_{(a1, a2) ∈ attr(e1, e2)} sim(a1, a2);
    sim_NB(e1, e2) = Σ_{(e1', e2') ∈ NB(e1, e2)} sim_attr(e1', e2').
Here, sim_attr(e1, e2) is the attribute similarity function of the entity pair; sim_NB(e1, e2) measures the similarity of the neighbors of the entity pair; 0 ≤ α ≤ 1 is the parameter balancing the two functions. This method treats the relationships of an entity as special attributes and brings them into the calculation; in essence, it is still a pair-wise entity alignment method, so local collective entity alignment does not truly employ a "collective" approach. Global collective entity alignment truly realizes the collective approach. One way is collective alignment based on similarity propagation, which iteratively generates new matches [19-21] in a "bootstrapping" manner from an initial matching. For example, two similarly named authors who have a "coauthor" relationship with two aligned author entities may have a higher degree of similarity, and this similarity is then propagated to more entities. The other way is based on probability models. This type of method establishes a complex probability model over entities, their relationships, and the matching decisions. Generally, statistical relational learning is used for calculation and reasoning; methods such as logical representation, probabilistic reasoning, uncertainty processing, machine learning, and data mining are integrated with relations to obtain a likelihood model of the relational data. Bayesian network models [22,23], LDA models [24,25], CRF (conditional random field) models [26,27], and MLN (Markov logic network) models [28,29] are commonly used probability models. Probabilistic models provide a standard method of relationship modeling, which can effectively improve the matching results. Among them, the LDA model is an unsupervised model that mines the latent topic information of text.
It can simplify the data representation of a large-scale dataset while retaining the basic information needed for data analysis, such as correlation, similarity, or clustering. It has the advantages of wide applicability and adaptability to large-scale datasets.
Among these probabilistic models, latent Dirichlet allocation (LDA) [30] is a three-layer Bayesian probabilistic model consisting of words, topics, and documents. It uses the prior parameters α and β, the corpus D, and the parameters θ_d and φ_k to generate documents, and obtains the posterior P(θ, φ | D) by learning from the corpus. The derived parameters θ_d and φ_k determine the topic of a document at the semantic level. Figure 1 shows the Bayesian network diagram of the document generation process of the LDA model. The corresponding process of document generation is as follows.

1. Sample from the distribution Dirichlet(α) to get the topic distribution θ_d of document d.
2. Sample from the distribution Multinomial(θ_d) to get the topic z_n of word n in document d.
3. Sample from the distribution Dirichlet(β) to get the word distribution φ_k of topic k.
4. Sample from the distribution Multinomial(φ_k) to get the word w_n.

Finally, the document is generated after the above repeated sampling. The CFEA (coarse-to-fine entity alignment) model proposed in this paper combines pair-wise entity alignment and collective entity alignment. The middle-grained model (detailed in Section 3.4) only considers the attribute information of the entity pair currently being matched; thus, it is a pair-wise entity alignment method. The fine-grained model (detailed in Section 3.5) mainly uses the LDA method. The LDA topic model obtains, from the semantic level of the text, additional entity information that cannot be obtained from attributes, to help align entities. This is a global collective entity alignment method: it performs topic modeling on all texts in the dataset, and there is a topic-level influence between any two entities.
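The four sampling steps above can be sketched with a toy generator. The `dirichlet` and `categorical` helpers and the hyperparameter values are illustrative stand-ins, not the paper's implementation:

```python
import random

def dirichlet(alpha):
    # Sample from Dirichlet(alpha) via normalized Gamma draws.
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def categorical(probs):
    # Draw an index from a discrete probability distribution.
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_document(n_words, alpha, beta, vocab):
    """Toy LDA generative process: alpha is K-dimensional, beta is V-dimensional."""
    theta = dirichlet(alpha)                   # step 1: topic mix of the document
    phi = [dirichlet(beta) for _ in alpha]     # step 3: word distribution per topic
    words = []
    for _ in range(n_words):
        z = categorical(theta)                 # step 2: topic of this word
        w = categorical(phi[z])                # step 4: word drawn from that topic
        words.append(vocab[w])
    return words
```

Repeating the loop body n_words times yields one document; a corpus is generated by redrawing θ_d per document while sharing the topic-word distributions.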

Similarity-Based Alignment
From the perspective of information analysis methods, entity alignment methods can be classified into methods based on web ontology semantics [31], methods based on rule analysis, and methods based on similarity judgment.
Entity alignment based on web ontology semantics uses the web ontology language (OWL) to describe the ontology and to perform reasoning and alignment. This type of method combines the semantic information of OWL in the knowledge base, such as heuristic reasoning algorithms [32] and iterative discriminant models [33], to judge whether entities can be aligned. However, Chinese encyclopedia knowledge bases, such as Baidu Baike and Hudong Baike, do not contain complete ontology information, so it is difficult to align them according to OWL semantics.
The entity alignment method based on rule analysis uses rule-based evaluation functions to judge and align entities by formulating special rules for specific application scenarios. However, this approach is not universal: alignment based on rule analysis is only possible by establishing a large number of different rules for many specific areas. Chinese web encyclopedias contain a large amount of entity information spanning many fields, and it is difficult to align them by establishing a large number of field-specific rules.
Therefore, for Chinese encyclopedia knowledge bases that lack complete ontology information and cover a large number of subject areas, most research focuses on similarity-based methods, which are widely used. A similarity-based method considers the attribute value information or description text of an entity and calculates the similarity of such information for alignment. Some research has used similarity-based fuzzy matching, topic models, etc., to align the entities of encyclopedia knowledge bases. The main approaches to attribute similarity calculation are token-based, LCS-based, and edit distance-based similarity calculations.
• Token-based similarity calculation. This method uses a function to convert the text string to be matched into a collection of substrings, called a set of tokens, and then uses a similarity function, such as the Jaccard similarity, cosine similarity, or q-gram similarity, to compare the two token sets. Different similarity functions have different characteristics.
• LCS-based similarity calculation. This method finds the longest common subsequence of two text sequences, and its length relative to the original sequences is used as the similarity.
• Edit distance-based similarity calculation. This method regards the text string to be matched as a whole, and the minimum cost of the editing operations required to convert one string into another is used as a measure of the similarity of the two strings. Basic editing operations include insertion, deletion, replacement, and transposition.
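As an illustration, the token-based and LCS-based measures might be sketched as follows (using character bigrams as tokens is one common choice, assumed here):

```python
def jaccard_bigram(s1, s2):
    """Token-based similarity: Jaccard over character-bigram token sets."""
    t1 = {s1[i:i + 2] for i in range(len(s1) - 1)}
    t2 = {s2[i:i + 2] for i in range(len(s2) - 1)}
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 1.0

def lcs_ratio(s1, s2):
    """LCS-based similarity: longest common subsequence length divided
    by the length of the longer string."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if s1[i] == s2[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    longer = max(m, n)
    return dp[m][n] / longer if longer else 1.0
```

Both functions return a score in [0, 1]; normalizing by the longer string is one of several conventions.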
The middle-grained model of our CFEA model uses similarity calculation based on edit distance. Edit distance can effectively deal with error-sensitive issues such as input errors, which is of great significance for web encyclopedia data that consists of user-generated content (UGC). Commonly used edit distance-based similarity functions are based on the Levenshtein distance [34], Smith-Waterman distance [35], affine gap distance [36], or Jaro and Jaro-Winkler distances [37,38]. The middle-grained model uses a similarity function based on the Levenshtein distance, which is introduced in detail in Section 3.4.

Alignment for Chinese Encyclopedia
The main existing methods for Chinese encyclopedia knowledge base entity alignment are based on attribute similarity calculation. The methods in [39,40] only use the attribute values of an entity to calculate entity similarity and determine whether entities can be aligned. Their overall performance is poor due to the irregularity of attribute names and values in web encyclopedias.
In addition, the similarity calculation of contextual information composed of entity abstracts and descriptive texts is also an important method in encyclopedia knowledge base entity alignment. On the one hand, additional text information is needed as a supplement because of the heterogeneity and ambiguity of structured attribute information across knowledge bases; on the other hand, using the large amount of unstructured entity abstracts and descriptive text in the knowledge base, we can obtain latent semantics and relationships and construct entity features, which can help further alignment. One way to consider contextual information is to use TF-IDF [39] for text modeling. However, traditional text modeling methods such as TF-IDF only consider word frequency and do not consider the semantic relationships between terms. The LDA topic model can mine the underlying topic semantics of text in an unsupervised manner. The model in [41] combined LCS and LDA and achieved a high accuracy rate; however, it lacked a granularity-based classification/partition index and entailed no coarse-to-fine process. The text co-training model [42] uses a semi-supervised method to learn multiple features of entities. The paper [43] improved the LDA model, and its overall performance improved on some datasets; however, it is difficult for a single LDA model to achieve a high accuracy rate on a wider range of datasets, and the time consumption of LDA similarity calculation and matching is relatively high. WA-Word2Vec [44] uses LTP (language technology platform) to identify the named entities in the text, then processes word vectors and obtains feature vectors with Word2Vec. The latest method, MED-Doc2Vec [45], uses the minimum edit distance and the Doc2vec model to obtain feature vectors containing semantic information. It obtains a comprehensive entity similarity by weighted averaging to complete the entity alignment task, and achieved better results in experiments.
In summary, the entity alignment methods currently applied to Chinese encyclopedia knowledge bases struggle to achieve a balance between accuracy and time. Therefore, we propose combining the edit distance-based method with LDA topic modeling, applying the coarse-to-fine idea to the unsupervised Chinese encyclopedia entity alignment task.

CFEA Model
In this section, we discuss some of the definitions and task descriptions in this paper in Section 3.1. Then, the framework of our model is described in Section 3.2. Finally, the three stages of the CFEA model are detailed in Sections 3.3-3.5, respectively.

Preliminaries and Task Description
In this subsection, we first discuss the definition of a heterogeneous encyclopedia knowledge base with the task description of entity alignment. Then, the notations used are listed.

Definition 1.
Heterogeneous encyclopedia knowledge base. The heterogeneous encyclopedia knowledge base comes from encyclopedia websites and contains semi-structured entity attributes and unstructured contextual information, the latter consisting of abstracts and descriptive text. These types of information are diverse in structure and ambiguous in content across knowledge bases. For instance, Baidu Baike, Hudong Baike, and Chinese Wikipedia are quite distinct in the structure and content of their web pages.

Definition 2. Entity Alignment. Entity alignment (EA, also called entity matching) aims to identify entities located in different knowledge bases that refer to the same real-world object. An illustration of entity alignment for knowledge bases is shown in Figure 2.
Figure 2. Illustration of knowledge base entity alignment (reproduced from [46]). KB1 and KB2 are two different knowledge bases, and Entity1 and Entity2 are entities in the two knowledge bases, respectively. Both refer to the same entity, Entity0, in the real world.
The notations used in this paper are listed in Table 1 in order of their appearance. The steps of the CFEA model as applied to a heterogeneous encyclopedia knowledge base are as follows. (1) We first extract the attributes, abstracts, and descriptions of the entities in the dataset, and combine the latter two as contextual information. (2) Secondly, we block the knowledge base and build the topic feature base. The blocking algorithm groups entities that meet the attribute requirements into corresponding categories; it is the coarse-grained part of CFEA. The construction of the topic feature base requires all contexts to be processed by Chinese word segmentation and input to the LDA model for unsupervised training, which in turn yields the latent topic features of the entities. (3) After that, the entities are aligned. For pairs of entities to be aligned from different data sources, we use the edit distance-based attribute similarity calculation as the middle-grained part for a preliminary judgment. If the similarity score meets the upper threshold, the pair of entities is directly judged as aligned; if it is beneath the lower threshold, the pair is judged as non-aligned; otherwise, it lies between the two thresholds, which means that the current attribute information is insufficient for a judgment. (4) Finally, we explore the fine-grained perspective with the well-trained LDA topic model: we further consider the contextual information of the current entities and calculate their topic similarity. After the above steps, all matching entities are generated in a coarse-to-fine process.
The process of coarse-to-fine entity alignment is described in Algorithm 1. Human cognition is a process of refinement from a coarse-grained perspective to a fine-grained perspective [47]. Most existing collective entity alignment methods have high time complexity. Therefore, we introduce the human brain's "global precedence" cognitive mechanism. We first obtain the category to which each entity belongs as the coarse-grained step, and weed out candidates of different types in the subsequent process of entity alignment. Since these entities cannot match entities of other types, our approach reduces the runtime by reducing the size of the candidate knowledge base.
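The three-stage decision flow just described can be sketched as follows. All callables and thresholds (`same_type`, `attr_sim`, `topic_sim`, `u`, `v`, `t`) are hypothetical stand-ins for the components detailed in Sections 3.3-3.5, not the paper's exact algorithm:

```python
def cfea_align(pairs, same_type, attr_sim, topic_sim, u, v, t):
    """Coarse-to-fine alignment sketch over candidate entity pairs."""
    matched = []
    for e1, e2 in pairs:
        # Coarse-grained: prune pairs whose entity types differ.
        if not same_type(e1, e2):
            continue
        # Middle-grained: decide the clear cases from attribute similarity.
        s = attr_sim(e1, e2)
        if s >= u:
            matched.append((e1, e2))
        elif s >= v:
            # Fine-grained: ambiguous pairs fall back to topic similarity.
            if topic_sim(e1, e2) >= t:
                matched.append((e1, e2))
        # s < v: judged non-aligned, nothing to do.
    return matched
```

The coarse stage costs only a type check per pair, so the expensive topic comparison runs on a small residue of ambiguous candidates.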

Coarse-Grained Blocking Algorithm
For instance, searching for the keyword "Milan" in Chinese on Baidu Baike yields 21 meanings. As shown in Figure 4, the diagram lists some of these meanings, such as a city in Italy, the Milan soccer club, and an orchid plant. These meanings can be classified as city names, names of people, names of books or films, names of plants, names of sports clubs, etc. When the current entity named "Milan" is judged by its attributes to be a person's name rather than a city's name or something else, we only need to look for candidates among the person entities. As a result, the number of candidate entities is significantly reduced from 21 to a single-digit number. In the field of database technology, the method of pre-processing before entity matching is called "blocking and index techniques" [48]. Blocking and index techniques assign entities with certain similar characteristics to the same block, so that entity matching is done only within the same block. This technique, which comes from the database domain, can be applied to entity alignment for knowledge bases.
Our method requires setting a "preset attribute list". As shown in Figure 5a, this is a two-dimensional list, in which the first dimension contains the categories we include and the second dimension contains the names of the feature attributes we set for each category. The preset attribute list is versatile and flexible for blocking an encyclopedic knowledge base, which usually contains many kinds of entities. Based on the official category indexes of several encyclopedia sites, and after removing categories with little difference or with containment relationships, we obtain the preset attribute list by collating them. It is generally applicable to other Chinese encyclopedic knowledge bases as well.
In addition, the list can be extended to improve the efficiency and accuracy of blocking. If the two-dimensional list is extended to a multi-dimensional list, i.e., with sub-categories inside a large category, then the scope and accuracy of the classification can be flexibly controlled at multiple levels. Figure 5b is an example of a coarse-grained category in a three-dimensional list. This allows more flexibility in controlling whether the matching range is coarse or fine and whether the entity matching speed is fast or slow, and in finding a balance when matching entities.
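A minimal blocking sketch with a preset attribute list might look like this. The categories and Chinese feature attribute names below are illustrative examples, not the list actually used in the paper:

```python
# Hypothetical preset attribute list: category -> feature attribute names.
PRESET_ATTRS = {
    "person": {"出生日期", "国籍", "职业"},    # birth date, nationality, occupation
    "city": {"面积", "人口", "行政区类别"},    # area, population, admin division
}

def block_entity(entity_attr_names, preset=PRESET_ATTRS):
    """Assign an entity to the category whose feature attributes overlap
    most with the entity's own attribute names; None if nothing matches."""
    best, best_overlap = None, 0
    for category, feats in preset.items():
        overlap = len(feats & set(entity_attr_names))
        if overlap > best_overlap:
            best, best_overlap = category, overlap
    return best
```

After blocking, candidate pairs are only formed between entities assigned to the same category, which is what shrinks the "Milan" candidate set from 21 meanings to a handful.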

Middle-Grained Attribute Similarity
The attributes of entities in the knowledge base are the fundamental information for the study of alignment. However, there are two problems with the attributes in Chinese web encyclopedias: on the one hand, the heterogeneous encyclopedic knowledge base covers voluminous subject areas and does not have complete ontological information; on the other hand, the web data of user-generated content (UGC) introduces the problem of ambiguity.
For the former problem, the similarity-based approach allows distinguishing the entities in the encyclopedic knowledge base, especially those with more accurate attribute definitions. For the latter problem, we use an edit distance-based approach to calculate the similarity. The reason for this is that an edit distance-based approach can effectively deal with error sensitivity issues, such as entry errors. This is meaningful for web encyclopedia data that belongs to UGC. Therefore, we propose an attribute similarity calculation method suitable for encyclopedic knowledge bases based on the edit distance algorithm described in [34].
Specifically, the Levenshtein distance is a typical edit distance measure of string similarity. The edit distance is the minimum number of operations required to convert string S1 into string S2; the operations on characters include insertion, deletion, and replacement. The smaller the distance, the more similar the two strings. For two strings S1 and S2, the Levenshtein distance [34] is computed by the recurrence

    lev(i, j) = max(i, j)                              if min(i, j) = 0,
    lev(i, j) = min( lev(i−1, j) + 1,
                     lev(i, j−1) + 1,
                     lev(i−1, j−1) + [S1[i] ≠ S2[j]] ) otherwise,

where S1[i] and S2[j] are the i-th character of string S1 and the j-th character of string S2, respectively, and [S1[i] ≠ S2[j]] is 1 if the characters differ and 0 otherwise. We apply the edit distance to attribute name alignment and attribute value similarity calculation, respectively. Different knowledge bases vary not only in their textual descriptions and semi-structured attribute values but also in which attributes they contain and how those attributes are named. The first use of edit distance-based similarity in our CFEA model is to match the attribute names of different entities, i.e., attribute alignment. Only attribute pairs that reach a certain threshold are matched for the subsequent value alignment; otherwise, they are treated as invalid information. The second use is to calculate the similarity of the attribute values of the entity pairs to be matched. The detailed calculation is shown below.
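A compact dynamic-programming implementation of the Levenshtein distance, plus a normalized similarity. Normalizing by the longer string's length is a common convention and an assumption here, not necessarily the paper's exact formula:

```python
def levenshtein(s1, s2):
    """Levenshtein distance via a rolling one-row DP table."""
    m, n = len(s1), len(s2)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # deletion
                        dp[j - 1] + 1,                    # insertion
                        prev + (s1[i - 1] != s2[j - 1]))  # substitution
            prev = cur
    return dp[n]

def edit_sim(s1, s2):
    """Similarity in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s1, s2) / longest
```

The rolling-row variant keeps memory at O(min-row) rather than the full O(m·n) table while computing the same recurrence.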

1. Attributes of the encyclopedia knowledge base.
Definition 3. After preprocessing of the dataset, entity e_a has the set of attribute names Attribute_a = {p_a1, p_a2, p_a3, ..., p_am}, with the corresponding set of attribute values Value_a = {v_a1, v_a2, v_a3, ..., v_am}. Entity e_b's set of attribute names is Attribute_b = {p_b1, p_b2, p_b3, ..., p_bn}, with the set of attribute values Value_b = {v_b1, v_b2, v_b3, ..., v_bn}, where m and n are the numbers of attributes of the entities e_a and e_b, respectively.

2. The condition for matching attribute names from Attribute_a and Attribute_b is

    sim_edit(p_a, p_b) ≥ c ⇒ p_a, p_b matched,

where p_a and p_b are attribute names from Attribute_a and Attribute_b, respectively, sim_edit is the Levenshtein-based similarity, and c is the minimum threshold for an attribute name match.

3. For each matched attribute name p_i ∈ interAttribute(e_a, e_b), the similarity of attribute p_i is calculated as

    sim(p_i) = sim_edit(v_ai, v_bi),

where v_ai and v_bi are the attribute values corresponding to the attribute name p_i in Value_a and Value_b, respectively.
4. The attribute similarity between the entities e_a and e_b is calculated as

    sim_attr(e_a, e_b) = (1 / T) · Σ_{p_i ∈ interAttribute(e_a, e_b)} sim(p_i),

where T = len(interAttribute(e_a, e_b)).

Fine-Grained Context Topicalization
The effectiveness of simple attribute similarity-based matching is insufficient, especially in the case of missing and ambiguous information of attributes. However, most of the Chinese web encyclopedia websites have the problems of heterogeneity and ambiguity, such as the lack of attributes and the complicated interleaving of information. This causes difficulties for the entity-matching problem based on attribute information only. Therefore, additional contextual information is needed to supplement it.
Entities in an encyclopedic knowledge base site not only contain structured or semistructured knowledge such as attribute tables but usually also contain texts of description and abstracts of the webpages to which it belongs. The description text describes the entity from different aspects, and the abstract provides a brief summary of the relevant webpages. We refer to the two together as the context information of an entity. Using the vast amount of unstructured contexts present in the knowledge base, we obtain the underlying semantics and relationships and construct entity features for deeper studies through complex probabilistic model building and statistical relationship learning.
An essential feature is that the contexts for the same entity may differ across different knowledge bases, but their topics are similar. Therefore, to exclude the interference caused by textual differences, we perform LDA-based topic modeling of the context and apply it to fine-grained entity alignment.
The similarity calculation of contexts using the LDA topic model includes the following steps:

1.
Data acquisition and pre-processing. In this part, we obtained corpora from Baidu Baike and Hudong Baike. These corpora include abstracts and descriptions of encyclopedic entities. After that, we performed Chinese word segmentation and filtered out stopwords. We collated four stopword lists (https://github.com/goto456/stopwords accessed on 17 December 2021), cn-stopwords, hit-stopwords, baidu-stopwords, and scu-stopwords, from different companies and universities, with our own supplements.

2.
Topic modeling and parameter estimation. LDA modeling finds the topic distribution of each document and the word distribution of each topic. LDA assumes that the prior document-topic distribution is a Dirichlet distribution, that is, for any document d, its topic distribution θ_d is

    θ_d ~ Dirichlet(α),

where α is the hyperparameter of the distribution and is a K-dimensional vector, and K is a pre-defined number of topics. Furthermore, LDA assumes that the prior topic-word distribution is also a Dirichlet distribution, that is, for any topic k, its word distribution φ_k is

    φ_k ~ Dirichlet(β),

where β is the hyperparameter of the distribution and is a V-dimensional vector, and V is the number of words. Based on the Dirichlet distributions, we conduct topic modeling on the pre-processed corpus. We then perform parameter estimation for the topic model, i.e., solve for θ_d and φ_k with Gibbs sampling [49]. After that, we can generate the topic features with the solved parameters.

3.
Topic feature generation. The topic-word matrix φ is as follows:

    φ = [ p_11  p_12  ...  p_1v
          p_21  p_22  ...  p_2v
          ...
          p_K1  p_K2  ...  p_Kv ],

where p_ij represents the probability of assigning the j-th word to the i-th topic, v is the total number of words, and K is the total number of topics.
We take the m (m < v) words with the highest probability from each row (p_i1, p_i2, ..., p_iv) of the topic-word matrix. That is, the set of feature words and the feature matrix are obtained by taking m words under each of the K topics, as shown in Equations (11) and (12), respectively,
where n is the number of non-repeating feature words (n ≤ K · m).
4. Similarity calculation. For each word w_i, i ∈ {1, 2, · · · , n}, in the set of feature words W_e, we take its maximum value in the feature matrix feature_e as the eigenvalue v_i of that word, which yields the feature vector of the entity.
For the entities e_a and e_b, their feature vectors are V_{e_a} and V_{e_b}, respectively. We use cosine similarity between the feature vectors of different entities to obtain the similarity between their contexts. The contextual similarity of e_a and e_b is calculated as

sim(e_a, e_b) = (V_{e_a} · V_{e_b}) / (‖V_{e_a}‖ ‖V_{e_b}‖).
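The eigenvalue extraction and the cosine similarity can be sketched as follows, assuming both entities' vectors are built over the same feature-word set W_e (the names and the toy matrix are illustrative):

```python
import math

def feature_vector(words, phi, vocab):
    """Eigenvalue of each feature word = its maximum probability over all topics."""
    index = {w: j for j, w in enumerate(vocab)}
    return [max(row[index[w]] for row in phi) for w in words]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

phi = [[0.5, 0.3, 0.1, 0.1],
       [0.1, 0.1, 0.4, 0.4]]
vocab = ["city", "capital", "movie", "actor"]
W_e = ["city", "movie"]
v_a = feature_vector(W_e, phi, vocab)  # [0.5, 0.4]
print(cosine(v_a, v_a))                # close to 1.0
```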

Experiment
To verify the performance of our CFEA model, we conducted thorough experiments. We describe the dataset and experimental setting in Sections 4.1 and 4.2, respectively. Then, the evaluation metrics are described in Section 4.3. Finally, we discuss the performance of CFEA in Section 4.4.

Dataset
We downloaded Zhishi.me (http://openkg.cn/dataset/zhishi-me-dump, accessed on 17 December 2021), a real-world dataset collection for Chinese encyclopedia knowledge bases, from openkg.cn. Due to the large size of the dataset, we extracted the part of the data whose original sources are Wikipedia (Chinese version), Baidu Baike, and Hudong Baike. The extracted information includes entities, attributes, context, etc., and contains approximately 300k entities, 2000k attribute records, and 280k pieces of context. The basic statistics of the datasets are presented in Table 2.

Experiment Setting
To identify the entity pairs in Wikipedia, Baidu Baike, and Hudong Baike that point to the same real-world entities, we use the model to make this judgment automatically. Following the conventions of related experiments, we take entities from two encyclopedias one by one and combine them into entity pairs. These pairs include all possible matching entity pairs as well as a large number of unmatched ones. The models then process the pairs, deciding pair by pair whether each pair points to the same real-world entity. The alignment results are recorded and compared with the answer file to calculate precision, recall, and the other metrics.
We also reproduced several baseline models (introduced later) and tested them to explore their effectiveness. We processed the dataset in a Python 3.8 environment using relevant NLP libraries such as Gensim. For Zhishi.me, the parameter settings of the CFEA model are shown in Table 3; value1 is for the Baidu-Hudong dataset and value2 for Wikipedia-Baidu. The meaning of each parameter is listed in Table 1. The baselines for the comparative experiments are as follows:
• ED (edit distance): Matches attributes of entities based on edit distance and aligns entities based on the resulting similarity.
• Weighted ED: Weights the attributes, treating the URL of the entity as especially important, and then conducts alignment based on the weighted entity attributes.
• ED-TFIDF: A two-stage entity alignment model, similar to CFEA. First, it matches attribute information based on weighted edit distance. Then, it calculates the TF-IDF value of each word in the context and uses the TF-IDF values of all words as the feature vector of that context. This text modeling approach considers word-frequency features but carries no semantic information.
• LDA [24,43]: This baseline uses LDA to model the context of entities and calculates the similarities between entities in different knowledge bases on that basis. Bhattacharya et al. [24] used the LDA model for entity resolution; we apply LDA as a baseline and run it on the dataset of this paper. In addition, the literature [43] made certain improvements to the LDA model and applied it to the entity alignment task for encyclopedia knowledge bases.
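For reference, the edit-distance similarity underlying the ED baselines can be sketched as follows. This is a standard Levenshtein implementation; the normalization into [0, 1] is one common choice, not necessarily the exact scheme used in the baselines.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ed_similarity(a, b):
    """Normalized similarity in [0, 1] for comparing attribute values."""
    if not a and not b:
        return 1.0
    return 1 - edit_distance(a, b) / max(len(a), len(b))

print(edit_distance("kitten", "sitting"))  # 3
```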

Evaluation Metrics
We use precision, recall, and F_1 score as evaluation metrics.

1. Precision:

Precision = N_T / (N_T + N_F),

where N_T is the number of correctly aligned entity pairs in the experimental results, and N_F is the number of incorrectly aligned entity pairs. Precision indicates the fraction of positive instances among the retrieved instances.

2. Recall:

Recall = N_T / N_A,

where N_A is the number of all alignable entities in the dataset. Recall is the fraction of relevant instances that were retrieved.

3. F_1 is a composite measure of precision and recall:

F_1 = 2 · Precision · Recall / (Precision + Recall).
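These three metrics can be computed directly from the counts N_T, N_F, and N_A; the function name below is ours, and the formulas are the standard definitions consistent with this section.

```python
def prf(n_t, n_f, n_a):
    """Precision, recall, and F1 from alignment counts.

    n_t: correctly aligned pairs in the results (N_T)
    n_f: incorrectly aligned pairs in the results (N_F)
    n_a: all alignable pairs in the gold answer (N_A)
    """
    precision = n_t / (n_t + n_f)
    recall = n_t / n_a
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(80, 20, 100)  # precision 0.8, recall 0.8
```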

Experiment Results
In order to verify the effectiveness of the CFEA model, we conducted comparison experiments against several baselines. Moreover, we explored the effect of the number-of-topics parameter K and the LDA similarity threshold ω by running CFEA with different values. The experimental results and analysis are given below.

Table 4 shows that the proposed method obtains competitive results on both datasets. For the Baidu-Hudong dataset, our method outperforms the compared algorithms on both the F1 and recall metrics; on the Wiki-Baidu dataset, our method performs second best. These results demonstrate the effectiveness of the proposed method. Table 4 also shows that the two-stage models perform better than the single-stage models in most cases. This might be because more information (i.e., abstracts and descriptions) is used in the two-stage models.

To analyze the influence of the number of topics, we experimentally explored how the results change for different values of K. The experimental results are shown in Figure 6. When K is set to 14, the model obtains the best F_1 on Baidu-Hudong. When K is less than 14, precision is poor and recall is high, but F_1 is clearly worse. This might be because with too few topics, each topic carries multiple layers of semantic information; the LDA model then easily matches entity pairs whose texts are superficially similar in topic but that do not actually correspond, which lowers precision. When K is large, precision gradually increases, but recall drops sharply. This might be because too many topics without clear semantic content make topic-model matching more stringent: only very similar texts can be matched, and the rest are discarded.
The Wikipedia-Baidu dataset shows a similar pattern; the only difference is that its optimal K is 16.
We next examine the impact of the LDA similarity threshold on the overall model. We varied the threshold under the otherwise optimal parameter setting: only similarity scores that reach the threshold are regarded as aligned; otherwise, the pair is judged as not aligned. The relevant experimental results are shown in Figure 7. The CFEA model obtains the best F_1 score when ω is 0.7. As the parameter increases from 0.1, the precision of the model gradually increases, but recall drops considerably. The reason may be that a high threshold makes the model too strict about the topical similarity of the text; precision improves, but at the cost of a sharp drop in recall. As a result, the F_1 score first rises and then gradually decreases after reaching its peak. In brief, the LDA threshold strongly affects the balance between precision and recall, so a suitable threshold should be chosen to strike that balance.

Conclusions and Future Work
In this paper, we proposed a coarse-to-fine entity alignment model for Chinese heterogeneous encyclopedia knowledge bases with three stages: a coarse-to-fine process that combines the advantages of a pruning strategy, attribute similarity-based methods, and context modeling methods for entity alignment. Experimental results on real-world datasets showed that our proposed model outperforms several other algorithms.
In future research, the following topics are worthy of further study. First, the blocking-based pruning algorithm needs further improvement: the blocking algorithm based on the multi-granularity preset attributes proposed in this paper requires too much manual participation, so a blocking algorithm with less manual involvement is a topic worth studying. Moreover, LDA is a bag-of-words model that does not consider word order and cannot dig deeper into semantic information; therefore, the effectiveness of context modeling in this field has room for further improvement.

Data Availability Statement: Not applicable; the study does not report any data.