Coarse-to-Fine Entity Alignment for Chinese Heterogeneous Encyclopedia Knowledge Base
Abstract
1. Introduction
- We design a three-stage unsupervised entity alignment algorithm inspired by the human “large-scale-first” cognitive mechanism. It gradually prunes and matches candidate entity pairs from coarse granularity to fine granularity. We use simple methods in the first and second stages, which greatly reduces the number of entity pairs that need to be compared, thereby improving both the efficiency and the accuracy of the algorithm through pruning.
- We combine different types of information to improve performance. In the three stages of the model, three kinds of information are used: entity type information, attribute information, and text information. As the difficulty of identifying entity pairs increases, the amount of information used and the complexity of the model grow from the first stage through the third.
- We present an experimental study of a real Chinese encyclopedia dataset containing entity attributes and contexts. The experimental results demonstrate that our algorithm outperforms baseline methods.
2. Related Work
2.1. Collective Alignment
- Sample from the Dirichlet prior to get the topic distribution of document d.
- Sample from that topic distribution to get the topic of word n in document d.
- Sample from the Dirichlet prior to get the word distribution of topic k.
- Sample from that word distribution to get word n. The document is generated after repeating the above sampling steps for every word position.
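The generative process above can be sketched in a few lines of Python; the toy vocabulary, topic count, and hand-fixed topic–word distributions below are illustrative assumptions, not values from the paper:

```python
import random

def sample_dirichlet(alpha):
    """Draw one sample from a Dirichlet via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, vocab, alpha, topic_word_dists):
    """Generate one document following the LDA generative process."""
    # 1. Sample the document's topic distribution theta_d ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. Sample a topic z for this word position from theta_d.
        z = random.choices(range(len(theta)), weights=theta)[0]
        # 3. Sample the word from the chosen topic's word distribution phi_z.
        words.append(random.choices(vocab, weights=topic_word_dists[z])[0])
    return words

vocab = ["apple", "orange", "stock", "bond"]
K = 2
alpha = [0.5] * K
# phi_k: topic-word distributions, fixed by hand here for illustration.
phi = [[0.45, 0.45, 0.05, 0.05],   # "fruit" topic
       [0.05, 0.05, 0.45, 0.45]]   # "finance" topic
doc = generate_document(10, vocab, alpha, phi)
```

In the paper's setting these distributions are not fixed by hand but estimated from the corpus via Gibbs sampling.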
2.2. Similarity-Based Alignment
- Token-based similarity calculation. This method uses some function to convert the text string to be matched into a collection of substrings, called a collection of tokens, and then uses a similarity function to compute the similarity of the two token sets to be matched, such as the Jaccard similarity function, the cosine similarity function, and the q-gram similarity function. Different similarity functions have different characteristics.
- LCS-based similarity calculation. This method finds the longest common subsequence of two text sequences, and the ratio of its length to the length of the original sequences is used as the similarity.
- Edit distance-based similarity calculation. This method treats the text string to be matched as a whole, and the minimum cost of the editing operations required to convert one string into the other is used as a measure of the similarity of the two strings. Basic editing operations include insertion, deletion, substitution, and transposition.
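For concreteness, minimal implementations of the three families of similarity measures described above (Jaccard over token sets, LCS-based, and edit-distance-based) might look like this:

```python
def jaccard(a_tokens, b_tokens):
    """Token-based similarity: |A ∩ B| / |A ∪ B|."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 1.0

def lcs_similarity(s, t):
    """LCS-based similarity: LCS length over the longer sequence's length."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if s[i] == t[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n) if max(m, n) else 1.0

def edit_distance(s, t):
    """Levenshtein distance: minimum insertions/deletions/substitutions."""
    m, n = len(s), len(t)
    dp = list(range(n + 1))          # one rolling row of the DP table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds the diagonal cell
        for j in range(1, n + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (s[i - 1] != t[j - 1]))
    return dp[n]
```

For example, `lcs_similarity("abcde", "ace")` is 3/5 = 0.6, and `edit_distance("kitten", "sitting")` is 3.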
2.3. Alignment for Chinese Encyclopedia
3. CFEA Model
3.1. Preliminaries and Task Description
3.2. Overview of CFEA
3.3. Coarse-Grained Blocking Algorithm
Algorithm 1: Coarse-to-fine entity alignment algorithm
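Based on the descriptions in Sections 3.3–3.5, the coarse-to-fine pipeline can be sketched as below; the helper functions, the stand-in similarity measures, the data layout, and the reuse of the threshold u in stage 3 are illustrative assumptions, not the paper's exact Algorithm 1:

```python
def same_block(e1, e2):
    """Stage 1 (coarse): block by entity type; non-matching types are pruned."""
    return e1["type"] == e2["type"]

def attribute_similarity(e1, e2):
    """Stage 2 (middle): Jaccard over (name, value) attribute pairs, as a stand-in."""
    a, b = set(e1["attrs"].items()), set(e2["attrs"].items())
    return len(a & b) / len(a | b) if a | b else 0.0

def context_topic_similarity(e1, e2):
    """Stage 3 (fine): cosine similarity of precomputed topic feature vectors."""
    x, y = e1["topics"], e2["topics"]
    dot = sum(p * q for p, q in zip(x, y))
    nx, ny = sum(p * p for p in x) ** 0.5, sum(q * q for q in y) ** 0.5
    return dot / (nx * ny) if nx and ny else 0.0

def cfea_align(kb1, kb2, u=0.7, v=0.4):
    """Three-stage coarse-to-fine alignment sketch."""
    aligned, undecided = [], []
    # Stage 1: cheap blocking drastically reduces the candidate pairs.
    candidates = [(e1, e2) for e1 in kb1 for e2 in kb2 if same_block(e1, e2)]
    # Stage 2: upper/lower thresholds split confident matches,
    # confident non-matches, and hard cases.
    for e1, e2 in candidates:
        s = attribute_similarity(e1, e2)
        if s >= u:
            aligned.append((e1, e2))      # confident match
        elif s >= v:
            undecided.append((e1, e2))    # defer to stage 3
        # s < v: pruned as a confident non-match
    # Stage 3: the expensive topic model runs only on the hard cases.
    for e1, e2 in undecided:
        if context_topic_similarity(e1, e2) >= u:
            aligned.append((e1, e2))
    return aligned

kb1 = [{"name": "A", "type": "person",
        "attrs": {"born": "1990", "job": "x", "city": "B"},
        "topics": [1.0, 0.0]}]
kb2 = [{"name": "A2", "type": "person",
        "attrs": {"born": "1990", "job": "x", "city": "C"},
        "topics": [0.9, 0.1]}]
matches = cfea_align(kb1, kb2)
```

The key design point is that the cheap stages shrink the candidate set, so the costly stage-3 comparison touches only the pairs the earlier stages could not decide.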
3.4. Middle-Grained Attribute Similarity
- Definition 3 (Attributes of the encyclopedia knowledge base). After the preprocessing of the dataset, entity $e_1$ has a set of attribute names $A_1 = \{a_{11}, a_{12}, \ldots, a_{1p}\}$, with the corresponding set of attribute values $V_1 = \{v_{11}, v_{12}, \ldots, v_{1p}\}$. Entity $e_2$'s set of attribute names is $A_2 = \{a_{21}, a_{22}, \ldots, a_{2q}\}$, and its set of attribute values is $V_2 = \{v_{21}, v_{22}, \ldots, v_{2q}\}$, where $p$ and $q$ are the numbers of attributes of entities $e_1$ and $e_2$, respectively.
- Attribute names $a_{1i}$ and $a_{2j}$ are considered matched when they satisfy the matching conditions.
- For each matched attribute-name pair $(a_{1i}, a_{2j})$, the similarity of the attribute pair is computed from its values, e.g., $sim(v_{1i}, v_{2j}) = 1 - \mathrm{Lev}(v_{1i}, v_{2j}) / \max(|v_{1i}|, |v_{2j}|)$, where $\mathrm{Lev}$ denotes the Levenshtein distance.
- The attribute similarity between entities $e_1$ and $e_2$ is then calculated as the weighted average of the similarities of all matched attribute pairs.
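Under the definitions above, a minimal sketch of the middle-grained attribute similarity might look as follows; the exact name-matching condition, the normalized edit-distance form, and the way the URL weight (the paper's parameter a) enters the average are assumptions, not the paper's exact formulas:

```python
def lev(s, t):
    """Levenshtein distance between strings s and t."""
    m, n = len(s), len(t)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (s[i - 1] != t[j - 1]))
    return dp[n]

def value_similarity(v1, v2):
    """Normalized edit-distance similarity of two attribute values."""
    longest = max(len(v1), len(v2))
    return 1.0 - lev(v1, v2) / longest if longest else 1.0

def attribute_similarity(attrs1, attrs2, url_weight=2.0):
    """Weighted average of value similarities over matched attribute names.

    Names are 'matched' here when equal; the URL attribute receives a
    larger weight (how the paper's parameter a governs this weighting
    is an assumption in this sketch).
    """
    matched = set(attrs1) & set(attrs2)
    if not matched:
        return 0.0
    total = weight_sum = 0.0
    for name in matched:
        w = url_weight if name == "url" else 1.0
        total += w * value_similarity(attrs1[name], attrs2[name])
        weight_sum += w
    return total / weight_sum
```

For example, `value_similarity("abc", "abd")` is 1 − 1/3 ≈ 0.667, and two entities with identical matched values score 1.0.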
3.5. Fine-Grained Context Topicalization
- Data acquisition and pre-processing. In this part, we obtained corpora from Baidu Baike and Hudong Baike. These corpora include the abstracts and descriptions of encyclopedic entities. We then perform Chinese word segmentation and filter out stopwords. We compiled four stopword lists (https://github.com/goto456/stopwords, accessed on 17 December 2021), cn-stopwords, hit-stopwords, baidu-stopwords, and scu-stopwords, from different companies or universities, together with our own supplements.
- Topic modeling and parameter estimation. LDA modeling finds the topic distribution of each document and the word distribution of each topic. LDA assumes that the prior document–topic distribution is a Dirichlet distribution; that is, for any document d, its topic distribution is $\theta_d \sim \mathrm{Dirichlet}(\alpha)$. Furthermore, LDA assumes that the prior topic–word distribution is also a Dirichlet distribution; that is, for any topic k, its word distribution is $\phi_k \sim \mathrm{Dirichlet}(\beta)$. Based on these Dirichlet priors, we conduct topic modeling on the pre-processed corpus and then perform parameter estimation for the topic model, i.e., solve for the parameters with Gibbs sampling [49]. After that, we generate the topic features from the estimated parameters.
- Topic feature generation. From the topic–word matrix $\phi$, we take the m (m < N) words with the highest probability in each row. That is, the set of feature words and the feature matrix are obtained by taking m words under each of the K topics, as shown in Equations (11) and (12), respectively.
- Similarity calculation. For each word in the set of feature words, we take its maximum value in the feature matrix as the feature weight of that word, which yields the feature vector. For two entities, their feature vectors are obtained in this way, and we use cosine similarity on the feature vectors of the different entities to obtain the contextual similarity between them.
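The feature generation and similarity steps can be sketched as follows, assuming a topic–word matrix has already been estimated; the toy matrix, vocabulary, and choice of m below are illustrative, not values from the paper:

```python
def topic_features(phi, vocab, m):
    """Take the m highest-probability words per topic; each feature word's
    weight is its maximum probability across all topics."""
    feature_words = set()
    for row in phi:                                   # one row per topic
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:m]
        feature_words.update(vocab[j] for j in top)
    # Feature value of a word = its max probability over all topics.
    return {w: max(row[vocab.index(w)] for row in phi)
            for w in sorted(feature_words)}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def context_similarity(feat1, feat2):
    """Align the two feature dicts on a shared word order, then compare."""
    words = sorted(set(feat1) | set(feat2))
    v1 = [feat1.get(w, 0.0) for w in words]
    v2 = [feat2.get(w, 0.0) for w in words]
    return cosine(v1, v2)

phi = [[0.5, 0.3, 0.2],      # toy 2-topic, 3-word topic-word matrix
       [0.1, 0.2, 0.7]]
vocab = ["economy", "market", "fruit"]
features = topic_features(phi, vocab, m=2)
```

Two entities whose contexts yield identical feature vectors have contextual similarity 1.0; disjoint feature words give 0.0.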
4. Experiment
4.1. Dataset
4.2. Experiment Setting
- ED (edit distance): Matching attributes of entities based on edit distance and aligning them based on similarity.
- Weighted ED: Attributes are weighted, with the entity’s URL treated as a particularly important attribute. Alignment experiments are then conducted based on the weighted entity attributes.
- ED-TFIDF: A two-stage entity alignment model, similar to CFEA. First, it matches attribute information based on the weighted edit distance. Then, it calculates the TF-IDF value of each word in the context and uses the TF-IDF values of all words as the feature vector of that context. This text modeling approach captures word-frequency features but carries no semantic information.
- LDA [24,43]: This baseline uses LDA to model the contexts of entities and calculates the similarities between entities in different knowledge bases on that basis. Bhattacharya et al. [24] used the LDA model in entity resolution; we apply LDA as a baseline and run it on the dataset of this paper. In addition, [43] made certain improvements to the LDA model and applied it to the entity alignment task for encyclopedia knowledge bases.
- LCS [39]: This uses the entity’s attributes to calculate the similarity based on LCS and determine whether the entity can be aligned or not.
- Weighted LCS [40]: This performs entity alignment based on LCS after weighting the attributes.
- LCS-TFIDF: A two-stage entity alignment model that uses TF-IDF to model the contexts in the second stage.
- LCS-LDA [41]: Also a two-stage entity alignment method; the combination of LCS and LDA was shown to be effective on its original dataset.
- WA-Word2Vec [44]: An entity alignment method with weighted average Word2Vec. It uses LTP (language technology platform) to identify the named entities in the text, then derives word vectors and feature vectors using Word2Vec. Finally, the cosine similarity is calculated to align the entities.
- MED-Doc2Vec [45]: A multi-information weighted fusion entity alignment algorithm. It uses the minimum edit distance and the Doc2vec model to obtain feature vectors containing semantic information, then obtains the entity’s comprehensive similarity by weighted averaging to complete the entity alignment task.
4.3. Evaluation Metrics
- Precision
- Recall
- F1 score: a composite measure of precision and recall.
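These metrics can be computed directly over the predicted and gold sets of aligned entity pairs; a minimal sketch:

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over predicted vs. gold aligned pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, if one of two predicted pairs is correct and the gold set has two pairs, precision, recall, and F1 are all 0.5.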
4.4. Experiment Results
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Wu, X.; Zhu, X.; Wu, G.Q.; Ding, W. Data mining with big data. IEEE Trans. Knowl. Data Eng. 2014, 26, 97–107.
- Shao, J.; Bu, C.; Ji, S.; Wu, X. A Weak Supervision Approach with Adversarial Training for Named Entity Recognition. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence, Hanoi, Vietnam, 8–12 November 2021; pp. 17–30.
- Bu, C.; Yu, X.; Hong, Y.; Jiang, T. Low-Quality Error Detection for Noisy Knowledge Graphs. J. Database Manag. 2021, 32, 48–64.
- Jiang, Y.; Wu, G.; Bu, C.; Hu, X. Chinese Entity Relation Extraction Based on Syntactic Features. In Proceedings of the 2018 IEEE International Conference on Big Knowledge, ICBK 2018, Singapore, 17–18 November 2018; pp. 99–105.
- Li, J.; Bu, C.; Li, P.; Wu, X. A coarse-to-fine collective entity linking method for heterogeneous information networks. Knowl.-Based Syst. 2021, 228, 107286.
- Wu, X.; Jiang, T.; Zhu, Y.; Bu, C. Knowledge Graph for China’s Genealogy. IEEE Trans. Knowl. Data Eng. 2021, 1, 1.
- Suchanek, F.M.; Kasneci, G.; Weikum, G. Yago: A large ontology from wikipedia and wordnet. J. Web Semant. 2008, 6, 203–217.
- Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 10–12 June 2008; pp. 1247–1250.
- Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P.N.; Hellmann, S.; Morsey, M.; Van Kleef, P.; Auer, S.; et al. Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semant. Web 2015, 6, 167–195.
- Philpot, A.; Hovy, E.; Pantel, P. The omega ontology. In Proceedings of the OntoLex 2005-Ontologies and Lexical Resources, Jeju Island, Korea, 15 October 2005.
- Xu, B.; Xu, Y.; Liang, J.; Xie, C.; Liang, B.; Cui, W.; Xiao, Y. CN-DBpedia: A never-ending Chinese knowledge extraction system. In Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Arras, France, 27–30 June 2017; pp. 428–438.
- Wang, Z.; Li, J.; Wang, Z.; Li, S.; Li, M.; Zhang, D.; Shi, Y.; Liu, Y.; Zhang, P.; Tang, J. XLore: A Large-scale English-Chinese Bilingual Knowledge Graph. In Proceedings of the International Semantic Web Conference (Posters & Demos), Sydney, Australia, 23 October 2013; Volume 1035, pp. 121–124.
- Jiang, T.; Bu, C.; Zhu, Y.; Wu, X. Combining embedding-based and symbol-based methods for entity alignment. Pattern Recognit. 2021, 2021, 108433.
- Jiang, T.; Bu, C.; Zhu, Y.; Wu, X. Two-Stage Entity Alignment: Combining Hybrid Knowledge Graph Embedding with Similarity-Based Relation Alignment. In PRICAI 2019: Trends in Artificial Intelligence—16th Pacific Rim International Conference on Artificial Intelligence, Cuvu, Yanuca Island, Fiji, 26–30 August 2019, Proceedings, Part I; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2019; Volume 11670, pp. 162–175.
- Yan, Z.; Guoliang, L.; Jianhua, F. A survey on entity alignment of knowledge base. J. Comput. Res. Dev. 2016, 53, 165.
- Newcombe, H.B.; Kennedy, J.M.; Axford, S.; James, A.P. Automatic linkage of vital records. Science 1959, 130, 954–959.
- Fellegi, I.P.; Sunter, A.B. A theory for record linkage. J. Am. Stat. Assoc. 1969, 64, 1183–1210.
- Herzog, T.N.; Scheuren, F.J.; Winkler, W.E. Data Quality and Record Linkage Techniques; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2007.
- Dong, X.; Halevy, A.; Madhavan, J. Reference reconciliation in complex information spaces. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Baltimore, MD, USA, 14–16 June 2005; pp. 85–96.
- Bhattacharya, I.; Getoor, L. Collective entity resolution in relational data. ACM Trans. Knowl. Discov. Data 2007, 1, 5.
- Maratea, A.; Petrosino, A.; Manzo, M. Extended Graph Backbone for Motif Analysis. In Proceedings of the 18th International Conference on Computer Systems and Technologies, Ruse, Bulgaria, 23–24 June 2017; pp. 36–43.
- Pasula, H.; Marthi, B.; Milch, B.; Russell, S.J.; Shpitser, I. Identity uncertainty and citation matching. Presented at the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–13 December 2003; pp. 1425–1432. Available online: http://people.csail.mit.edu/milch/papers/nipsnewer.pdf (accessed on 17 December 2021).
- Tang, J.; Li, J.; Liang, B.; Huang, X.; Li, Y.; Wang, K. Using Bayesian decision for ontology mapping. J. Web Semant. 2006, 4, 243–262.
- Bhattacharya, I.; Getoor, L. A latent dirichlet model for unsupervised entity resolution. In Proceedings of the 2006 SIAM International Conference on Data Mining, SIAM, Bethesda, MD, USA, 20–22 April 2006; pp. 47–58.
- Hall, R.; Sutton, C.; McCallum, A. Unsupervised deduplication using cross-field dependencies. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 310–317.
- McCallum, A.; Wellner, B. Conditional models of identity uncertainty with application to noun coreference. Adv. Neural Inf. Process. Syst. 2004, 17, 905–912.
- Domingos, P. Multi-relational record linkage. In Proceedings of the KDD-2004 Workshop on Multi-Relational Data Mining, Citeseer, Washington, DC, USA, 22 August 2004.
- Singla, P.; Domingos, P. Entity resolution with markov logic. In Proceedings of the Sixth International Conference on Data Mining (ICDM’06), Hong Kong, China, 18–22 December 2006; pp. 572–582.
- Rastogi, V.; Dalvi, N.; Garofalakis, M. Large-scale collective entity matching. arXiv 2011, arXiv:1103.2410.
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
- Stoilos, G.; Venetis, T.; Stamou, G. A fuzzy extension to the OWL 2 RL ontology language. Comput. J. 2015, 58, 2956–2971.
- Sleeman, J.; Finin, T. Computing foaf co-reference relations with rules and machine learning. In Proceedings of the Third International Workshop on Social Data on the Web, Tokyo, Japan, 29 November 2010.
- Zheng, Z.; Si, X.; Li, F.; Chang, E.Y.; Zhu, X. Entity disambiguation with freebase. In Proceedings of the 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Macau, China, 4–7 December 2012; Volume 1, pp. 82–89.
- Navarro, G. A guided tour to approximate string matching. ACM Comput. Surv. 2001, 33, 31–88.
- Smith, T.F.; Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981, 147, 195–197.
- Waterman, M.S.; Smith, T.F.; Beyer, W.A. Some biological sequence metrics. Adv. Math. 1976, 20, 367–387.
- Winkler, W.E.; Thibaudeau, Y. An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 US Decennial Census; Citeseer: Washington, DC, USA, 1991.
- Winkler, W.E. Overview of Record Linkage and Current Research Directions; Bureau of the Census, Citeseer: Washington, DC, USA, 2006.
- Raimond, Y.; Sutton, C.; Sandler, M.B. Automatic Interlinking of Music Datasets on the Semantic Web. In Proceedings of the Linked Data on the Web Workshop (LDOW), Beijing, China, 28 April 2008.
- Xiaohui, Z.; Haihua, J.; Ruihua, D. Property Weight Based Co-reference Resolution for Linked Data. Comput. Sci. 2013, 40, 40–43.
- Junfu, H.; Tianrui, L.; Zhen, J.; Yunge, J.; Tao, Z. Entity alignment of Chinese heterogeneous encyclopedia knowledge base. J. Comput. Appl. 2016, 36, 1881–1886.
- Weili, Z.; Yanlei, H.; Xiao, L. Instance Alignment Algorithm Between Encyclopedia Based on Semi-supervised Co-training. Comput. Mod. 2017, 12, 88–93.
- Zhenpeng, L.; Mengjie, H.; Bin, Z.; Jing, D.; Jianmin, X. Entity alignment for encyclopedia knowledge base based on topic model. Appl. Res. Comput. 2019, 11, 1–8.
- Yumin, L.; Dan, L.; Kai, Y.; Hongsen, Z. Weighted average Word2Vec entity alignment method. Comput. Eng. Des. 2019, 7, 1927–1933.
- Jianhong, M.; Shuangyao, L.; Jun, Y. Multi-information Weighted Fusion Entity Alignment Algorithm. Comput. Appl. Softw. 2021, 7, 295–301.
- Sun, M.; Zhu, H.; Xie, R.; Liu, Z. Iterative Entity Alignment Via Joint Knowledge Embeddings. In Proceedings of the International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
- Pedrycz, W. Granular Computing: Analysis and Design of Intelligent Systems; CRC Press: Boca Raton, FL, USA, 2018.
- Christen, P. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 2011, 24, 1537–1555.
- Griffiths, T. Gibbs Sampling in the Generative Model of Latent Dirichlet Allocation. 2002. Available online: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.138.3760 (accessed on 17 December 2021).
| Notation | Description |
| --- | --- |
| e | entity |
|  | knowledge base |
| E | entity set of the knowledge base dataset |
|  | aligned set of the knowledge base dataset |
| a | weight of the URL among the entity’s attributes |
| c | cutoff threshold of matched attributes |
| u | upper threshold in attribute matching |
| v | lower threshold in attribute matching |
|  | threshold in the LDA model |
| K | number of topics |
| N | number of words |
| D | number of documents |
| α | a priori parameter of the LDA model |
| β | a priori parameter of the LDA model |
| z_n | topic of word n |
| w_n | word with index n |
| θ_d | document–topic distribution for document d |
| φ_k | topic–word distribution for topic k |
| Lev(a, b) | Levenshtein distance between string a and string b |
|  | set of feature words of entity e |
|  | feature probability matrix of entity e |
|  | feature vector of entity e |
| | Baidu–Hudong | Wikipedia–Baidu |
| --- | --- | --- |
| entity | 29,764 / 170,260 | 11,975 / 87,642 |
| attribute | 277,945 / 1,691,591 | 108,400 / 841,495 |
| context | 27,072 / 152,713 | 11,965 / 78,966 |
| aligned entity | 12,511 | 314 |
| Parameter | Value1 | Value2 |
| --- | --- | --- |
| a | 0.5 | 0.5 |
| c | 40 | 40 |
| u | 0.7 | 0.5 |
| v | 0.4 | 0.2 |
|  | 0.7 | 0.7 |
| K | 14 | 16 |
| | Method | F1 (Baidu–Hudong) | Precision | Recall | F1 (WiKi–Baidu) | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| single-stage methods | ED | 0.8815 | 0.9695 | 0.8082 | 0.6039 | 0.5087 | 0.7429 |
| | weight ED | 0.9143 | 0.9714 | 0.8635 | 0.7581 | 0.7705 | 0.7460 |
| | LDA | 0.7637 | 0.7330 | 0.7971 | 0.4449 | 0.3425 | 0.6349 |
| | LCS | 0.9020 | 0.9716 | 0.8418 | 0.7199 | 0.8153 | 0.6444 |
| | weight LCS | 0.9062 | 0.9726 | 0.8482 | 0.7542 | 0.7265 | 0.7841 |
| two-stage methods | weight ED+TFIDF | 0.9409 | 0.9697 | 0.9138 | 0.7586 | 0.7857 | 0.7333 |
| | weight LCS+TFIDF | 0.9388 | 0.9687 | 0.9106 | 0.7633 | 0.7147 | 0.8190 |
| | weight LCS+LDA | 0.9432 | 0.9669 | 0.9206 | 0.8006 | 0.8045 | 0.7968 |
| | WA-Word2Vec | 0.8062 | 0.7437 | 0.8801 | 0.6590 | 0.8382 | 0.5429 |
| | MED-Doc2vec | 0.9464 | 0.9691 | 0.9248 | 0.8188 | 0.9073 | 0.7460 |
| proposed method | CFEA | 0.9472 | 0.9680 | 0.9273 | 0.8099 | 0.8448 | 0.7778 |
Share and Cite
Wu, M.; Jiang, T.; Bu, C.; Zhu, B. Coarse-to-Fine Entity Alignment for Chinese Heterogeneous Encyclopedia Knowledge Base. Future Internet 2022, 14, 39. https://doi.org/10.3390/fi14020039