Multi-Modal Entity Alignment Based on Enhanced Relationship Learning and Multi-Layer Feature Fusion
Abstract
1. Introduction
- We propose a vision-guided negative sample generation module that combines nearest-neighbor negative sampling with contrastive learning. The module generates negative samples that remain semantically related to the positive samples, strengthening the learning of relation representations.
- We propose a multi-level feature fusion strategy, incorporating a soft attention mechanism to adaptively capture the importance weights of modalities. Additionally, cross-attention and bidirectional cross-modal fusion methods are introduced to aggregate multi-granularity representations of entity modalities across three hierarchical levels.
- We conducted extensive experiments on several public datasets. The overall results show that our model, ERMF, outperforms the baseline models, demonstrating the effectiveness of our approach.
2. Related Work
2.1. Multi-Modal Knowledge Graphs
2.2. Multi-Modal Knowledge Graph Embedding
2.3. Multi-Modal Entity Alignment
3. Our Approach
3.1. Problem Definition
3.2. Overview
3.3. Multi-Modal Knowledge Embedding
3.3.1. Structure Embedding
3.3.2. Attribute Embedding
3.3.3. Image Embedding
3.3.4. Relation Embedding
3.3.5. Visual-Guided Negative Sampling
- The strategy of Algorithm 1 prioritizes entities whose visual similarity is close to, but below, the threshold as negative samples. This keeps the negative samples relevant to the positive samples while avoiding completely irrelevant noise. The logic is implemented in Steps 6–7 (see the sketch following Algorithm 1 below):
  - If the visual-similarity condition is met: replace current_ent with an entity that is visually dissimilar yet semantically related (i.e., one whose similarity lies just below the threshold), which strengthens relational embedding learning.
  - Otherwise: apply nearest-neighbor negative sampling (Steps 9–10): select the top-5 entities from entities_list with the highest cosine similarity to the current entity in the joint (structure + attribute) embedding space, then randomly choose one of them as the replacement. This preserves the semantic plausibility of the negative samples.
- The generated negative triples are added to the set, and the loop terminates early once the target neg_triples_num is reached (Steps 12–15). Finally, the negative triples for the current triple are merged into the final neg_batch (Step 16).
Algorithm 1 generate_neg_triples_visual_guided()
Input: pos_batch, all_triples_set, entities_list, entity_similar_dic, neg_triples_num
Output: List of negative triples (neg_batch)
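The following is a minimal Python sketch of Algorithm 1 under stated assumptions: entity_similar_dic maps each entity to a dict of precomputed visual cosine similarities, ent_emb is a NumPy matrix of joint structure + attribute embeddings index-aligned with entities_list, and visual_threshold stands in for the paper's similarity threshold. It is illustrative, not the authors' exact implementation.

```python
import random
import numpy as np

def generate_neg_triples_visual_guided(pos_batch, all_triples_set, entities_list,
                                        entity_similar_dic, neg_triples_num,
                                        ent_emb, visual_threshold=0.4):
    """Sketch of Algorithm 1: the head or tail of each positive triple is replaced
    either by a visually guided candidate (similarity just below the threshold)
    or, as a fallback, by a nearest neighbour in the joint embedding space."""
    neg_batch = []
    for (h, r, t) in pos_batch:
        neg_triples, attempts = [], 0
        while len(neg_triples) < neg_triples_num and attempts < 10 * neg_triples_num:
            attempts += 1
            current_ent, corrupt_head = (h, True) if random.random() < 0.5 else (t, False)

            # Steps 6-7: prefer candidates whose visual similarity to current_ent
            # is close to, but below, the threshold.
            visual_cands = [e for e, sim in entity_similar_dic.get(current_ent, {}).items()
                            if sim < visual_threshold]
            if visual_cands:
                visual_cands.sort(key=lambda e: entity_similar_dic[current_ent][e],
                                  reverse=True)
                new_ent = random.choice(visual_cands[:5])   # closest to the threshold
            else:
                # Steps 9-10: nearest-neighbour fallback -- top-5 entities by cosine
                # similarity in the joint embedding space, one chosen at random.
                vec = ent_emb[current_ent]
                sims = ent_emb @ vec / (np.linalg.norm(ent_emb, axis=1)
                                        * np.linalg.norm(vec) + 1e-9)
                top5 = np.argsort(-sims)[1:6]               # index 0 is the entity itself
                new_ent = int(entities_list[random.choice(top5)])

            neg = (new_ent, r, t) if corrupt_head else (h, r, new_ent)
            if neg not in all_triples_set and neg not in neg_triples:
                neg_triples.append(neg)                     # Steps 12-15: stop when enough
        neg_batch.extend(neg_triples)                       # Step 16: merge into neg_batch
    return neg_batch
```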
3.4. Multi-Modal Features Fusion
- 1. Bottom-layer Fusion: Modeling the Importance of Modalities.
- 2. Middle-layer Fusion: Modeling within Modalities and Interaction.
  - (1) Modeling Intra-modal Interaction
  - (2) Modeling Inter-modal Interaction
- 3. Top-layer Fusion: Feature Integration and Optimization
Algorithm 2 Multi_Layer_feature_fusion()
Input: Embedding vectors for each modality
Output: Final entity embeddings
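Below is a minimal PyTorch sketch of the three-layer fusion outlined above and summarized in Algorithm 2. The module layout, dimensions, the use of nn.MultiheadAttention for intra- and inter-modal attention, and the mean-pooled cross-modal context are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLayerFeatureFusion(nn.Module):
    """Sketch of Algorithm 2: bottom-layer soft attention over modalities,
    middle-layer intra-/inter-modal attention, top-layer integration."""
    def __init__(self, dim, num_modalities=4, heads=4):
        super().__init__()
        # Bottom layer: soft attention producing one importance weight per modality.
        self.modal_score = nn.Linear(dim, 1)
        # Middle layer: intra-modal self-attention and bidirectional cross-attention.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Top layer: integrate the concatenated representations into the final embedding.
        self.out_proj = nn.Linear(num_modalities * dim, dim)

    def forward(self, modal_embs):
        # modal_embs: list of [batch, dim] tensors, one per modality
        # (e.g., structure, attribute, image, relation).
        x = torch.stack(modal_embs, dim=1)                 # [batch, M, dim]

        # 1. Bottom layer: soft attention weights over modalities.
        w = F.softmax(self.modal_score(x), dim=1)          # [batch, M, 1]
        x = w * x                                           # re-weighted modalities

        # 2. Middle layer: intra-modal interaction via self-attention ...
        intra, _ = self.self_attn(x, x, x)
        # ... and inter-modal interaction via bidirectional cross-attention,
        # where each modality attends to a pooled cross-modal context and vice versa.
        ctx = x.mean(dim=1, keepdim=True).expand_as(x)
        inter_a, _ = self.cross_attn(intra, ctx, ctx)
        inter_b, _ = self.cross_attn(ctx, intra, intra)
        fused = intra + inter_a + inter_b

        # 3. Top layer: concatenate and project to the final entity embedding.
        out = self.out_proj(fused.flatten(start_dim=1))     # [batch, dim]
        return F.normalize(out, dim=-1)

# Usage sketch: four modality embeddings of dimension 128 for a batch of 32 entities.
fusion = MultiLayerFeatureFusion(dim=128)
embs = [torch.randn(32, 128) for _ in range(4)]
final_emb = fusion(embs)                                     # [32, 128]
```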
4. Optimization Objective
5. Experiments
5.1. Experimental Setup
5.2. Main Results
5.2.1. Comparison Experiments
5.2.2. Ablation Experiments
5.2.3. Parametric Analysis
5.2.4. Convergence Speed Comparison
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Wang, Y.; Sun, H.; Wang, J.; Wang, J.; Tang, W.; Qi, Q.; Sun, S.; Liao, J. Towards semantic consistency: Dirichlet energy driven robust multi-modal entity alignment. arXiv 2024, arXiv:2401.17859. [Google Scholar]
- Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P.N.; Hellmann, S.; Morsey, M.; Van Kleef, P.; Auer, S.; et al. Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semant. Web 2015, 6, 167–195. [Google Scholar] [CrossRef]
- Mahdisoltani, F.; Biega, J.; Suchanek, F.M. Yago3: A knowledge base from multilingual wikipedias. In Proceedings of the CIDR, Asilomar, CA, USA, 6–9 January 2013. [Google Scholar]
- Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 1247–1250. [Google Scholar]
- Jiang, Z.; Chi, C.; Zhan, Y. Research on medical question answering system based on knowledge graph. IEEE Access 2021, 9, 21094–21101. [Google Scholar] [CrossRef]
- Huang, X.; Zhang, J.; Li, D.; Li, P. Knowledge graph embedding based question answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, VIC, Australia, 11–15 February 2019; pp. 105–113. [Google Scholar]
- Zeng, Y.; Jin, Q.; Bao, T.; Li, W. Multi-modal knowledge hypergraph for diverse image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 3376–3383. [Google Scholar]
- Sun, R.; Cao, X.; Zhao, Y.; Wan, J.; Zhou, K.; Zhang, F.; Wang, Z.; Zheng, K. Multi-modal knowledge graphs for recommender systems. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, 19–23 October 2020; pp. 1405–1414. [Google Scholar]
- Chen, L.; Li, Z.; Wang, Y.; Xu, T.; Wang, Z.; Chen, E. MMEA: Entity alignment for multi-modal knowledge graph. In Knowledge Science, Engineering and Management, Proceedings of the 13th International Conference, KSEM 2020, Hangzhou, China, 28–30 August 2020; Proceedings, Part I 13; Springer: Berlin/Heidelberg, Germany, 2020; pp. 134–147. [Google Scholar]
- Liu, F.; Chen, M.; Roth, D.; Collier, N. Visual pivoting for (unsupervised) entity alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 19–21 May 2021; Volume 35, pp. 4257–4266. [Google Scholar]
- Chen, L.; Li, Z.; Xu, T.; Wu, H.; Wang, Z.; Yuan, N.J.; Chen, E. Multi-modal siamese network for entity alignment. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 118–126. [Google Scholar]
- Liu, Y.; Li, H.; Garcia-Duran, A.; Niepert, M.; Onoro-Rubio, D.; Rosenblum, D.S. MMKG: Multi-modal knowledge graphs. In The Semantic Web, Proceedings of the 16th International Conference, ESWC 2019, Portorož, Slovenia, 2–6 June 2019; Proceedings 16; Springer: Berlin/Heidelberg, Germany, 2019; pp. 459–474. [Google Scholar]
- Chen, Z.; Chen, J.; Zhang, W.; Guo, L.; Fang, Y.; Huang, Y.; Zhang, Y.; Geng, Y.; Pan, J.Z.; Song, W.; et al. Meaformer: Multi-modal entity alignment transformer for meta modality hybrid. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3317–3327. [Google Scholar]
- Lin, Z.; Zhang, Z.; Wang, M.; Shi, Y.; Wu, X.; Zheng, Y. Multi-modal contrastive representation learning for entity alignment. arXiv 2022, arXiv:2209.00891. [Google Scholar]
- Wang, M.; Wang, H.; Qi, G.; Zheng, Q. Richpedia: A large-scale, comprehensive multi-modal knowledge graph. Big Data Res. 2020, 22, 100159. [Google Scholar] [CrossRef]
- Chaudhary, C.; Goyal, P.; Prasad, D.N.; Chen, Y.P.P. Enhancing the quality of image tagging using a visio-textual knowledge base. IEEE Trans. Multimed. 2019, 22, 897–911. [Google Scholar] [CrossRef]
- Huang, J.; Chen, Y.; Li, Y.; Yang, Z.; Gong, X.; Wang, F.L.; Xu, X.; Liu, W. Medical knowledge-based network for patient-oriented visual question answering. Inf. Process. Manag. 2023, 60, 103241. [Google Scholar] [CrossRef]
- Mondal, P.; Chakder, D.; Raj, S.; Saha, S.; Onoe, N. Graph convolutional neural network for multimodal movie recommendation. In Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing, Tallinn, Estonia, 27–31 March 2023; pp. 1633–1640. [Google Scholar]
- Xie, R.; Liu, Z.; Luan, H.; Sun, M. Image-embodied knowledge representation learning. arXiv 2016, arXiv:1609.07028. [Google Scholar]
- Wang, Z.; Li, L.; Li, Q.; Zeng, D. Multimodal data enhanced representation learning for knowledge graphs. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–8. [Google Scholar]
- Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. In Proceedings of the NIPS’13: Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; Volume 26. [Google Scholar]
- Moon, C.; Jones, P.; Samatova, N.F. Learning entity type embeddings for knowledge graph completion. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 2215–2218. [Google Scholar]
- Lu, X.; Wang, L.; Jiang, Z.; He, S.; Liu, S. MMKRL: A robust embedding approach for multi-modal knowledge graph representation learning. Appl. Intell. 2022, 52, 7480–7497. [Google Scholar] [CrossRef]
- Cheng, B.; Zhu, J.; Guo, M. MultiJAF: Multi-modal joint entity alignment framework for multi-modal knowledge graph. Neurocomputing 2022, 500, 581–591. [Google Scholar] [CrossRef]
- Zhu, J.; Huang, C.; De Meo, P. DFMKE: A dual fusion multi-modal knowledge graph embedding framework for entity alignment. Inf. Fusion 2023, 90, 111–119. [Google Scholar] [CrossRef]
- Wu, Y.; Liu, X.; Feng, Y.; Wang, Z.; Yan, R.; Zhao, D. Relation-Aware Entity Alignment for Heterogeneous Knowledge Graphs. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-2019, Macao, China, 10–16 August 2019; pp. 5278–5284. [Google Scholar] [CrossRef]
- Shi, Y.; Wang, M.; Zhang, Z.; Lin, Z.; Zheng, Y. Probing the Impacts of Visual Context in Multimodal Entity Alignment. In Proceedings of the Web and Big Data, Wuhan, China, 6–8 October 2023; Li, B., Yue, L., Tao, C., Han, X., Calvanese, D., Amagasa, T., Eds.; Springer: Cham, Switzerland, 2023; pp. 255–270. [Google Scholar]
- Xu, B.; Xu, C.; Su, B. Cross-modal graph attention network for entity alignment. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3715–3723. [Google Scholar]
- Chen, L.; Sun, Y.; Zhang, S.; Ye, Y.; Wu, W.; Xiong, H. Tackling Uncertain Correspondences for Multi-Modal Entity Alignment. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
- Hu, Z.; Gutiérrez-Basulto, V.; Xiang, Z.; Li, R.; Pan, J.Z. Leveraging Intra-modal and Inter-modal Interaction for Multi-Modal Entity Alignment. arXiv 2024, arXiv:2404.17590. [Google Scholar]
- Wang, L.; Qi, P.; Bao, X.; Zhou, C.; Qin, B. Pseudo-Label Calibration Semi-supervised Multi-Modal Entity Alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 9116–9124. [Google Scholar] [CrossRef]
- Guo, H.; Tang, J.; Zeng, W.; Zhao, X.; Liu, L. Multi-modal entity alignment in hyperbolic space. Neurocomputing 2021, 461, 598–607. [Google Scholar] [CrossRef]
- Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
- Li, C.; Cao, Y.; Hou, L.; Shi, J.; Li, J.; Chua, T.S. Semi-Supervised Entity Alignment via Joint Knowledge Embedding Model and Cross-Graph Model; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019. [Google Scholar]
- Xia, L.; Mi, S.; Zhang, J.; Luo, J.; Shen, Z.; Cheng, Y. Dual-Stream Feature Extraction Network Based on CNN and Transformer for Building Extraction. Remote Sens. 2023, 15, 2689. [Google Scholar] [CrossRef]
- Liu, Y.H. Feature extraction and image recognition with convolutional neural networks. J. Phys. Conf. Ser. 2018, 1087, 062032. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Guo, H.; Li, X.; Tang, J.; Guo, Y.; Zhao, X. Adaptive Feature Fusion for Multi-modal Entity Alignment. Acta Autom. Sin. 2024, 50, 758–770. [Google Scholar] [CrossRef]
- Guo, L.; Chen, Z.; Chen, J.; Chen, H. Revisit and outstrip entity alignment: A perspective of generative models. arXiv 2023, arXiv:2305.14651. [Google Scholar]
- Li, Q.; Guo, S.; Luo, Y.; Ji, C.; Wang, L.; Sheng, J.; Li, J. Attribute-consistent knowledge graph representation learning for multi-modal entity alignment. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 2499–2508. [Google Scholar]
- Li, Q.; Ji, C.; Guo, S.; Liang, Z.; Wang, L.; Li, J. Multi-modal knowledge graph transformer framework for multi-modal entity alignment. arXiv 2023, arXiv:2310.06365. [Google Scholar]
- Zhang, X.; Chen, T.; Wang, H. A Novel Method for Boosting Knowledge Representation Learning in Entity Alignment through Triple Confidence. Mathematics 2024, 12, 1214. [Google Scholar] [CrossRef]
- Li, Q.; Li, J.; Wu, J.; Peng, X.; Ji, C.; Peng, H.; Wang, L.; Philip, S.Y. Triplet-aware graph neural networks for factorized multi-modal knowledge graph entity alignment. Neural Netw. 2024, 179, 106479. [Google Scholar] [CrossRef] [PubMed]
Datasets | Entities | Relations | Relation Triples | Numerical Triples | Image | Same As |
---|---|---|---|---|---|---|
DB15K | 12,842 | 1345 | 89,197 | 48,080 | 12,837 | 12,846 |
YAGO15K | 15,404 | 279 | 122,886 | 23,532 | 11,194 | 11,199 |
FB15K | 14,951 | 32 | 592,213 | 29,395 | 13,444 | - |
Method | Core Idea |
---|---|
MMEA [9] | Learns embeddings for different modalities using separate feature encoders and aligns entities by computing the similarity between joint representations. |
EVA [10] | Constructs seed dictionaries based on entity visual similarity, enabling semi-supervised or unsupervised iterative learning. |
HMEA [32] | Selects aligned entities by aggregating structural and visual representations in hyperbolic space. |
MultiJAF [24] | Employs attention-based multi-modal fusion to dynamically learn modality weights for enhanced entity alignment. |
AF2MEA [38] | Introduces visual feature processing and triplet filtering modules to mitigate structural discrepancies. |
GEEA [39] | Utilizes modality-specific variational autoencoders to generate reconstructed joint embeddings for alignment. |
MEAformer [13] | Generates dynamic modality-specific meta-weights for entities and aligns them through weighted similarity. |
MCLEA [14] | Combines intra-modality contrastive loss and inter-modality alignment loss to enhance cross-modal interactions. |
MSNEA [11] | Integrates visual features via cross-modal enhancement and multi-modal contrastive learning. |
ACK-MMEA [40] | Builds attribute-consistent knowledge graphs with relation-aware GNN aggregation. |
MoAlign [41] | Preserves modality-specific semantics using transformer encoders to address spatial misalignment. |
Confidence-MMEA [42] | Incorporates triple confidence scores to quantify assertion correctness during representation learning. |
TriFac [43] | Employs triplet-aware GNNs with two-stage graph decomposition for multi-modal alignment refinement. |
Model | Hits@1 (20%) | Hits@10 (20%) | MRR (20%) | Hits@1 (50%) | Hits@10 (50%) | MRR (50%) | Hits@1 (80%) | Hits@10 (80%) | MRR (80%)
---|---|---|---|---|---|---|---|---|---
HMEA | 12.7 | 36.9 | - | 26.2 | 58.1 | - | 41.7 | 78.6 | -
EVA | 28.9 | 54.5 | 35.2 | 45.3 | 72.9 | 53.8 | 63.5 | 85.1 | 71.6
MMEA | 26.5 | 54.1 | 35.7 | 41.7 | 70.3 | 51.2 | 59.0 | 86.9 | 68.5
MultiJAF | 21.6 | 49.2 | 30.3 | - | - | - | - | - | -
AF2MEA | 17.8 | 34.1 | 23.3 | 29.5 | 50.3 | 36.5 | - | - | -
ACK-MMEA | 30.4 | 54.9 | 38.7 | 56.0 | 73.6 | 62.4 | 68.2 | 87.4 | 75.2
MoAlign | 31.8 | 56.4 | 40.9 | 57.6 | 74.9 | 63.4 | 69.9 | 88.2 | 77.3
MEAformer | 41.7 | 71.5 | 51.8 | 61.9 | 84.3 | 69.8 | 76.5 | 91.6 | 82.0
MCLEA | 29.5 | 58.2 | 39.3 | 55.5 | 78.4 | 63.7 | 73.5 | 89.0 | 79.0
MSNEA | 11.4 | 29.6 | 17.5 | 28.8 | 59.0 | 38.8 | 51.8 | 77.9 | 61.3
Confidence-MMEA | 28.0 | 56.4 | 37.6 | 49.8 | 76.4 | 59.0 | 68.7 | 89.5 | 76.4
TriFac | 31.8 | 55.9 | 38.9 | 55.4 | 75.0 | 60.7 | 69.7 | 88.2 | 76.1
ERMF | | | | | | | | |
Model | Hits@1 (20%) | Hits@10 (20%) | MRR (20%) | Hits@1 (50%) | Hits@10 (50%) | MRR (50%) | Hits@1 (80%) | Hits@10 (80%) | MRR (80%)
---|---|---|---|---|---|---|---|---|---
HMEA | 10.5 | 31.3 | - | 26.5 | 58.1 | - | 43.3 | 80.1 | -
EVA | 25.0 | 46.2 | 33.5 | 47.8 | 68.3 | 56.1 | 64.0 | 84.5 | 72.5
MMEA | 23.4 | 48.0 | 31.7 | 40.3 | 64.5 | 48.6 | 59.8 | 83.9 | 68.2
MultiJAF | 20.1 | 43.8 | 30.4 | - | - | - | - | - | -
AF2MEA | 21.7 | 40.2 | 28.2 | 35.7 | 56.0 | 42.3 | - | - | -
ACK-MMEA | 28.9 | 49.6 | 36.0 | 53.5 | 69.9 | 59.3 | 67.6 | 86.4 | 74.4
MoAlign | 29.6 | 52.5 | 37.8 | 55.0 | 71.3 | 61.7 | 68.9 | 88.4 | 76.9
MEAformer | 32.7 | 59.5 | 41.7 | 56.0 | 77.8 | 63.9 | 70.3 | 87.3 | 76.6
MCLEA | 25.4 | 48.4 | 33.2 | 50.1 | 70.5 | 57.4 | 66.7 | 82.4 | 72.2
MSNEA | 10.3 | 24.9 | 15.3 | 32.0 | 58.9 | 41.3 | 53.1 | 77.8 | 63.9
Confidence-MMEA | 26.1 | 50.3 | 34.3 | 48.5 | 73.8 | 57.3 | 67.2 | 87.1 | 74.5
TriFac | 29.0 | 50.8 | 37.1 | 54.6 | 69.4 | 57.9 | 66.9 | 86.5 | 73.6
ERMF | | | | | | | | |
Model | FB15K-DB15K Hits@1 | FB15K-DB15K Hits@10 | FB15K-DB15K MRR | FB15K-YAGO15K Hits@1 | FB15K-YAGO15K Hits@10 | FB15K-YAGO15K MRR
---|---|---|---|---|---|---
ERMF | 45.2 | 74.8 | 54.9 | 41.2 | 68.1 | 49.7
= 0.0 | 41.6 | 70.7 | 50.4 | 39.5 | 66.3 | 47.8
= 0.1 | 43.2 | 72.7 | 52.8 | 40.6 | 66.9 | 48.2
= 0.2 | 43.7 | 73.3 | 53.7 | 40.9 | 67.8 | 48.9
= 0.3 | 44.3 | 74.6 | 54.1 | 41.3 | 68.0 | 49.3
= 0.4 | 45.2 | 74.8 | 54.9 | 41.2 | 68.1 | 49.7
= 0.5 | 45.0 | 73.9 | 54.6 | 39.7 | 67.9 | 49.1
= 0.6 | 44.6 | 72.7 | 52.6 | 36.5 | 65.6 | 46.6
= 0.7 | 35.4 | 64.5 | 45.8 | 27.4 | 52.4 | 37.4
= 0.8 | 30.7 | 53.4 | 35.4 | 18.5 | 43.9 | 27.3
= 0.9 | 27.5 | 48.3 | 34.5 | 17.3 | 37.8 | 24.5
= 1.0 | 24.3 | 43.2 | 29.2 | 15.9 | 34.6 | 21.1