MDSEA: Knowledge Graph Entity Alignment Based on Multimodal Data Supervision

Abstract: With the development of social media, the internet, and sensing technologies, multimodal data are becoming increasingly common. Integrating these data into knowledge graphs can help models better understand and exploit such rich sources of information. The basic idea of existing entity alignment methods for knowledge graphs is to extract features of different modalities, such as structure, text, attributes, and images, fuse these features, and then compute entity similarity across knowledge graphs from the fused representation. However, the structures, attribute information, image information, and text descriptions of different knowledge graphs often differ significantly, so directly fusing information from different modalities can easily introduce noise and degrade alignment quality. To address these issues, this paper proposes a knowledge graph entity alignment method based on multimodal data supervision. First, a Transformer is used to obtain encoded representations of knowledge graph entities. Then, a multimodal supervised method is used to learn the entity representations so that the entity vectors contain rich multimodal semantic information, enhancing the generalization ability of the learned representations. Finally, the information from different modalities is mapped into a shared low-dimensional subspace in which similar entities lie closer together, optimizing the entity alignment. Experiments on the DBP15K dataset show that the proposed method achieves the best results compared with methods such as MTransE, JAPE, EVA, and DNCN.


Introduction
With the development of cross-disciplinary research between knowledge engineering and multimodal learning, multimodal knowledge graphs (KGs) [1] have become increasingly important for helping computers understand entity background knowledge in many artificial intelligence applications, such as question answering systems [2], recommendation systems [3], natural language understanding [4], and scene graph generation [5]. In recent years, researchers have constructed numerous multimodal knowledge graphs targeting different domains and languages. Some of the widely used ones, including DBpedia, YAGO, and Freebase, store vast amounts of knowledge and can support various downstream applications. However, most real-world KGs are highly incomplete, primarily because they are often constructed from single data sources. To facilitate knowledge fusion, the task of knowledge graph entity alignment (EA) has received increasing attention from researchers [6]. EA aims to identify equivalent entities across KGs while addressing challenges such as multiple languages, heterogeneous graph structures, and different naming conventions.
Early EA was mostly heuristic, constructing entity mappings with techniques such as logical reasoning and lexical matching. Recent EA methods are usually embedding-based: they learn an embedding space for the KGs to be aligned so that similar entities are located close together while dissimilar entities are far apart, thereby mitigating heterogeneity issues [7]. Specifically, existing methods can be classified into two categories: (1) translation-based EA methods, which employ translation-based KG embeddings such as TransE [8] to capture entity structural information from relation triplets; and (2) graph neural network (GNN)-based EA methods, which primarily use graph convolutional networks (GCNs) [9] and GAT [10] to aggregate neighborhood entity features. Beyond these, the effectiveness of EA can be enhanced through various strategies, such as parameter sharing [11] (sharing entity embeddings across KGs and explicitly linking the seed alignments across multiple heterogeneous KGs), iterative learning (IL) [12] (iteratively proposing more alignment seeds from unaligned entities), attribute value encoding [13], collectively stable matching of interdependent alignment decisions [14], or guiding EA through ontology patterns [15].
Translation-based methods primarily learn embeddings under the translation assumption within each KG. For instance, TransE regards a relation r as a translation from the head entity h to the tail entity t, requiring that a correct knowledge triplet satisfy h + r ≈ t, i.e., that the L1 or L2 distance ‖h + r − t‖ between the translated head entity and the tail entity be small; in this way, the relationships among the triplets in the knowledge graph are represented through the embedding vectors. Methods like MTransE [16] and ITransE [17] introduce linear transformations to improve the EA performance on KGs with multiple mapping relationships, at the cost of increased model complexity. JAPE [18] and RSNs [19] use parameter sharing to keep the embeddings of pre-aligned entities identical. Additionally, Transedge [20] and BootEA [21] integrate entity embeddings into relational embeddings, which together serve as relational representations to solve "one-to-many" and "many-to-one" problems. However, because triple-based embeddings are constrained to single triplets, they struggle to capture global graph structure, making it difficult to achieve overall consistency during alignment.
To address the aforementioned issues, GNN-based embeddings are used to achieve local subgraph-level consistency. The first endeavor in this direction is GCN-Align [22], which uses the entity relationships in each KG to construct the GCN network structure. This method embeds multiple languages into a unified vector space and discovers entity alignments based on the distances between entities in the embedding space. However, GCN-Align is mainly aimed at aligning isomorphic graphs; its ability to process heterogeneous graphs is weak, resulting in the loss of heterogeneous edge information. Therefore, in recent years, many studies have attempted to integrate edge information into GCNs to enhance the relationship awareness of the model. MuGNN [23] and NAEA [24] introduce attention mechanisms to learn different weights for different types of relationships. HMAN [25] uses GCNs to combine multiple aspects of entity information, including topological connections, relationships, and attributes, to learn entity embeddings. RDGCN [26] merges relational information via attentional interaction between the original graph and its dual relation graph, and further captures adjacent structures to learn better entity representations. MRAEA [27] models cross-lingual entity embeddings directly by attending to the meta-semantics of each node's incoming and outgoing neighbors and their connecting relations. PSR [28] proposes a simplified graph encoder with relation graph sampling, achieving high performance, scalability, and robustness through a symmetric non-negative alignment loss and incremental semi-supervised learning. All these efforts demonstrate the importance of relation information in entity alignment. However, these methods do not consider the role of edge alignment in EA and only consider the semantic information of relations integrated into the entity embeddings. Additionally, some recent works incorporate extra external information as weak supervision signals. For instance, EVA [29] proposes a structure-aware uncertainty sampling strategy that measures the uncertainty of each entity in a KG and its impact on adjacent entities. JEANS [30] jointly represents a multilingual KG and text corpora in a shared embedding scheme and seeks to improve the alignment of entities and text with accompanying supervision signals. Furthermore, CG-MuAlign [31] employs a designed attention mechanism to collaboratively aggregate positive information from entity neighborhoods alongside comparably effective negative messages, achieving joint alignment of multiple entity types. ActiveEA [32] designs an active learning framework to create highly informative seed alignments in order to obtain more effective EA models at lower annotation cost.
As more and more research explores how to combine visual content from the internet with EA, a new trend is to associate images with entity names to enrich the information of entity pairs. Current research mainly focuses on designing fusion methods suitable for cross-modal data to achieve cross-modal EA. Chen et al. [33] generated entity representations of relational, visual, and numerical knowledge and integrated them through a multimodal knowledge fusion module. Chen et al. [34] employed a modal enhancement mechanism to integrate visual features to guide relational feature learning, adaptively assigning attention weights to capture attributes valuable for alignment. Lin et al. [35] learned individual representations from multiple modalities and then performed contrastive learning to jointly model intra- and inter-modal interactions. However, these methods learn multimodal fusion weights at the level of the whole knowledge graph, ignoring intra-modal differences between entities (such as node degree or number of relations) and inter-modal preferences (such as modality absence or ambiguity). This matters in real-world EA scenarios, since knowledge graphs (especially multimodal knowledge graphs, MMKGs) extracted from the internet or professional domains inevitably contain errors and noise, such as unrecognized images. Moreover, intra-modal feature differences and inter-modal phenomena such as modality absence, imbalance, or ambiguity are common in KGs. These shortcomings limit the robustness of such methods.
Our research found that current entity alignment methods have the following three issues: (1) Existing entity alignment methods focus mostly on traditional textual knowledge graphs. Some works embed knowledge graphs from different sources into a low-dimensional space and achieve entity alignment by calculating the similarity between entities, with good results. However, these methods only utilize single-modal data (text) and ignore other modalities (such as images), failing to fully exploit the entity feature information these other modalities contain. (2) Traditional cross-modal entity alignment methods often require extensive manual data annotation or carefully designed alignment features. For example, Zhang [36] proposed an adaptive co-attention network that used Twitter as the data source, crawled and annotated a dataset containing images, and controlled each word's preference for the images and text using gate and filter mechanisms. While such methods can achieve high alignment effectiveness, they require considerable manual annotation, wasting time and increasing labor costs; moreover, the hand-designed entity features often lack scalability and universality. (3) Multimodal pre-trained language models achieve cross-modal entity alignment by pre-training on large amounts of unlabeled data. However, these models mostly focus on global image and text features and are designed only for English text-image pairs. Pre-trained models such as CLIP do not model the fine-grained relationships between text and images, which are valuable in domain-specific multimodal knowledge graph cross-modal EA tasks. Additionally, image-text pairs often contain noise in practice.
To address these issues, this paper proposes a knowledge graph entity alignment method based on multimodal data supervision (MDSEA). It first uses a Transformer to obtain knowledge graph entity encodings. It then employs a multimodal supervised method for entity representation learning, ensuring that the entity vectors contain rich multimodal semantic information and enhancing the generalization ability of the learned representations. Finally, it maps information from different modalities into a shared low-dimensional subspace, bringing similar entities closer together and thus optimizing the entity alignment. The main contributions are as follows: (1) We propose an embedding-based cross-lingual entity alignment method that uses a Transformer to obtain knowledge graph entity encodings; under multimodal information supervision, the information of different modalities is mapped into a shared low-dimensional subspace to achieve entity alignment. (2) We propose a multimodal supervised strategy for knowledge graph entity representation learning, ensuring that the entity vectors contain rich multimodal semantic information and enhancing the generalization ability of the learned representations. (3) We evaluate the proposed method on a real cross-lingual dataset built from DBpedia. The experimental results show that the proposed method outperforms several cross-lingual entity alignment methods on Hits@1, Hits@10, and MRR. The framework is simple, fast, and highly interpretable.

Method
We define a knowledge graph as G = {E, R, A, I, T}, where E, R, A, and I denote the sets of entities, relationships, attributes, and images, respectively, and T ⊆ E × R × E is the set of relationship triplets. Given two knowledge graphs G_s = {E_s, R_s, A_s, I_s, T_s} and G_t = {E_t, R_t, A_t, I_t, T_t}, EA aims to identify equivalent entity pairs (e_s, e_t), where e_s ∈ E_s and e_t ∈ E_t. The model framework is illustrated in Figure 1. Given two multimodal knowledge bases (KBs), the model learns vector embeddings representing the different KBs and expects potentially aligned entities to be embedded close together. The algorithm proceeds as follows: (1) First, a Transformer is used to obtain encoded representations of knowledge graph entities. (2) Then, multimodal data supervision is employed to learn the entity representations, ensuring that the entity vectors contain rich multimodal semantic information and enhancing their generalization capability. (3) Finally, the embeddings of all entities are obtained, the similarities between all pairs of entities are computed and constrained using the neighborhood component analysis (NCA) loss, and iterative learning is used to expand the training set.

Transformer-Based Knowledge Graph Entity Encoding
This section describes how the entities of the two knowledge graphs G_s and G_t are embedded into low-dimensional vectors. An L_T-layer Transformer is employed as the entity encoder to extract entity features, where each layer is composed of a multi-head attention (MHA) block and a feed-forward network (FFN) block. Given a token sequence {w_1, ..., w_n} embedded into a word embedding matrix F_T ∈ R^{n×d_T}, the entity encoding at each layer is computed as

$$F_{T,0} = F_T + T_p,$$

$$\hat{F}_{T,l} = \mathrm{LN}\big(\mathrm{MHA}(F_{T,l-1}) + F_{T,l-1}\big),$$

$$F_{T,l} = \mathrm{LN}\big(\mathrm{FFN}(\hat{F}_{T,l}) + \hat{F}_{T,l}\big),$$

where T_p represents the positional embeddings, LN(·) denotes layer normalization, and F_{T,l} is the hidden feature of the entity at the l-th layer.
MHA computes the weighted hidden states of each attention head and concatenates them:

$$\mathrm{MHA}(F) = [\mathrm{head}_1; \dots; \mathrm{head}_h]\, W_o, \qquad \mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d/h}}\right) V_i,$$

where Q_i, K_i, and V_i are head-specific linear projections of F, W_o ∈ R^{d×d}, and d represents the dimensionality of the hidden embeddings.
The FFN consists of two linear transformations with a ReLU activation in between:

$$\mathrm{FFN}(x) = \mathrm{ReLU}(x W_1)\, W_2,$$

where W_1 ∈ R^{d×d_m} and W_2 ∈ R^{d_m×d}.
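For concreteness, the following is a minimal PyTorch sketch of one such post-LN encoder layer (an MHA block followed by an FFN block, each with a residual connection and layer normalization). The dimensions d = 256, h = 4 heads, and d_m = 1024 are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One post-LN Transformer encoder layer: MHA block, then FFN block,
    each followed by a residual connection and layer normalization."""
    def __init__(self, d: int = 256, n_heads: int = 4, d_m: int = 1024):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_m), nn.ReLU(), nn.Linear(d_m, d))
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.mha(x, x, x)          # F_hat = LN(MHA(F) + F)
        x = self.ln1(x + h)
        return self.ln2(x + self.ffn(x))  # F = LN(FFN(F_hat) + F_hat)
```

Stacking L_T such layers and adding the positional embeddings T_p to the token embeddings at the input reproduces the encoder defined by the equations above.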

Multimodal Supervised Learning Network
After obtaining entity encodings from the entity encoder, the entities are fine-tuned to incorporate multimodal information through multimodal supervised learning. Specifically, relationship and attribute features are extracted with a feed-forward network, graph structure information is acquired with a GCN, and image information is extracted with ResNet.
Relationship and Attribute Embedding: Since modeling relationships and attributes with GCNs can contaminate entity representations through noise from neighbors [25], a simple feed-forward network is employed to map the relationship and attribute features into a low-dimensional space:

$$F'_R = W_R F_R, \qquad F'_A = W_A F_A,$$

where W_R and W_A are parameter matrices for the relationship features F_R and attribute features F_A, respectively.

Graph Structure Embedding: To capture the structural similarity between G_s and G_t, i.e., the proximity of entities and relationships, a GCN is employed to extract graph structural information. A graph is defined as G = (V, b), where V = {v_1, v_2, ..., v_N} is the set of nodes and b is the edge set. The feature matrix X ∈ R^{N×D} comprises N feature vectors, X = [x_1, x_2, ..., x_N]^T. The sparse symmetric adjacency matrix A ∈ R^{N×N} reflects the connection between each pair of nodes, and A_ij is computed using the following radial basis function (RBF):

$$A_{ij} = \exp\!\left(-\gamma_1 \lVert x_i - x_j \rVert^2\right),$$

where the parameter γ_1 is set empirically to control the width of the RBF. The diagonal degree matrix is defined as D = diag(d_1, ..., d_N), where d_i = Σ_{j=1}^{N} A_ij is the sum of the i-th row of the adjacency matrix.
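As an illustration, the sketch below builds the RBF adjacency and degree matrices from a node feature matrix X and sets up the linear maps for the relation/attribute features; gamma1 and the layer dimensions are placeholder values, not the paper's settings.

```python
import torch

def rbf_adjacency(X: torch.Tensor, gamma1: float = 1.0):
    """A_ij = exp(-gamma1 * ||x_i - x_j||^2), plus the diagonal degree matrix D."""
    sq_dists = torch.cdist(X, X).pow(2)   # pairwise squared Euclidean distances
    A = torch.exp(-gamma1 * sq_dists)     # dense RBF affinities
    D = torch.diag(A.sum(dim=1))          # d_i = sum_j A_ij
    return A, D

# Simple linear maps for the relationship/attribute features
# (input/output dimensions here are assumptions):
W_R = torch.nn.Linear(1000, 100, bias=False)   # F'_R = W_R F_R
W_A = torch.nn.Linear(1000, 100, bias=False)   # F'_A = W_A F_A
```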
The l-th layer of the multi-layer GCN is represented as

$$H^{(l+1)} = \left[\tilde{D}^{-\frac{1}{2}}\, \tilde{M}\, \tilde{D}^{-\frac{1}{2}}\, H^{(l)} W^{(l)}\right]_+,$$

where [·]_+ denotes the ReLU activation function, \tilde{M} = M + I_N is the adjacency matrix of G_s ∪ G_t plus the identity matrix (self-connections), \tilde{D} is the corresponding diagonal degree matrix, W^{(l)} is the trainable layer-specific weight matrix, and H^{(l)} ∈ R^{N×D} is the output of the previous GCN layer, with N the number of entities and D the feature dimensionality. H^{(0)} is randomly initialized, and the output of the last GCN layer is used as the graph structure embedding F_G.
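A minimal sketch of this propagation rule (the standard Kipf–Welling GCN layer with ReLU), assuming a dense adjacency matrix M for simplicity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """H_next = ReLU(D^{-1/2} (M + I) D^{-1/2} H W)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)

    def forward(self, H: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
        M_tilde = M + torch.eye(M.size(0), device=M.device)   # self-connections
        d_inv_sqrt = M_tilde.sum(dim=1).pow(-0.5)
        M_hat = d_inv_sqrt[:, None] * M_tilde * d_inv_sqrt[None, :]  # sym. norm.
        return F.relu(M_hat @ self.W(H))
```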
Visual Embedding: A ResNet-152 pre-trained on the ImageNet recognition task serves as the feature extractor for all images. For each image I, the trainable ResNet-152 extracts image features, and the output of its last layer is taken as the feature representation to obtain the visual embedding:

$$F_I = \mathrm{ResNet}(I).$$

The visual representations extracted by ResNet are expected to capture both low-level similarity and high-level semantic correlation between images.
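A sketch of this feature extraction with torchvision is shown below; the paper keeps ResNet-152 trainable, whereas here it is frozen purely for illustration.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet-pretrained ResNet-152 with the classifier head replaced by identity,
# so the 2048-d pooled output of the last block serves as the visual embedding.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()   # frozen here for illustration; the paper keeps it trainable

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def visual_embedding(image) -> torch.Tensor:
    """Map a PIL image to its 2048-d ResNet-152 feature F_I."""
    return resnet(preprocess(image).unsqueeze(0)).squeeze(0)
```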
In multimodal supervised learning, the feature similarity matrices of the entity embeddings from the different modalities are used as supervision information, and the following objective function is minimized:

$$L_s = \sum_{i \in \{R, A, G, I\}} \left( \left\lVert S^{st}_T - S^{st}_i \right\rVert_F^2 + \gamma_2 \sum_{k \in \{s,t\}} \left\lVert S^{kk}_T - S^{kk}_i \right\rVert_F^2 \right),$$

where i ∈ {R, A, G, I} indexes the four embeddings, S^{st}_i denotes the cosine similarity matrix between the modality-i embeddings of G_s and G_t, S^{kk}_i denotes the similarity matrix within a single KB, and γ_2 is a hyperparameter used to balance the similarity between KBs against their internal similarity.
Minimizing L_s embeds semantically similar entities across the KGs tightly together.
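Since only the shape of this objective is described above, the following sketch should be read as one plausible instantiation under those assumptions rather than the paper's exact loss: it pulls the entity embeddings' similarity structure toward each modality's, both across KBs and within each KB, weighted by gamma2.

```python
import torch
import torch.nn.functional as F

def sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity matrix between two sets of row-wise embeddings."""
    return F.normalize(a, dim=1) @ F.normalize(b, dim=1).T

def supervision_loss(f_T_s, f_T_t, modal_s, modal_t, gamma2: float = 0.5):
    """Hypothetical sketch of L_s: match the entity embeddings' similarity
    structure to each modality's, across and within KBs."""
    loss = torch.tensor(0.0)
    for f_i_s, f_i_t in zip(modal_s, modal_t):          # modalities R, A, G, I
        loss = loss + (sim(f_T_s, f_T_t) - sim(f_i_s, f_i_t)).pow(2).mean()
        loss = loss + gamma2 * (
            (sim(f_T_s, f_T_s) - sim(f_i_s, f_i_s)).pow(2).mean()
            + (sim(f_T_t, f_T_t) - sim(f_i_t, f_i_t)).pow(2).mean())
    return loss
```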

Knowledge Graph Entity Alignment
First, all entity embeddings F_J obtained through multimodal data supervised learning are collected; then the similarities of all entity pairs are computed and constrained using the NCA loss. Meanwhile, IL is used to expand the training set.
Embedding Alignment: Let F_J^s and F_J^t denote the embeddings of the source entities E_s and the target entities E_t, respectively. Their cosine similarity matrix is computed as

$$S = \bar{F}_J^s \left(\bar{F}_J^t\right)^\top,$$

where the bar denotes row-wise L2 normalization and each entry S_ij is the cosine similarity between the i-th entity in E_s and the j-th entity in E_t.

NCA Loss: Inspired by the NCA-based text-image matching method proposed in [37], a similar form of NCA loss is adopted. It measures the importance of samples using local and global statistics and penalizes hard negatives with a soft weighting scheme:

$$L_{NCA} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{\alpha} \log\Big(1 + \sum_{m \neq i} e^{\alpha S_{mi}}\Big) + \frac{1}{\alpha} \log\Big(1 + \sum_{n \neq i} e^{\alpha S_{in}}\Big) - \log\big(1 + \beta S_{ii}\big) \right),$$

where α and β are hyperparameters, S_ij is the cosine similarity between entity sample pairs, and N is the number of pivots (aligned pairs) in a mini-batch. This loss is applied separately to each modality and also to the merged multimodal representation. The joint loss is written as

$$L = \sum_{i \in \{R, A, I, G, T\}} L^i + L^J,$$

where L^i is the supervised loss term under embedding i, giving the per-modality losses L^R, L^A, L^I, L^G, and L^T, and L^J is the loss applied to the multimodal representation F_J.

Iterative Learning: To improve learning with few training points, this paper adopts an IL strategy to propose more alignment seeds from unaligned entities. In each iteration, a new round of proposals is created: every pair of cross-graph entities that are mutual nearest neighbors is proposed and added to a candidate list. If a proposed entity pair remains mutual nearest neighbors for k consecutive rounds (the trial stage), it is permanently added to the training set; the candidate list is thus refreshed every K_e · K_s rounds.
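A compact PyTorch sketch of this NCA loss over a batch whose i-th source row is aligned with the i-th target row; the clamp is a numerical guard added here and is not part of the formula, and alpha/beta defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def nca_loss(src: torch.Tensor, tgt: torch.Tensor,
             alpha: float = 15.0, beta: float = 10.0) -> torch.Tensor:
    """NCA alignment loss; diagonal of S holds the ground-truth pairs."""
    S = F.normalize(src, dim=1) @ F.normalize(tgt, dim=1).T   # cosine similarities
    N = S.size(0)
    off_diag = ~torch.eye(N, dtype=torch.bool, device=S.device)
    exp_s = torch.exp(alpha * S) * off_diag
    row = torch.log1p(exp_s.sum(dim=1)) / alpha    # hard negatives in each row
    col = torch.log1p(exp_s.sum(dim=0)) / alpha    # hard negatives in each column
    pos = torch.log1p((beta * S.diag()).clamp(min=-0.99))  # reward true pairs
    return (row + col - pos).mean()
```

In training, this loss would be computed for each modality's embeddings and for F_J, and summed as in the joint loss above.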

Experiment
In this section, we conducted experiments on three subsets of the DBP15K dataset (Section 3.1), compared the entity alignment performance (Section 3.3.1) of different methods under the same experimental settings (Section 3.2), and provided an efficiency analysis of the model (Section 3.3.2). We also carried out a detailed ablation study of the different modules of MDSEA (Section 3.3.3).

Experiment Dataset
The DBP15K dataset is a multilingual dataset containing English, Chinese, Japanese, and French, as shown in Table 1. It is constructed from the multilingual versions of DBpedia, a large-scale multilingual knowledge base that includes inter-language links from English entities to entities in other languages. During the construction of DBP15K, 15,000 popular entities were extracted separately from English to Chinese, Japanese, and French, and these were used as reference alignments. The extraction strategy involved randomly selecting an inter-language link pair whose entities had at least four relation triples, and then extracting relation and attribute triples for the selected entities. The number of entities involved in each language far exceeds 15,000, with attribute triples contributing significantly to the dataset. In this experiment, three subsets of DBP15K were used: DBP15K_ZH-EN (Chinese to English), DBP15K_JA-EN (Japanese to English), and DBP15K_FR-EN (French to English). Each subset contains approximately 400,000 triples and 15,000 pre-aligned entity pairs, with 30% used as seed alignments (R_s = 0.3). The English, French, and Japanese entity images are provided by DBpedia, while the Chinese images are extracted from the original Chinese Wikipedia dumps. Not all entities have images; only around 50-85% of entities do. For entities without images, a random vector sampled from a normal distribution is assigned, parameterized by the mean and standard deviation of the available image features.
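The missing-image imputation described above can be sketched as follows; the feature-matrix layout and variable names are illustrative assumptions.

```python
import torch

def impute_missing_images(img_feats: torch.Tensor, has_image: torch.Tensor) -> torch.Tensor:
    """Replace features of image-less entities with samples from N(mu, sigma),
    where mu/sigma are per-dimension statistics of the available features."""
    mu = img_feats[has_image].mean(dim=0)
    sigma = img_feats[has_image].std(dim=0)
    missing = ~has_image
    n_missing = int(missing.sum())
    img_feats = img_feats.clone()
    img_feats[missing] = mu + sigma * torch.randn(n_missing, img_feats.size(1))
    return img_feats
```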

Experimental Parameter Settings
The experimental platform was a server equipped with an Intel i9-12900K CPU (Intel Corporation, Santa Clara, CA, USA) and an Nvidia RTX 3080Ti GPU (Nvidia Corporation, Santa Clara, CA, USA). The proposed algorithm was implemented in PyTorch using the Adam optimizer. For training on the DBP15K dataset, the number of epochs was set to 500, with a learning rate of 0.001. To ensure a fair comparison, the experimental setup employed the same training/testing split as common methods: 30% of the pre-aligned entities were used for training and the remaining 70% of anchor links for testing, with 20% of the training entity pairs reserved for validation. The visual encoder was ResNet-152 with a visual feature dimension of 2048. To reduce randomness and demonstrate the model's stability, each experiment was run 10 times and the results averaged.
In the experiments, the effectiveness of multimodal entity alignment was used as the indicator of the proposed model's performance. Specifically, three common metrics were employed: the average percentage of test entities whose correct counterpart is ranked first (Hits@1), the average percentage ranked within the top 10 (Hits@10), and the mean reciprocal rank (MRR).
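These metrics can be computed from the test-time similarity matrix as in the following sketch, where the diagonal entries are assumed to hold the ground-truth aligned pairs.

```python
import torch

def ranking_metrics(S: torch.Tensor) -> dict:
    """Hits@1, Hits@10, and MRR from a similarity matrix whose diagonal
    entries correspond to the ground-truth aligned pairs."""
    true_scores = S.diag().unsqueeze(1)
    ranks = (S > true_scores).sum(dim=1) + 1   # rank of the true counterpart
    return {
        "Hits@1": (ranks == 1).float().mean().item(),
        "Hits@10": (ranks <= 10).float().mean().item(),
        "MRR": (1.0 / ranks.float()).mean().item(),
    }
```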
The experimental parameters, including regularization parameters and network model parameters, were tuned within given ranges to maximize accuracy. To this end, 10-fold cross-validation was performed on the training set to determine the parameter combinations for the different methods. Additionally, feature dimensionality was identified as a key parameter affecting the quality of the final learned feature representations; the optimal dimensionality was therefore determined by testing values from 100 to 500 at intervals of 5, based on the best performance on the training set.

Ablation Study
To demonstrate the effectiveness of each module in the proposed method, two variants of MDSEA, MDSEA-A and MDSEA-B, were constructed, and the final alignment performance under different module combinations was compared. The module selections are shown in Table 4, where MDS denotes the multimodal data supervision module and MWF denotes the multimodal weighted fusion module. The ablation results are presented in Table 5 and Figure 3. Compared to MDSEA-A, MDSEA-B improves Hits@1 on DBP15K_ZH-EN by 1.82%, indicating that the multimodal supervision strategy enriches the entity vector representations with multimodal semantic information and enhances the generalization ability of the learned representations. Compared to MDSEA-B, MDSEA improves Hits@1 on DBP15K_ZH-EN by a further 1.74%, which is attributed to the multimodal weighted fusion strategy reducing the noise from the different modalities.

Conclusions
This article proposes a knowledge graph entity alignment method based on multimodal supervised learning. First, a Transformer is used to obtain encoded representations of knowledge graph entities. Then, a multimodal supervised learning approach is employed for entity representation learning, ensuring that the entity vectors contain rich multimodal semantic information and enhancing the generalization capability of the learned representations. Finally, information from different modalities is mapped into a shared low-dimensional subspace, bringing similar entities closer together and thus optimizing the entity alignment. The proposed method is compared with common entity alignment methods, and the results demonstrate its superiority over state-of-the-art baselines. However, because some visual modalities are missing from the dataset, the model's multimodal supervised learning is limited to some extent; we will study this issue in depth in future work.

Figure 1. The framework of the proposed MDSEA.

Figure 2. Comparison of efficiency results of different methods on DBP15K_ZH-EN.

Table 4. Selection of different modules in MDSEA.

Table 5. The alignment effect of different modules on DBP15K_ZH-EN.