Article

A Multi-Modal Entity Alignment Method with Inter-Modal Enhancement

1 College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, China
2 Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan 430065, China
3 Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content, National Press and Publication Administration, Beijing 100038, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Big Data Cogn. Comput. 2023, 7(2), 77; https://doi.org/10.3390/bdcc7020077
Submission received: 22 March 2023 / Revised: 13 April 2023 / Accepted: 14 April 2023 / Published: 18 April 2023

Abstract
To exploit the inter-modal effects hidden across modalities and to mitigate the impact of weak modalities on multi-modal entity alignment, a Multi-modal Entity Alignment Method with Inter-modal Enhancement (MEAIE) is proposed. This method introduces a unique modality, the numerical modality, and applies a numerical feature encoder to encode it. In the feature embedding stage, visual features are used to enhance entity relation representation and to influence the weight distribution of entity attributes. Attention layers and contrastive learning are then introduced to strengthen inter-modal effects and mitigate the impact of weak modalities. To evaluate the performance of the proposed method, experiments are conducted on three public datasets, FB15K, DB15K, and YG15K, combined in pairs. Compared with current state-of-the-art multi-modal entity alignment models, the proposed model achieves improvements of 2% in Top-1 Hit Rate (Hit@1) and 3% in Mean Reciprocal Rank (MRR), demonstrating its feasibility and effectiveness.

1. Introduction

In recent years, multi-modal knowledge graphs have gradually emerged which express knowledge of the natural world in multiple forms such as text, image, and audio. Their emergence has driven the development of applications such as question-answering and recommender systems [1,2,3]. In addition, the application of knowledge graphs to various domains is discussed more broadly, for example, in the area of computer security [4,5]. Because the real world contains a vast scope of knowledge and most multi-modal knowledge graphs are incomplete, their knowledge can complement each other. Specifically, knowledge graphs are often composed of multiple types of information, such as entities, attributes, and relationships. However, because acquiring and maintaining this information is a complex and expensive task, most multi-modal knowledge graphs are typically incomplete. This means essential information, such as relationships between entities and the ranges of attribute values, may be missing from the knowledge graph, which limits the practical applications of knowledge graphs in the real world. For example, consider two commonly used knowledge graph datasets: DB15K and FB15K. In DB15K, if we want to obtain information on all actors who have played the role of The Flash in films, the dataset only contains information about the movies in which the actors appeared but lacks basic information such as their hobbies. On the other hand, FB15K contains essential information on several generations of these actors. By combining the two datasets, we could obtain both the basic information on several generations of actors and the information on the movies in which they appeared. Therefore, effective integration of the helpful knowledge from various multi-modal knowledge graphs is crucial, which has made multi-modal entity alignment a popular area of research [6].
Existing entity alignment methods for traditional knowledge graphs mainly explore the similarity of relations or graph structures. Methods based on translation-based embeddings, such as MTransE [7] and AMEAlign [8], mine the semantics of multiple relations for entity alignment, while GCN-Align [9] and OntoEA [10] model the global graph structure. However, their alignment accuracy is limited, and differences in structure and semantics between different knowledge graphs further reduce their effectiveness. Recent research has shown that using multi-modal knowledge for entity alignment performs well. Multi-modal knowledge can alleviate the problem of data sparsity, and combining different types of knowledge can reduce the impact of errors from any single type of knowledge and improve the model’s robustness. For example, MMEA [11] uses a variety of entity relations, numerical values, and images to enhance entity alignment performance. EVA [12] incorporates visual features and finds them crucial for multi-modal entity alignment; however, it relies heavily on visual knowledge, and its absence can significantly reduce entity alignment performance. MSNEA [13] proposes a modal embedding module considering relation, attribute, and image knowledge; after obtaining the feature representation of each modality, it performs modal feature fusion to mitigate data noise resulting from the absence of specific modalities and further improve entity alignment performance. However, our experiments show that relations, attributes, and images alone are not always sufficient to identify equivalent entities: an entity in one knowledge graph may have structure and image features similar to a non-equivalent entity in another, causing alignment errors. Therefore, the effective use of additional modalities can improve the accuracy of entity alignment, and careful consideration is needed in the selection and use of these additional modalities.
In addition, most existing methods for multi-modal entity alignment focus on directly merging or simply concatenating modal features [14] after embedding each modality, without considering cross-modal interactions during the modeling process. The interaction between different modal information plays a vital role in multi-modal representation [15], so how to introduce cross-modal effects into multi-modal entity alignment remains an open problem. Currently, most methods form single-modality feature representations and send them directly to the feature fusion stage, ignoring feature enhancement between modalities. Graph-side information such as entity attributes is usually sparse and heterogeneous; if all attributes are assigned the same weight without considering their importance to the entity, noise is introduced and entity alignment performance degrades. Moreover, most methods treat all modalities as equally important in the joint embedding process, yet weak modalities provide limited information and contribute less to entity alignment, decreasing overall alignment accuracy.
In summary, this paper proposes a multi-modal entity alignment method based on inter-modal enhancement to address the problems of missing auxiliary modalities, insufficient cross-modal effects, and excessive weak-modality influence. The main contributions of this paper are as follows:
  • To address the problem of missing modalities, this paper proposes to add a unique numerical modality based on existing additional modalities, such as structure, relation, and attribute, to improve the information on additional modalities. We extracted numerical triplets from the original dataset and sent the numerical information to the radial basis function network. We then concatenated the resulting feature vectors with attribute embeddings and combined them with entity embeddings to form numerical embeddings. In order to ensure the accuracy of the numerical embeddings, we generated negative numerical triplets by swapping aligned entities in the given positive numerical triplets. We used contrastive learning to improve the credibility of the embeddings.
  • To overcome the problem of insufficient cross-modal effects, this paper proposes a novel approach that utilizes pre-trained visual models to obtain visual features of entities and applies them to entity embeddings to enhance the representation of visual interaction relations. We also use visual feature vectors and apply attention mechanisms to allocate entity attribute weights, forming enhanced entity attribute features. Specifically, we first use existing visual models to extract the visual features of entities. These visual features are then concatenated with entity embeddings to form enhanced entity embeddings. Next, we use these enhanced entity embeddings to represent the visual interaction relations between entities, better utilizing visual information to infer relations between entities. Moreover, we also use visual feature vectors and apply attention mechanisms to allocate entity attribute weights. This way, we can adjust the attribute weights based on the visual features of entities, thereby better utilizing attribute information to infer relations between entities. By adopting this approach, we can more comprehensively and accurately describe the entity relation, enhancing knowledge graphs’ application value.
  • To address the problem of the excessive influence of weak modalities, this paper proposes a method of dynamically allocating modal weights. Specifically, we dynamically calculate the importance of each modality in the current alignment task using attention mechanisms and neural networks, thus avoiding the over-influence of weak modalities. In the modality-weight calculation, we first represent each modality by its embedding, then use a multi-layer perceptron to calculate the importance score of each modality, and finally use an attention mechanism to compute the weighted sum of modalities to obtain the weighted modality embedding representation. Through this method, we can better utilize multi-modal information to improve the accuracy and efficiency of alignment while avoiding the over-influence of weak modalities.

2. Related Work

2.1. Multi-Modal Knowledge Graph

As the form and quantity of knowledge continue to increase, researchers have proposed large-scale multi-modal knowledge graphs one after another. For example, MMKG [16] used additional forms of knowledge (mainly images) to construct a multi-modal knowledge graph. Similarly, Richpedia [17] tried various methods in 2020 to enrich the knowledge graph. Its goal is to improve the information of the knowledge graph by adding sufficient and diverse images to the text entities. Some studies also aim to add audio and other forms of knowledge to the knowledge graph to ensure its diversity. For instance, Oramas [18] uses a knowledge graph to provide information for a hybrid recommendation engine, which incorporates audio into the multi-modal knowledge graph to apply multi-modal knowledge graph techniques for music or audio recommendations. The emergence of multi-modal knowledge graphs has led to ongoing discussions on embedding modal knowledge features in knowledge graphs.

2.2. Entity Alignment

Currently, research on entity alignment can be divided into traditional entity alignment methods and multi-modal entity alignment methods. Traditional methods can be seen as single-modality, information-based entity alignment methods. For example, IPTransE [19] learns the representation of entity relations on each knowledge graph and then maps the two embedded entities into the same low-dimensional space for the entity alignment task. SEA [20] proposes a semi-supervised entity alignment method that aligns labeled entities with unlabeled ones in the knowledge graph and optimizes the knowledge graph embedding through adversarial training. AliNet [21] and GCN-Align [22] are GNN-based entity alignment methods [23,24,25] that discover the correlations between entities in the embedding space to perform entity alignment and combine structural and attribute embeddings to improve alignment accuracy. These traditional entity alignment methods are relatively easy to understand and implement and have high accuracy and stability when using high-quality data. However, they generally require a large amount of data to train the model, and their effectiveness is limited when large-scale data are lacking. Additionally, even when the entity information in the graph is sufficient, traditional entity alignment methods ignore the complementarity between modalities, resulting in decreased alignment performance.
On the other hand, multi-modal entity alignment methods use various modal information (such as text, images, and audio) to perform entity alignment and compensate for the limitations of single-modality methods [26,27]. For example, MMEA proposes an entity alignment framework that uses multi-modal data to connect the semantic relations between two entities and uses image information to supplement text information, improving the accuracy and robustness of entity alignment. EVA uses the visual similarity of entities to create a seed dictionary and provides an unsupervised solution through this dictionary [28,29], but it does not fully utilize the uniqueness of the visual information. HMEA [30] improves entity alignment performance by embedding structural, visual, and other features into hyperbolic space. MultiJAF [31] uses entity structure, attributes, and visual information to form a feature matrix, combined with the similarity matrix of entity values, to perform entity alignment tasks and further improve the handling of multi-modal data. MSNEA and MCLEA [14] also use attribute and visual knowledge and explore the relations between modalities to reduce data noise, proposing different fusion schemes to improve entity alignment accuracy. However, these methods generally underutilize the available modal information and have certain defects in their fusion schemes, leaving significant room for improvement in overall performance. This paper proposes a multi-modal joint entity alignment framework that effectively combines different modal features and performs reasonable feature fusion, thereby improving the final alignment performance.

3. Methodology

In this section, this paper first introduces a definition of the problem and then provides a detailed description of MEAIE.

3.1. Notation and Problem Definition

This section introduces the symbols used in this paper and defines the multi-modal entity alignment task. The multi-modal knowledge graph can be noted as
$G = (E, R, A, N, I, T_R, T_A, T_N, T_I)$,
where $E$, $R$, $A$, $N$, and $I$ represent the sets of entities, relations, attributes, numbers, and images, respectively. $T_R = \{E, R, E\}$, $T_A = \{E, A, V\}$, $T_N = \{E, A, N\}$, and $T_I = \{E, I\}$ represent the sets of relation triples, attribute triples, numerical value triples, and entity–image pairs, respectively. The multi-modal entity alignment task is to find the matching entity pairs $L = \{(e_M, e_N) \mid e_M \in E_M, e_N \in E_N\}$ that describe the same concept in the real world from two relatively independent knowledge graphs $G_M$ and $G_N$, in order to align the two multi-modal knowledge graphs.
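As a concrete illustration of this notation, the following minimal Python sketch (not the authors' released code; all names are hypothetical) shows one way the triple sets $T_R$, $T_A$, $T_N$, and $T_I$ of a single knowledge graph could be stored:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MultiModalKG:
    """One multi-modal knowledge graph G = (E, R, A, N, I, T_R, T_A, T_N, T_I)."""
    relation_triples: List[Tuple[str, str, str]] = field(default_factory=list)    # T_R: (head, relation, tail)
    attribute_triples: List[Tuple[str, str, str]] = field(default_factory=list)   # T_A: (entity, attribute, value)
    numeric_triples: List[Tuple[str, str, float]] = field(default_factory=list)   # T_N: (entity, attribute, number)
    images: Dict[str, str] = field(default_factory=dict)                          # T_I: entity -> image path

kg_m = MultiModalKG()
kg_m.relation_triples.append(("DB:Johnson_County,_Iowa", "isPartOf", "DB:Iowa"))
kg_m.numeric_triples.append(("DB:Johnson_County,_Iowa", "populationDensity", 82.2397597695))
kg_m.images["DB:Johnson_County,_Iowa"] = "images/johnson_county.jpg"
```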

3.2. Framework Overview

In this section, this paper proposes the MEAIE model for multi-modal entity alignment, as shown in Figure 1.
The proposed MEAIE model for multi-modal entity alignment consists of two main modules. The first is the multi-modal knowledge embedding module, which handles numerical information in addition to the existing structures, attributes, and images. This paper uses a graph attention network encoder to encode structural information and generate structural embeddings. Simultaneously, entity images are fed into a pre-trained visual model to obtain visual features. For relation embedding, enhanced representations of the head and tail entities are used to obtain relation embeddings. For attribute information, attributes are encoded into a multi-hot vector to generate attribute embeddings, while the obtained visual features influence the attribute weight allocation. For numerical embedding, numerical information is extracted from the entities and mapped into a high-dimensional space. Finally, high-confidence modality embeddings are obtained by continuously comparing negative and positive sample sets through contrastive learning. The second module, the multi-modal knowledge fusion module, employs a novel method of multi-modal knowledge fusion. This method utilizes contrastive learning to minimize the distance between cross-modal knowledge in the shared space, while attention layers dynamically allocate weights to each modality, forming a holistic embedding representation. This improves the accuracy and efficiency of entity alignment, making the fusion of multi-modal knowledge more effective.

3.3. Multi-Modal Knowledge Embedding

3.3.1. Structure Embedding

Due to the similarity of the structures of aligned entities in multi-modal knowledge graphs, graph structure information is utilized for the entity alignment task. This paper uses graph attention networks [13,15] to model the structural information of $G_M$ and $G_N$ directly. Entity $e_i$ aggregates the hidden states of its neighbors $\mathcal{N}_i$ (including a self-loop) to obtain its hidden state $h_i$, represented as
$h_i = \sigma\left( \sum_{j \in \mathcal{N}_i} a_{ij} h_j \right)$ (1)
where $h_j$ is the hidden state of entity $e_j$, $\sigma(\cdot)$ denotes the ReLU non-linear operation, and $a_{ij}$ represents the importance of entity $e_j$ to $e_i$, calculated through self-attention:
$a_{ij} = \frac{\exp\left( \eta\left( a^{T} \left[ W h_i \oplus W h_j \right] \right) \right)}{\sum_{u \in \mathcal{N}_i} \exp\left( \eta\left( a^{T} \left[ W h_i \oplus W h_u \right] \right) \right)}$ (2)
where $W \in \mathbb{R}^{d \times d}$ is the weight matrix, $a$ is a learnable parameter vector, $\oplus$ represents the concatenation operation, and $\eta$ represents the ReLU non-linear operation. This paper applies Equation (1) independently to each of the $K$ attention heads in parallel and then concatenates the resulting features to obtain the structure embedding $e_i^{g}$ for entity $e_i$:
$e_i^{g} = \big\Vert_{k=1}^{K} \sigma\left( \sum_{j \in \mathcal{N}_i} a_{ij}^{k} h_j \right)$ (3)
where $a_{ij}^{k}$ is the normalized attention coefficient obtained from the $k$-th attention head, and $\Vert$ denotes the concatenation (splicing) operation.
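For concreteness, a minimal PyTorch sketch of a multi-head graph-attention layer in the spirit of Equations (1)–(3) is given below. It is an illustrative re-implementation under our own assumptions (dense adjacency matrix, ReLU as both $\sigma$ and $\eta$), not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATStructureEncoder(nn.Module):
    """Multi-head graph attention over entity neighbours, following Eqs. (1)-(3)."""
    def __init__(self, dim: int, num_heads: int = 2):
        super().__init__()
        self.W = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(num_heads)])
        self.a = nn.ParameterList([nn.Parameter(torch.randn(2 * dim)) for _ in range(num_heads)])

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (N, dim) entity hidden states; adj: (N, N) adjacency matrix with self-loops
        outputs = []
        for W, a in zip(self.W, self.a):
            Wh = W(h)                                     # (N, dim)
            d = Wh.size(1)
            src = Wh @ a[:d].unsqueeze(1)                 # contribution of Wh_i to a^T[Wh_i || Wh_j]
            dst = Wh @ a[d:].unsqueeze(1)                 # contribution of Wh_j
            scores = F.relu(src + dst.T)                  # pairwise attention logits
            scores = scores.masked_fill(adj == 0, float("-inf"))
            alpha = torch.softmax(scores, dim=1)          # a_ij normalised over the neighbourhood, Eq. (2)
            outputs.append(F.relu(alpha @ Wh))            # Eq. (1) aggregation for one head
        return torch.cat(outputs, dim=-1)                 # Eq. (3): concatenate the K heads

# toy usage: 6 entities with 100-d states and self-loop-only adjacency
encoder = GATStructureEncoder(dim=100, num_heads=2)
e_g = encoder(torch.randn(6, 100), torch.eye(6))          # (6, 200) structure embeddings
```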

3.3.2. Visual Embedding

The visual features of the multi-modal knowledge graph provide more intuitive and visualized knowledge, which can help the model better perform the entity alignment task. Since convolutional neural networks perform well in image recognition and classification, they can effectively extract semantic information from images. Each image is fed into a pre-trained deep convolutional neural network for feature extraction, with the last fully connected layer and softmax layer removed, to obtain the entity's image embedding $e_i^{v}$ as follows:
$e_i^{v} = W_v \cdot \mathrm{PVM}(i) + b_i$ (4)
where $e_i^{v}$ represents the visual feature of entity $e_i$, $W_v$ and $b_i$ represent the trainable matrix and bias term, and $\mathrm{PVM}(\cdot)$ represents the pre-trained visual model.
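As an illustration, the following sketch extracts such visual features with a pre-trained ResNet-50 from torchvision and projects them into the embedding space. The choice of ResNet-50 and the 100-dimensional projection are our assumptions; the paper only specifies a pre-trained CNN with the final fully connected and softmax layers removed.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
pvm = nn.Sequential(*list(resnet.children())[:-1])   # drop the final FC/softmax layers
pvm.eval()

project = nn.Linear(2048, 100)                        # plays the role of W_v and b_i in Eq. (4)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def visual_embedding(image_path: str) -> torch.Tensor:
    """Return the 100-d visual embedding e_i^v for one entity image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = pvm(img).flatten(1)                    # (1, 2048) pooled CNN feature, PVM(i)
    return project(feat)                              # (1, 100) visual embedding
```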

3.3.3. Attribute Embedding

In our work, attribute feature embedding is essential because attribute knowledge can provide the names and values of an entity's attributes. First, this paper extracts all attributes in the knowledge graph into a separate data file and then performs two sets of work. On the one hand, entity attributes are treated analogously to the entity structure representation, ignoring the attribute values and extracting each attribute contained in the entity. When aligning entities from two different knowledge graphs, the two entities to be aligned may have similar attribute structures. Based on this, this work simulates the attribute structure for representation. On the other hand, entity attributes are represented as a multi-hot vector, and the entity's attributes are separately encoded [22], for example, $e_i^{av} = [a_1{:}v_1, \ldots, a_i{:}v_i, \ldots, a_j{:}v_j]$, where $e_i^{av}$ represents the attribute features of entity $e_i$, including attributes $a_i$ and values $v_i$. Subsequent entity attribute embedding generates attribute and value embeddings, adds a linear layer to average their embeddings, and maps them to a low-dimensional space:
$e_i^{av} = W_a \cdot AV + b_a$ (5)
where $e_i^{av}$ represents the attribute embedding of entity $e_i$, $W_a$ represents the trainable weight matrix, $A$ and $V$ represent the attributes and values of entity $e_i$, respectively, and $b_a$ represents the bias term.
To improve the inter-modal effects, the obtained visual features guide the weight allocation of entity attributes. Since entity attributes are usually sparse and heterogeneous, introducing attribute knowledge into the entity alignment task and treating weak attributes as equally influential as vital attributes can contaminate entity representation. Therefore, it is unreasonable to assign the same weight to attributes. Using visual representation as a unique feature to allocate weights to attributes, entity attribute features are represented as the sum of all weighted attribute embeddings corresponding to the entity:
$w_j = \frac{\exp\left( e_i^{v\,T} e_j^{av} \right)}{\sum_{c=1}^{k} \exp\left( e_i^{v\,T} e_c^{av} \right)}$ (6)
$e_i^{a} = \sum_{j=1}^{k} w_j\, e_j^{av}$ (7)
where $w_j$ represents the attention weight assigned to $e_j^{av}$, and $e_i^{a}$ represents the enhanced attribute feature embedding of entity $e_i$.
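A small sketch of this visual-guided attribute weighting (Equations (6) and (7)) follows; the tensor shapes and toy dimensions are assumptions for illustration only.

```python
import torch

def visual_guided_attribute_embedding(e_v: torch.Tensor, attr_embs: torch.Tensor) -> torch.Tensor:
    """e_v: (d,) visual feature of entity e_i; attr_embs: (k, d) its k attribute/value embeddings.
    Returns the enhanced attribute embedding e_i^a as the visually weighted sum."""
    scores = attr_embs @ e_v                              # dot products e_i^{vT} e_j^{av}
    weights = torch.softmax(scores, dim=0)                # w_j, Eq. (6)
    return (weights.unsqueeze(1) * attr_embs).sum(dim=0)  # e_i^a, Eq. (7)

# toy usage: one entity with five 100-d attribute embeddings
e_a = visual_guided_attribute_embedding(torch.randn(100), torch.randn(5, 100))   # (100,)
```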

3.3.4. Relation Embedding

As an essential component of the multi-modal knowledge graph, relations are crucial in multi-modal entity alignment tasks. Two entities that exhibit similar relations to other entities are likely to be similar. In this work, since the structural embedding already uses a graph attention network to form the graph embedding, for simplicity and consistency, relation triples are modeled by requiring the embedding of the tail entity to be as close as possible to the embedding of the head entity plus the embedding of the relation. Additionally, to increase cross-modal effects, entity features are enhanced with the generated visual features to improve relation learning:
$x = W_i \cdot e_x^{v} + b_x, \quad x \in \{h, t\}$ (8)
where $x \in \{h, t\}$ denotes the entity vectors of the head and tail entities, respectively, $W_i$ and $b_x$ represent the weight matrix and bias term, and $e_x^{v}$ represents the image feature. The visual information is fused with the semantic information to enhance the semantic representation, and the corresponding loss function is represented as
$f(h, r, t) = \left\| h + r - t \right\|$ (9)
where $r$ is the relational feature, and the final relation embedding is expressed as $e_r$. By extracting relation triples and continually forming positive and negative samples for contrastive learning, the entity relation representation is enhanced, forming the relation feature representations.
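A minimal PyTorch sketch of this visually enhanced, TransE-style relation scoring is given below. Adding the projected image feature to the entity embedding is one plausible reading of Equation (8); the exact fusion used by the authors may differ.

```python
import torch
import torch.nn as nn

class VisualEnhancedRelationScore(nn.Module):
    """Score relation triples with image-enhanced head/tail entities, in the spirit of Eqs. (8)-(9)."""
    def __init__(self, dim: int = 100):
        super().__init__()
        self.visual_proj = nn.Linear(dim, dim)            # W_i and b_x applied to image features

    def enhance(self, e: torch.Tensor, e_v: torch.Tensor) -> torch.Tensor:
        return e + self.visual_proj(e_v)                  # fuse visual with semantic representation

    def forward(self, h, h_v, r, t, t_v) -> torch.Tensor:
        h = self.enhance(h, h_v)
        t = self.enhance(t, t_v)
        return torch.norm(h + r - t, p=2, dim=-1)         # f(h, r, t) = ||h + r - t||, Eq. (9)

# toy usage: a batch of 8 triples with 100-d embeddings and image features
score = VisualEnhancedRelationScore()
parts = [torch.randn(8, 100) for _ in range(5)]           # h, h_v, r, t, t_v
print(score(*parts).shape)                                # torch.Size([8])
```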

3.3.5. Numerical Embedding

This paper extracts numerical attribute triplets separately to form numerical knowledge embeddings. Numerical features can help in cases where the other modalities alone cannot identify the equivalent entity in another knowledge graph. For example, for the entity DB:Johnson_County,_Iowa in $KG_1$, the goal is to find the equivalent entity FB:Johnson_County,_Iowa in $KG_2$. This paper first performs embedding based on structure, attribute, and image knowledge, forms a joint embedding from the feature embedding of each kind of knowledge, calculates the similarity scores between this entity and all candidate entities, and finds two candidate entities in $KG_2$ with very close similarity scores, FB:Iowa_City,_Iowa and FB:Johnson_County,_Iowa, with scores of 0.695 and 0.689, respectively. This means that the specified entity of $KG_1$ does not preferentially match the correct equivalent entity in $KG_2$. However, if the numerical modality is added, the numerical information provided by the numerical triples (populationDensity, 82.2397597695) and (areaLand, 1590252699.75) helps to quickly and correctly match FB:Johnson_County,_Iowa. Therefore, the numerical modality, as a powerful auxiliary modality, can help us identify equivalent entities accurately.
For numerical information feature processing, since numerical information is always sparse in knowledge graphs, the radial basis function is used to process the numerical information of entities. The radial basis function neural network can approximate any nonlinear function, can handle difficult-to-analyze regularities in the system, has good generalization ability, and can convert numerical information into embeddings in high-dimensional space:
$\phi\left( n(e_g, a_i) \right) = \exp\left( -\frac{\left( n(e_g, a_i) - c_i \right)^2}{2\sigma_i^2} \right)$ (10)
where $n(e_g, a_i)$ denotes the numerical value of the corresponding numerical triple, $a_i$ denotes the attribute key, $c_i$ represents the center of the radial kernel, and $\sigma_i^2$ represents the variance. First, the numerical values of each entity's numerical triplets are normalized, and then training is conducted in the radial basis function neural network. After training, this paper extracts the embedding of the attribute key of the numerical triplet and concatenates it with the numerical vector obtained from the radial basis function neural network. The credibility of the numerical embedding is measured by the scoring function defined in Formula (11):
$f_{num}\left( e_g, a, n \right) = \left\| e_g - \tanh\left( \mathrm{vec}\left( \mathrm{CNN}\left( e_{an} \right) \right) W \right) \right\|_2^2$ (11)
where $e_{an}$ denotes the embedding of the entity attribute key combined with the numerical embedding generated by the corresponding radial basis neural network, $\mathrm{CNN}$ denotes the convolutional layer, and $W$ indicates the fully connected layer. The features are then mapped as a vector into the embedding space, denoted $e_n$. The loss function is
$L_{num} = \sum_{(e_g, a, n) \in Z} \log\left( 1 + \exp\left( f_{num}\left( e_g, a, n \right) \right) \right)$ (12)
where Z represents the set of numerical triples in the numerical dataset. Since the aligned entities in the relevant numerical triplets represent the same objects in the real world, they have the same numerical features. This property is leveraged to promote the representation of numerical information during contrastive learning training.
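To make the numerical encoding concrete, here is a small sketch of the radial-basis-function mapping in Equation (10); the number of kernels, their centers, and their widths are assumptions chosen for illustration.

```python
import torch

def rbf_encode(values: torch.Tensor, centers: torch.Tensor, sigmas: torch.Tensor) -> torch.Tensor:
    """Map scalar numerical attribute values into a higher-dimensional space, Eq. (10).
    values: (n,) normalised numbers; centers, sigmas: (m,) kernel parameters -> output (n, m)."""
    diff = values.unsqueeze(1) - centers.unsqueeze(0)        # (n, m) distances to kernel centres
    return torch.exp(-diff.pow(2) / (2.0 * sigmas.pow(2)))   # phi(n(e_g, a_i)) per kernel

# toy usage: two normalised values (e.g. scaled population density and land area)
values = torch.tensor([0.082, 0.159])
centers = torch.linspace(0.0, 1.0, steps=16)                 # 16 evenly spaced kernel centres
sigmas = torch.full((16,), 0.1)
num_feats = rbf_encode(values, centers, sigmas)              # (2, 16) numerical feature vectors
```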

3.4. Feature Processing Fusion Module

Two aspects of work have been done for the feature processing and fusion module. On the one hand, contrastive learning is applied to the intra-modality representations to enhance the feature representation within each modality and learn the intra-modality dynamics, providing discriminative boundaries for each modality in the embedding space. However, the modality embeddings obtained after applying only the intra-modality contrastive loss are not sufficiently consistent across modalities, making it challenging to model the interactions between modalities during feature fusion. Therefore, the knowledge of the joint embedding is reinserted into the single-modality embeddings to allow each single modality to better utilize the embedding information of the other modalities. A multi-modal contrastive learning module is set up in the feature processing fusion module, introducing a contrastive loss. Positive and negative samples are set up for each modality to perform contrastive learning and minimize the loss. This paper encodes similar representations for positive entity pairs and different representations for negative entity pairs using the loss formula shown below:
$L_{cl}\left( E, E' \right) = \frac{1}{2N} \sum_{n=1}^{N} \left[ Y d^2\left( e_x, e'_x \right) + (1 - Y) \max\left( \delta_{cl} - d\left( e_x, e'_x \right), 0 \right)^2 \right], \quad x \in \{g, r, a, v\}$ (13)
where $Y$ represents the label of the entity pair, $d$ denotes the cosine similarity of the entity embeddings, $N$ indicates the number of batch samples, $\delta_{cl}$ represents the margin hyperparameter, $e_x \in E$ and $e'_x \in E'$ represent the corresponding entities in the two knowledge graphs $G_M$ and $G_N$, and $g$, $r$, $a$, and $v$ represent structure, relation, attribute, and image, respectively. The overall loss is defined as follows:
$L = L_{cl}^{g} + L_{cl}^{r} + L_{cl}^{a} + L_{cl}^{v} + L_{cl}^{n}$ (14)
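A sketch of this margin-based contrastive loss (Equation (13)) in PyTorch follows. Here $d$ is implemented as a cosine distance (one minus the cosine similarity) so that the margin term behaves as in a standard contrastive loss; this reading is our assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(e_x: torch.Tensor, e_x_prime: torch.Tensor,
                     y: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """e_x, e_x_prime: (N, d) modality embeddings from the two KGs; y: (N,) 1 for aligned pairs, 0 otherwise."""
    d = 1.0 - F.cosine_similarity(e_x, e_x_prime, dim=1)       # cosine-based distance between pair members
    pos = y * d.pow(2)                                         # pull aligned pairs together
    neg = (1.0 - y) * torch.clamp(margin - d, min=0.0).pow(2)  # push non-aligned pairs beyond the margin
    return (pos + neg).sum() / (2.0 * y.numel())               # 1/(2N) sum, Eq. (13)

# toy usage: a batch of 4 structure embeddings per KG, first two pairs aligned
loss = contrastive_loss(torch.randn(4, 100), torch.randn(4, 100), torch.tensor([1., 1., 0., 0.]))
```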
On the other hand, after the above operations are completed, feature embedding fusion is needed for each modality. Considering that previous models concatenated the feature embeddings of each modality, which caused the same weight for each modality in the joint embedding, this may lead to poor entity alignment results due to the excessive influence of weak modalities [13]. To address this issue, this paper adds self-attention layers to dynamically allocate the weights of each modality during joint embedding, thereby avoiding the overwhelming influence of weak modalities. First, this paper generates the overall representation:
$e_{all} = e_g \,\Vert\, e_r \,\Vert\, e_a \,\Vert\, e_n \,\Vert\, e_v$ (15)
where $e_{all}$ denotes the overall representation, $e_g$, $e_r$, $e_a$, $e_n$, and $e_v$ denote the structure, relation, attribute, numerical, and visual representations, respectively, and $\Vert$ denotes the splicing operation. After generating the overall embedding, the joint embedding data are fed into the transformer module, and each attention head is operated according to the following equations:
$q_r = W_Q\, e_{all}$ (16)
$k_r = W_K\, e_{all}$ (17)
$v_r = W_V\, e_{all}$ (18)
$e = \mathrm{selfattention}\left( q_r, k_r, v_r \right)$ (19)
where $q_r$, $k_r$, and $v_r$ are the query, key, and value representations, $W_Q$, $W_K$, and $W_V$ denote the respective weight matrices, selfattention is the attention function, and $e$ is the joint embedding after dynamic weight updating and combination.
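The sketch below illustrates this step with PyTorch's built-in multi-head self-attention, treating the five modality embeddings of an entity as a short sequence; the internal projections play the role of $W_Q$, $W_K$, and $W_V$. The dimensions and head count are assumptions.

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Re-weight the five modality embeddings with self-attention before forming the joint embedding."""
    def __init__(self, dim: int = 100, num_heads: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, e_g, e_r, e_a, e_n, e_v):
        x = torch.stack([e_g, e_r, e_a, e_n, e_v], dim=1)   # (batch, 5, dim) modality sequence
        out, weights = self.attn(x, x, x)                   # weights: learned importance of each modality
        return out.reshape(out.size(0), -1)                 # concatenated, re-weighted joint embedding

# toy usage: a batch of 4 entities with five 100-d modality embeddings each
fusion = ModalityFusion()
joint = fusion(*[torch.randn(4, 100) for _ in range(5)])    # (4, 500)
```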

4. Experiments

4.1. Experimental Settings

Datasets. Experiments in this paper utilize three public knowledge graph datasets: FB15K, DB15K, and YG15K. FB15K (Freebase 15K) is a widely used subset of the Freebase knowledge graph, containing 14,951 entities, 592,213 relation triples, 29,395 attribute triples, and 13,444 images; it covers real-world entities and relations such as people, organizations, locations, and times. DB15K is a subset of DBpedia containing 12,842 entities, 89,197 relation triples, 48,080 attribute triples, and 12,837 images, with relations such as "place_of_birth" and "place_of_death". YG15K is a subset of YAGO3 containing 15,404 entities, 122,886 relation triples, 23,532 attribute triples, and 11,194 images; YAGO3 incorporates GeoNames, a geospatial entity library covering locations and geographical features worldwide. These datasets have been widely used in multi-modal entity alignment tasks because of their large scale and diverse domains, making them among the most representative datasets for multi-modal entity alignment.
To ensure the effectiveness of the entity alignment task, the preparation stage of the experiment combines these three public datasets pairwise to form a diverse set of examples. These example datasets aim to cover various attributes, relations, and image information to provide sufficient diversity. These example datasets are used to measure the effectiveness of entity alignment, and their statistical data are shown in Table 1.
Evaluation Metrics. This paper evaluates all models using cosine similarity to measure the similarity between two entities, with Hits@n, MRR, and MR as evaluation metrics. Hits@n represents the proportion of correct entities ranked within the top n by cosine similarity, MR is the average rank of the correct entities, and MRR is the average reciprocal rank of the correct entities. The formulas for the three metrics are shown in Equations (20)–(22):
$MRR = \frac{1}{|S|} \sum_{i=1}^{|S|} \frac{1}{rank_i} = \frac{1}{|S|} \left( \frac{1}{rank_1} + \frac{1}{rank_2} + \cdots + \frac{1}{rank_{|S|}} \right)$ (20)
$MR = \frac{1}{|S|} \sum_{i=1}^{|S|} rank_i = \frac{1}{|S|} \left( rank_1 + rank_2 + \cdots + rank_{|S|} \right)$ (21)
$Hit@n = \frac{1}{|S|} \sum_{i=1}^{|S|} \mathbb{I}\left( rank_i \le n \right)$ (22)
where $S$ denotes the set of test entity pairs, $\mathbb{I}(\cdot)$ denotes the indicator function (its value is 1 if the condition is true and 0 otherwise), and $rank_i$ is the rank of the correct counterpart for the $i$-th entity. Higher values of Hits@n and MRR indicate better entity alignment performance of the model, while a lower value of MR indicates the same.
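These metrics can be computed directly from the rank of the correct counterpart for each test entity, as in the short sketch below (our own helper, not the evaluation script used in the paper).

```python
import numpy as np

def alignment_metrics(ranks: np.ndarray, ns=(1, 5, 10)) -> dict:
    """Compute MRR, MR, and Hits@n (Eqs. (20)-(22)) from 1-based ranks of the correct entities."""
    ranks = ranks.astype(float)
    metrics = {"MRR": float(np.mean(1.0 / ranks)), "MR": float(np.mean(ranks))}
    for n in ns:
        metrics[f"Hits@{n}"] = float(np.mean(ranks <= n))   # fraction of correct entities in the top n
    return metrics

# toy usage: ranks of the correct counterparts for five test entity pairs
print(alignment_metrics(np.array([1, 3, 2, 15, 1])))
```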
Implementation Details. The initial phase of the experiment started with a data pre-processing operation on the dataset. We normalized the image data using Z-score normalization, which computes the mean and standard deviation of the pixel values and transforms them into a distribution with mean 0 and standard deviation 1; this makes the pixel values of the images more comparable and improves training stability and convergence. In addition, the numerical information in the dataset was normalized so that the range of values is limited to [0, 1], and duplicate and missing records in the dataset were carefully screened and removed to ensure the accuracy of the experiments.
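For reference, a minimal sketch of the two normalization steps described above is shown here; the per-image granularity of the Z-score and the epsilon guards are our assumptions.

```python
import numpy as np

def zscore_normalize(image: np.ndarray) -> np.ndarray:
    """Transform pixel values to zero mean and unit standard deviation."""
    return (image - image.mean()) / (image.std() + 1e-8)     # epsilon guards against constant images

def minmax_normalize(values: np.ndarray) -> np.ndarray:
    """Scale numerical attribute values into [0, 1]."""
    vmin, vmax = values.min(), values.max()
    return (values - vmin) / (vmax - vmin + 1e-8)

# toy usage
img = np.random.randint(0, 256, size=(224, 224, 3)).astype(np.float32)
norm_img = zscore_normalize(img)
areas = np.array([1590252699.75, 12000.0, 98765.4])
norm_areas = minmax_normalize(areas)
```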
This paper conducted all experiments on the two datasets with relevant parameter settings. First, this paper initialized the knowledge embeddings in the knowledge graph to limit the scope of subsequent operations. This paper set the embedding size for all models to 100 and used a mini-batch method with a batch size of 512. For each experiment, this paper trained the model for 1000 epochs and set the corresponding learning rates for learning. Additional experimental model parameters are shown in Table 2.

4.2. Existing Methods

To validate the effectiveness and advancement of our method, this paper needs to compare it with state-of-the-art entity alignment methods, which can be classified into traditional entity alignment methods and multi-modal entity alignment methods.
Traditional entity alignment methods include:
  • MTransE: Embeds different knowledge graphs into other embedding spaces to provide a transformation for aligning entities.
  • GCN-Align: Performs entity alignment by combining structure and entity attribute information through graph convolutional neural networks.
  • SEA [20]: Proposes a semi-supervised entity alignment method that aligns labeled entities and rich unlabeled entity information and improves knowledge graph embedding through adversarial training.
Multi-modal entity alignment methods include:
  • MMEA: Generates entity representations of relation knowledge, visual knowledge, and numerical knowledge and then maps the multi-modal knowledge embeddings from their respective embedding spaces to a common area for entity alignment.
  • EVA: Proposes the importance of visual knowledge and combines it with multi-modal information to form a joint embedding for entity alignment.
  • MultiJAF [31]: Introduces a separate numerical processing module and predicts entity similarity based on the similarity matrix formed by the numerical module, combined with knowledge embedding fused with structural attributes and visual knowledge.
  • MSNEA: Considers the importance of visual knowledge and uses it to influence the embeddings of other modalities and proposes a contrastive learning optimization model to improve the alignment effect.
  • MCLEA: Introduces separate encoders for each modality to form knowledge embeddings and proposes a contrastive learning scheme to establish interactions within and between modalities to improve entity alignment.
This paper obtained the results of these baselines by running their publicly released GitHub code under the default configurations.

4.3. Results and Analysis

4.3.1. Overall Results

Our MEAIE was compared with several state-of-the-art entity alignment methods to demonstrate the proposed model’s effectiveness and superiority. Table 3 and Table 4 show the performance of all methods trained with 20% alignment seeds on the combined datasets FB15K-DB15K and FB15K-YG15K.
Table 3 shows that MEAIE achieves remarkable results in the entity alignment task by enhancing entity representations through cross-modal effects and adding dynamic modal weights, and it leads on all evaluation metrics except MR. MR is excluded from this comparison because it only considers the average rank of the matching entities without evaluating how accurately the model ranks the correctly matched entity: a model may obtain a reasonable MR while still ranking the correct matching entity below competing candidates, in which case its matching performance is actually not good. In contrast, MRR pays more attention to how highly the correctly matched entities are ranked and thus reflects the model's actual performance more accurately. MEAIE achieves good results on the FB15K-DB15K dataset. Compared with traditional entity alignment methods, MEAIE outperforms the strongest such method, SEA, by 50%, 45%, 43%, and 49% on Hit@1, Hit@5, Hit@10, and MRR, respectively, demonstrating the significant improvement of multi-modal entity alignment over traditional entity alignment. Using auxiliary modalities in multi-modal knowledge graphs can enhance entity alignment performance, validating the importance of developing auxiliary modalities for entity alignment tasks. Compared with other multi-modal entity alignment methods, such as EVA, MSNEA, and MCLEA, the proposed MEAIE model performs best. When 20% of the training seeds are provided, MEAIE outperforms the state-of-the-art baselines MCLEA and MSNEA by at least 1.5% on Hit@1, at least 1.6% on Hit@5, at least 2.9% on Hit@10, and at least 3.2% on MRR, validating the novelty and effectiveness of the proposed MEAIE model.
When comparing MEAIE with MMEA and MultiJAF, all three models process the numerical modality; however, the other two ignore cross-modal effects and the impact of weak modalities, which this paper improves upon. The final experimental results show an improvement of at least 14% in Hit@1, 15% in Hit@5, and 17% in Hit@10, as well as an increase of at least 6% in MRR. This demonstrates the necessity of introducing the cross-modal enhancement mechanism and the attention layers, and the rationality of the selected modal knowledge and fusion method. However, it was discovered during the experiments that some entity images were missing from the knowledge graph, leaving those entities without visual knowledge; the absence of visual features therefore affected the final entity alignment performance. This paper used a strategy of replacing missing visual features with zero vectors, which could neither enhance the representation of entity relations nor correctly assign attribute weights, so the improvement for such entities was limited.
From Table 4, the proposed MEAIE also achieves good experimental results on the FB15K-YG15K dataset, with Hit@1, Hit@5, Hit@10, and MRR scores of 46%, 63%, 69%, and 0.534, respectively. Compared to FB15K-DB15K, the entity alignment performance of all methods on FB15K-YG15K is generally lower, due to the heterogeneity of the two datasets' structures and other factors. However, the MEAIE model still achieves state-of-the-art performance with a clear margin, demonstrating good generalization and robustness in dealing with heterogeneous data in multi-modal knowledge graph entity alignment. Additionally, it is observed that EVA's performance on the FB15K-YG15K dataset declines significantly because its multi-modal fusion approach is not well suited to this dataset, resulting in poor results. In contrast, the MEAIE model improves alignment performance through contrastive learning and the attention layers that effectively fuse the modal knowledge.

4.3.2. Ablation Study

To investigate the impact of each component of the proposed MEAIE model on entity alignment, this section designed two sets of variants for ablation experiments: (1) MEAIE with one modality removed at a time, covering the relation, attribute, visual, and numerical modalities, i.e., w/R, w/A, w/V, w/N; (2) MEAIE without the attention mechanism, i.e., simply concatenating the modality embeddings without dynamic modal weights, i.e., w/DW. Figure 2 shows the experimental results.
The first set of variables reveals that every modality contributes to entity alignment. Notably, visual knowledge significantly impacts entity alignment, as evidenced by the substantial decrease in Hit@1, Hit@10, and MRR. This is because, in this paper, we leveraged visual knowledge to enhance entity relations and allocate attribute weights, introducing inter-modality effects. Thus, the impact of visual knowledge is the greatest among all variables, which is consistent with the characteristics of the proposed model. Concerning the additional numerical modality introduced in this paper, the experimental results showed a slight decrease in Hit@1, Hit@10, and MRR when the numerical modality was missing, further demonstrating the feasibility of adding a numerical modality.
In the second set of variants, this paper demonstrates that introducing an attention layer is beneficial for the entity alignment task. The main reason is that it avoids the excessive influence of weak modalities, allowing strong modalities to occupy a higher weight and weak modalities a relatively smaller weight, thereby further improving the effectiveness of entity alignment after the joint embedding is formed. Similar effects were observed on the FB15K-YG15K dataset in the same ablation experiments, so this paper does not go into further detail here.

4.3.3. Seed Sensitivity

To evaluate the sensitivity of the MEAIE model to pre-aligned entities, based on existing research, this paper uses 20%, 50%, and 80% of the alignment seeds as training sets for the entity alignment task. Figure 3 displays the training results of the model for different alignment seed proportions on the FB15K-DB15K dataset. The experimental results show that the MEAIE model achieved excellent results in almost all metrics and ratios.
Specifically, in the experimental preparation phase, sensitivity experiments were conducted on the seed entity parameters of the multi-modal entity alignment methods. Through the experiments, it was found that MMEA exhibited relatively poor performance when training with pre-aligned seeds, because its network structure is fairly simple and has limited fitting ability, resulting in weak dependence on pre-aligned entities. MEAIE showed a significant improvement in Hit@1, Hit@10, and MRR compared to the MCLEA model, and the entity alignment performance of the MEAIE model gradually improves as the training seed ratio increases. Furthermore, the graph shows that the MSNEA model obtained its most outstanding results when the seed ratio reached 80%, with Hit@10 and MRR even higher than those of the MEAIE model, indicating that MSNEA only reaches a comparably high level when a high proportion of seed pairs is available, whereas the MEAIE model performs well even with a limited number of pre-aligned entities.

5. Conclusions

Our work proposes a new attention-based multi-modal entity alignment model for entity alignment. The model utilizes the information from each modality of a multi-modal knowledge graph and encodes each modality using a specific encoder to form a single-modality embedding. To address the multi-modal effect, the model enhances the entity relations with visual knowledge, guides the attention allocation of attributes, enhances the features of each modality through contrastive learning, and finally forms a joint embedding by concatenating the embeddings of each modality. The model introduces self-attention layers to dynamically assign attention weights to each modality in the joint embedding, avoiding the excessive influence of weak modalities. The proposed model, called MEAIE, then utilizes the joint embedding to perform the entity alignment task, and experimental results demonstrate its effectiveness and superiority.
While this work provides valuable insights in several aspects, it solves only part of the problem. Specifically, when the dataset lacks sufficient visual information for entities, the conclusions of this article may not be accurate or reliable enough. This work therefore relies on a rich collection of images, which is necessary to draw more accurate and reliable conclusions, and it implies the need for broader datasets and more experiments to verify the conclusions of the article and ensure its effectiveness and reliability. In future work, we plan to further improve the performance of the model by analyzing the dataset to identify entities lacking visual information and by selecting high-quality images based on the available visual information, which will help us address the cross-modal problem when converting them into visual features. In terms of data processing, we performed data cleaning and normalization on the experimental dataset, but we did not explore whether different data pre-processing techniques would produce different results; we will also improve on this aspect in the future.

Author Contributions

Conceptualization, S.Y. and Z.L.; methodology, S.Y., Z.L. and Q.L.; software, Z.L.; validation, S.Y., Z.L. and Q.L.; formal analysis, S.Y.; investigation, S.Y. and Z.L.; resources, S.Y. and Z.L.; data curation, S.Y., Z.L. and Q.L.; writing—original draft preparation, S.Y., Z.L. and Q.L.; writing—review and editing, S.Y., Z.L., Q.L. and J.G.; visualization, S.Y., Z.L. and Q.L.; supervision, S.Y. and J.G.; project administration, S.Y. and J.G.; funding acquisition, J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key Research and Development Program (grant number 2020AAA0108500), National Natural Science Foundation of China (grant number U1836118), Key Research and Development Program of Wuhan (grant number 2022012202015070), and Open Research Fund of Key Laboratory of Rich Media Digital Publishing, Content Organization and Knowledge Service (grant number ZD2022-10/05).

Data Availability Statement

The data supporting this study’s findings are available from MMKB at https://github.com/mniepert/mmkb (accessed on 21 March 2023) or upon request from the authors. The source code is available at https://github.com/850587600/MEAIE (accessed on 21 March 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sun, R.; Cao, X.; Zhao, Y.; Wan, J.; Zhou, K.; Zhang, F.; Wang, Z.; Zheng, K. Multi-modal Knowledge Graphs for Recommender Systems. In Proceedings of the CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, 19–23 October 2020; ACM: New York, NY, USA, 2020; pp. 1405–1414. [Google Scholar]
  2. Yang, S.; Zhang, R.; Erfani, S.M.; Lau, J.H. UniMF: A Unified Framework to Incorporate Multimodal Knowledge Bases into End-to-End Task-Oriented Dialogue Systems. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event, 19–27 August 2021; Morgan Kaufmann: San Mateo, CA, USA, 2021; pp. 3978–3984. [Google Scholar]
  3. Lan, Y.; He, G.; Jiang, J.; Jiang, J.; Zhao, W.X.; Wen, J. A Survey on Complex Knowledge Base Question Answering: Methods, Challenges and Solutions. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event, 19–27 August 2021; Morgan Kaufmann: San Mateo, CA, USA, 2021; pp. 4483–4491. [Google Scholar]
  4. Yin, J.; Tang, M.; Cao, J.; You, M.; Wang, H.; Alazab, M. Knowledge-Driven Cybersecurity Intelligence: Software Vulnerability Coexploitation Behavior Discovery. IEEE Trans. Ind. Inform. 2023, 19, 5593–5601. [Google Scholar] [CrossRef]
  5. You, M.; Yin, J.; Wang, H.; Cao, J.; Wang, K.N.; Miao, Y.; Bertino, E. A knowledge graph empowered online learning framework for access control decision-making. World Wide Web (WWW) 2023, 26, 827–848. [Google Scholar] [CrossRef]
  6. Ge, C.; Liu, X.; Chen, L.; Zheng, B.; Gao, Y. Make It Easy: An Effective End-to-End Entity Alignment Framework. In Proceedings of the SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 11–15 July 2021; ACM: New York, NY, USA, 2021; pp. 777–786. [Google Scholar]
  7. Chen, M.; Tian, Y.; Yang, M.; Zaniolo, C. Multilingual Knowledge Graph Embeddings for Cross-lingual Knowledge Alignment. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, 19–25 August 2017; Morgan Kaufmann: San Mateo, CA, USA, 2017; pp. 1511–1517. [Google Scholar]
  8. Shen, L.; He, R.; Huang, S. Entity alignment with adaptive margin learning knowledge graph embedding. Data Knowl. Eng. 2022, 139, 101987. [Google Scholar] [CrossRef]
  9. Wang, Z.; Lv, Q.; Lan, X.; Zhang, Y. Cross-lingual Knowledge Graph Alignment via Graph Convolutional Networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 349–357. [Google Scholar]
  10. Xiang, Y.; Zhang, Z.; Chen, J.; Chen, X.; Lin, Z.; Zheng, Y. OntoEA: Ontology-guided Entity Alignment via Joint Knowledge Graph Embedding. In Proceedings of the Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, 1–6 August 2021; Findings of ACL. Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; Volume ACL/IJCNLP 2021, pp. 1117–1128. [Google Scholar]
  11. Chen, L.; Li, Z.; Wang, Y.; Xu, T.; Wang, Z.; Chen, E. MMEA: Entity Alignment for Multi-modal Knowledge Graph. In Proceedings of the Knowledge Science, Engineering and Management—13th International Conference, KSEM 2020, Hangzhou, China, 28–30 August 2020; Proceedings, Part I. Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2020; Volume 12274, pp. 134–147. [Google Scholar]
  12. Liu, F.; Chen, M.; Roth, D.; Collier, N. Visual Pivoting for (Unsupervised) Entity Alignment. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, 2–9 February 2021; AAAI Press: Palo Alto, CA, USA, 2021; pp. 4257–4266. [Google Scholar]
  13. Chen, L.; Li, Z.; Xu, T.; Wu, H.; Wang, Z.; Yuan, N.J.; Chen, E. Multi-modal Siamese Network for Entity Alignment. In Proceedings of the KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; ACM: New York, NY, USA, 2022; pp. 118–126. [Google Scholar]
  14. Lin, Z.; Zhang, Z.; Wang, M.; Shi, Y.; Wu, X.; Zheng, Y. Multi-modal Contrastive Representation Learning for Entity Alignment. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 2572–2584. [Google Scholar]
  15. Wang, Y.; Huang, W.; Sun, F.; Xu, T.; Rong, Y.; Huang, J. Deep Multimodal Fusion by Channel Exchanging. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, 6–12 December 2020; Volume 33, pp. 4835–4845. [Google Scholar]
  16. Liu, Y.; Li, H.; García-Durán, A.; Niepert, M.; Oñoro-Rubio, D.; Rosenblum, D.S. MMKG: Multi-modal Knowledge Graphs. In Proceedings of The Semantic Web—16th International Conference, ESWC 2019, Portorož, Slovenia, 2–6 June 2019; Proceedings. Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2019; Volume 11503, pp. 459–474. [Google Scholar]
  17. Wang, M.; Wang, H.; Qi, G.; Zheng, Q. Richpedia: A Large-Scale, Comprehensive Multi-Modal Knowledge Graph. Big Data Res. 2020, 22, 100159. [Google Scholar] [CrossRef]
  18. Wang, X.; Huang, T.; Wang, D.; Yuan, Y.; Liu, Z.; He, X.; Chua, T. Learning Intents behind Interactions with Knowledge Graph for Recommendation. In Proceedings of the WWW ’21: The Web Conference 2021, Virtual Event, 19–23 April 2021; pp. 878–887. [Google Scholar]
  19. Zhu, H.; Xie, R.; Liu, Z.; Sun, M. Iterative Entity Alignment via Joint Knowledge Embeddings. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, 19–25 August 2017; Morgan Kaufmann: San Mateo, CA, USA, 2017; pp. 4258–4264. [Google Scholar]
  20. Pei, S.; Yu, L.; Hoehndorf, R.; Zhang, X. Semi-Supervised Entity Alignment via Knowledge Graph Embedding with Awareness of Degree Difference. In Proceedings of The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, 13–17 May 2019; ACM: New York, NY, USA, 2019; pp. 3130–3136. [Google Scholar]
  21. Sun, Z.; Wang, C.; Hu, W.; Chen, M.; Dai, J.; Zhang, W.; Qu, Y. Knowledge Graph Alignment Network with Gated Multi-Hop Neighborhood Aggregation. In Proceedings of The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, 7–12 February 2020; AAAI Press: Palo Alto, CA, USA, 2020; pp. 222–229. [Google Scholar]
  22. He, F.; Li, Z.; Yang, Q.; Liu, A.; Liu, G.; Zhao, P.; Zhao, L.; Zhang, M.; Chen, Z. Unsupervised Entity Alignment Using Attribute Triples and Relation Triples. In Proceedings of the Database Systems for Advanced Applications—24th International Conference, DASFAA 2019, Chiang Mai, Thailand, 22–25 April 2019; Proceedings, Part I. Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2019; Volume 11446, pp. 367–382. [Google Scholar]
  23. Zhang, Q.; Sun, Z.; Hu, W.; Chen, M.; Guo, L.; Qu, Y. Multi-view Knowledge Graph Embedding for Entity Alignment. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, 10–16 August 2019; Morgan Kaufmann: San Mateo, CA, USA, 2019; pp. 5429–5435. [Google Scholar]
  24. Shi, Y.; Wang, M.; Zhang, Z.; Lin, Z.; Zheng, Y. Probing the Impacts of Visual Context in Multimodal Entity Alignment. In Proceedings of the Web and Big Data—6th International Joint Conference, APWeb-WAIM 2022, Nanjing, China, 25–27 November 2022; Proceedings, Part II. Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2022; Volume 13422, pp. 255–270. [Google Scholar]
  25. Zhu, R.; Ma, M.; Wang, P. RAGA: Relation-Aware Graph Attention Networks for Global Entity Alignment. In Proceedings of the Advances in Knowledge Discovery and Data Mining—25th Pacific-Asia Conference, PAKDD 2021, Virtual Event, 11–14 May 2021; Proceedings, Part I. Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2021; Volume 12712, pp. 501–513. [Google Scholar]
  26. Shen, J.; Wang, C.; Gong, L.; Song, D. Joint Language Semantic and Structure Embedding for Knowledge Graph Completion. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 1965–1978. [Google Scholar]
  27. Yang, J.; Wang, D.; Zhou, W.; Qian, W.; Wang, X.; Han, J.; Hu, S. Entity and Relation Matching Consensus for Entity Alignment. In Proceedings of the CIKM ’21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, 1–5 November 2021; Demartini, G., Zuccon, G., Culpepper, J.S., Huang, Z., Tong, H., Eds.; ACM: New York, NY, USA, 2021; pp. 2331–2341. [Google Scholar]
  28. Mao, X.; Wang, W.; Wu, Y.; Lan, M. From Alignment to Assignment: Frustratingly Simple Unsupervised Entity Alignment. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event, 7–11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2843–2853. [Google Scholar]
  29. Qi, Z.; Zhang, Z.; Chen, J.; Chen, X.; Xiang, Y.; Zhang, N.; Zheng, Y. Unsupervised Knowledge Graph Alignment by Probabilistic Reasoning and Semantic Embedding. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event, 19–27 August 2021; Morgan Kaufmann: San Mateo, CA, USA, 2021; pp. 2019–2025. [Google Scholar]
  30. Guo, H.; Tang, J.; Zeng, W.; Zhao, X.; Liu, L. Multi-modal entity alignment in hyperbolic space. Neurocomputing 2021, 461, 598–607. [Google Scholar] [CrossRef]
  31. Cheng, B.; Zhu, J.; Guo, M. MultiJAF: Multi-modal joint entity alignment framework for multi-modal knowledge graph. Neurocomputing 2022, 500, 581–591. [Google Scholar] [CrossRef]
Figure 1. Structure of MEAIE.
Figure 2. Ablation experiments on the FB15K-DB15K dataset.
Figure 3. Comparison results of different seed ratios.
Table 1. Sample dataset.

Dataset       KG      Entity   Relation Triple   Attribute Triple   Image    Seed
FB15K-DB15K   FB15K   14,951   592,213           29,395             13,444   12,846
              DB15K   12,842   89,197            48,080             12,837
FB15K-YG15K   FB15K   14,951   592,213           29,395             13,444   11,199
              YG15K   15,404   122,886           23,532             11,194
Table 2. Model parameters.

Parameter                   Default Value
training set rate           0.5
number of epochs to train   1000
check_point                 100
hidden_units                128,128,128
lr                          0.005
batch_size                  512
img_dim                     100
attr_dim                    100
num_dim                     100
Table 3. Performance of entity alignment methods on FB15K-DB15K dataset (using 20% aligned seed for entity alignment performance).

Models      Hit@1 (%)   Hit@5 (%)   Hit@10 (%)   MRR
MTransE     0.365       1.514       2.532        0.013
GCN-Align   4.312       10.956      15.548       0.078
SEA         16.945      33.465      42.512       0.256
MMEA        26.482      45.133      54.107       0.357
EVA         55.591      66.644      71.587       0.609
MultiJAF    54.241      64.654      68.741       0.687
MSNEA       65.268      76.847      81.214       0.708
MCLEA       65.512      75.831      82.534       0.714
MEAIE       67.014      78.414      85.425       0.746
Table 4. Performance of entity alignment methods on FB15K-YG15K dataset (using 20% aligned seed for entity alignment performance).

Models      Hit@1 (%)   Hit@5 (%)   Hit@10 (%)   MRR
MTransE     0.312       0.977       1.832        0.012
GCN-Align   2.271       7.232       10.754       0.053
SEA         14.084      28.694      37.147       0.218
MMEA        23.391      39.764      47.999       0.317
EVA         10.257      21.667      27.791       0.164
MultiJAF    45.681      58.432      67.801       0.467
MSNEA       44.288      62.554      69.831       0.529
MCLEA       42.381      60.016      65.414       0.473
MEAIE       46.014      63.744      69.817       0.534
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
