A Multi-Modal Entity Alignment Method with Inter-Modal Enhancement

Abstract: To exploit the inter-modal effects hidden in multi-modalities and to reduce the impact of weak modalities on multi-modal entity alignment, a Multi-modal Entity Alignment Method with Inter-modal Enhancement (MEAIE) is proposed. This method introduces a unique numerical modality and applies a numerical feature encoder to encode it. In the feature embedding stage, visual features are used to enhance entity relation representation and to influence entity attribute weight distribution. Attention layers and contrastive learning are then introduced to strengthen inter-modal effects and mitigate the impact of weak modalities. To evaluate the performance of the proposed method, experiments are conducted on three public datasets: FB15K, DB15K, and YG15K. On the pairwise combinations of these datasets, the proposed model achieves 2% and 3% improvements in Top-1 Hit Rate (Hit@1) and Mean Reciprocal Rank (MRR), respectively, over current state-of-the-art multi-modal entity alignment models, demonstrating its feasibility and effectiveness.


Introduction
In recent years, multi-modal knowledge graphs, which express knowledge of the natural world in multiple forms such as text, image, and audio, have gradually emerged. Their emergence has driven the development of applications such as question-answering and recommender systems [1][2][3]. In addition, the application of knowledge graphs to various domains is discussed more broadly, for example, in the area of computer security [4,5]. Because the real world contains a vast scope of knowledge and most multi-modal knowledge graphs are incomplete, their knowledge can complement each other. Specifically, knowledge graphs are often composed of multiple types of information, such as entities, attributes, and relationships. However, because acquiring and maintaining this information is a complex and expensive task, most multi-modal knowledge graphs are typically incomplete. This means essential information, such as relationships between entities and the ranges of attribute values, may be missing from the knowledge graph, which limits the practical applications of knowledge graphs in the real world. For example, consider two commonly used knowledge graph datasets: DB15K and FB15K. In DB15K, if we want to obtain information on all actors who played the role of The Flash in the movies, the dataset only contains information about the movies in which the actors appeared but lacks basic information such as their hobbies. On the other hand, FB15K contains essential information on several generations of actors. By combining these two datasets, we could obtain both basic information on several generations of actors and information on the movies in which they appeared. Therefore, effectively integrating the helpful knowledge from various multi-modal knowledge graphs is crucial, which has made multi-modal entity alignment a popular area of research [6].
Existing entity alignment methods for traditional knowledge graphs mainly exploit the similarity of relations or graph structures. Methods based on translation-based embeddings, such as MTransE [7] and AMEAlign [8], mine the semantics of multiple relations for entity alignment, while GCN-Align [9] and OntoEA [10] model the global graph structure. However, their alignment accuracy is limited, and differences in structure and semantics between different knowledge graphs further reduce their effectiveness. Recent research has shown that using multi-modal knowledge for entity alignment performs well. Multi-modal knowledge can alleviate the problem of data sparsity, and combining different types of knowledge can reduce the impact of errors from any single type of knowledge and improve the model's robustness. For example, MMEA [11] uses a variety of entity relations, numerical values, and images to enhance entity alignment performance. EVA [12] incorporates visual features and finds them crucial for multi-modal entity alignment. However, EVA only considers visual knowledge, and the absence of visual knowledge can significantly reduce its entity alignment performance. MSNEA [13] proposes a modal embedding module considering relation, attribute, and image knowledge; after obtaining the feature representation of each modality, it performs modal feature fusion to mitigate the data noise resulting from the absence of specific modalities and further improve entity alignment performance. However, in experiments, this paper found that relation, attribute, and image knowledge alone are not always sufficient to find equivalent entities: an entity in one knowledge graph may have structure and image features similar to those of a non-equivalent entity in another, causing alignment errors.
Therefore, the effective use of additional modalities can improve the accuracy of entity alignment, and careful consideration is needed in the selection and use of those additional modalities.
In addition, most existing methods for multi-modal entity alignment focus on directly merging or simply concatenating modal features [14] after obtaining the different modal embeddings, without considering cross-modal interactions during the modeling process. The interaction between different modal information plays a vital role in multi-modal representation [15], so how to introduce cross-modal effects into multi-modal entity alignment has remained an open problem. Currently, most multi-modal entity alignment methods directly form single-modality feature representations and send them to the feature fusion stage, ignoring feature enhancement between modalities. Graph-side information, such as entity attributes, is usually sparse and heterogeneous. If all attributes are assigned the same weight without considering their importance to the entity, noise is introduced and entity alignment performance degrades. Moreover, most methods treat all modalities as equally important in the joint embedding process, yet weak modalities provide limited information and contribute less to entity alignment, decreasing overall alignment accuracy.
In summary, this paper proposes a multi-modal entity alignment method based on cross-modal enhancement to address the problems of missing auxiliary modalities, insufficient cross-modal effects, and weak modality influence. The main contributions of this paper are as follows: • To address the problem of missing modalities, this paper proposes adding a unique numerical modality to the existing auxiliary modalities, such as structure, relation, and attribute, to enrich the auxiliary modality information. We extracted numerical triplets from the original dataset and fed the numerical information into a radial basis function network. We then concatenated the resulting feature vectors with attribute embeddings and combined them with entity embeddings to form numerical embeddings. To ensure the accuracy of the numerical embeddings, we generated negative numerical triplets by swapping aligned entities in the given positive numerical triplets and used contrastive learning to improve the credibility of the embeddings. • To overcome the problem of insufficient cross-modal effects, this paper proposes a novel approach that uses pre-trained visual models to obtain visual features of entities and applies them to entity embeddings to enhance the representation of visual interaction relations. We also use the visual feature vectors together with attention mechanisms to allocate entity attribute weights, forming enhanced entity attribute features. Specifically, we first use existing visual models to extract the visual features of entities. These visual features are then concatenated with entity embeddings to form enhanced entity embeddings. Next, we use these enhanced entity embeddings to represent the visual interaction relations between entities, better utilizing visual information to infer relations between entities. Moreover, we apply attention mechanisms driven by the visual feature vectors to allocate entity attribute weights.
In this way, we can adjust the attribute weights based on the visual features of entities, thereby better utilizing attribute information to infer relations between entities. By adopting this approach, we can describe entity relations more comprehensively and accurately, enhancing the application value of knowledge graphs. • To address the problem of the excessive influence of weak modalities, this paper proposes a method of dynamically allocating modal weights. Specifically, we dynamically calculate the importance of each modality in the current alignment task using attention mechanisms and neural networks, thus avoiding the over-influence of weak modalities. In the modality weight calculation, we first represent each modality using embedding representations, then use a multi-layer perceptron to calculate the importance score of each modality, and finally use an attention mechanism to compute the weighted sum of modalities to obtain the weighted modality embedding representation. Through this method, we can better utilize multi-modal information to improve the accuracy and efficiency of alignment while avoiding the over-influence of weak modalities.

Multi-Modal Knowledge Graph
As the form and quantity of knowledge continue to increase, researchers have proposed large-scale multi-modal knowledge graphs one after another. For example, MMKG [16] used additional forms of knowledge (mainly images) to construct a multi-modal knowledge graph. Similarly, Richpedia [17] tried various methods in 2020 to enrich the knowledge graph. Its goal is to improve the information of the knowledge graph by adding sufficient and diverse images to the text entities. Some studies also aim to add audio and other forms of knowledge to the knowledge graph to ensure its diversity. For instance, Oramas [18] uses a knowledge graph to provide information for a hybrid recommendation engine, which incorporates audio into the multi-modal knowledge graph to apply multi-modal knowledge graph techniques for music or audio recommendations. The emergence of multi-modal knowledge graphs has led to ongoing discussions on embedding modal knowledge features in knowledge graphs.

Entity Alignment
Currently, research on entity alignment can be divided into traditional entity alignment methods and multi-modal entity alignment methods. Traditional methods can be seen as entity alignment based on single-modality information. For example, IPTransE [19] learns the representation of entity relations on each knowledge graph and then maps the two embedded entity sets into the same low-dimensional space for the entity alignment task. SEA [20] proposes a semi-supervised entity alignment method that aligns labeled entities with unlabeled ones in the knowledge graph and optimizes the knowledge graph embedding through adversarial training. AliNet [21] and GCN-Align [22] are GNN-based entity alignment methods [23][24][25] that discover the correlations between entities in the embedding space to perform entity alignment and combine structural and attribute embedding to improve alignment accuracy. These traditional entity alignment methods are relatively easy to understand and implement and have high accuracy and stability when using high-quality data. However, they generally require a large amount of data to train the model, and their effectiveness is limited when large-scale data are lacking. Additionally, even when the entity information in the graph is sufficient, traditional entity alignment methods ignore the complementarity between modalities, resulting in decreased alignment performance.
On the other hand, multi-modal entity alignment methods use various modal information (such as text, images, and audio) to perform entity alignment and compensate for the limitations of single-modality methods [26,27]. For example, MMEA proposes a new entity alignment framework that uses multi-modal data to connect the semantic relations between two entities and uses image information to supplement text information, improving the accuracy and robustness of entity alignment. EVA uses the visual similarity of entities to create a seed dictionary and provides an unsupervised solution through this dictionary [28,29], but it fails to fully utilize the uniqueness of the visual information. HMEA [30] improves entity alignment performance by embedding structural, image, and other features into hyperbolic space. MultiJAF [31] uses entity structure, attributes, and visual information to form a feature matrix, combined with the similarity matrix of entity values, to perform entity alignment and further improve the handling of multi-modal data. MSNEA and MCLEA [14] also use attribute and visual knowledge and explore the relations between modalities to reduce data noise, proposing different fusion schemes to improve entity alignment accuracy. However, these methods generally underutilize modal information and have certain defects in their choice of fusion schemes, leaving significant room for improvement in overall performance. This paper proposes a multi-modal joint entity alignment framework to effectively combine different modal features and perform reasonable feature fusion, thereby improving the final alignment performance.

Methodology
In this section, this paper first introduces a definition of the problem and then provides a detailed description of MEAIE.

Notation and Problem Definition
This section introduces the symbols used in this paper and defines the multi-modal entity alignment task. A multi-modal knowledge graph can be denoted as $G = (E, R, A, N, I, T_R, T_A, T_N, T_I)$, where $E$, $R$, $A$, $N$, and $I$ represent the sets of entities, relations, attributes, numerical values, and images, respectively, and $T_R$, $T_A$, $T_N$, and $T_I$ represent the sets of relation triples, attribute triples, numerical value triples, and entity-image pairs, respectively. The multi-modal entity alignment task is to find the matching entity pairs $L = \{(e_M, e_N) \mid e_M \in E_M, e_N \in E_N\}$ that describe the same concept in the real world from two relatively independent knowledge graphs $G_M$ and $G_N$, thereby aligning the two multi-modal knowledge graphs.
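As a concrete illustration, the problem setup can be sketched with plain Python data structures. The entity names follow the Johnson County example used later in this paper; the field names and toy triples are our own illustrative assumptions, not the paper's code.

```python
# Two multi-modal knowledge graphs G_M and G_N, each holding sets of entities
# and typed triples, plus the set L of aligned entity pairs to be discovered.
kg_m = {
    "entities": {"DB:Johnson_County,_Iowa", "DB:Iowa"},
    "relation_triples": {("DB:Johnson_County,_Iowa", "locatedIn", "DB:Iowa")},
    "attribute_triples": {("DB:Johnson_County,_Iowa", "name", "Johnson County")},
    "numerical_triples": {("DB:Johnson_County,_Iowa", "areaLand", 1590252699.75)},
}
kg_n = {
    "entities": {"FB:Johnson_County,_Iowa", "FB:Iowa_City,_Iowa"},
    "relation_triples": set(),
    "attribute_triples": set(),
    "numerical_triples": {("FB:Johnson_County,_Iowa", "areaLand", 1590252699.75)},
}

# The alignment task: find pairs (e_M, e_N) describing the same real-world concept.
aligned_pairs = {("DB:Johnson_County,_Iowa", "FB:Johnson_County,_Iowa")}

# Every aligned pair must draw one entity from each graph.
assert all(em in kg_m["entities"] and en in kg_n["entities"]
           for em, en in aligned_pairs)
```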

Framework Overview
In this section, this paper proposes the MEAIE model for multi-modal entity alignment, as shown in Figure 1.
The proposed model, MEAIE, for multi-modal entity alignment consists of two main modules. First is the multi-modal knowledge embedding module, which includes numerical information in addition to the existing structures, attributes, and images. This paper uses a graph attention network encoder to encode structural information and generate structural embeddings. Simultaneously, entity images are fed into a pre-trained visual model to obtain visual features. For relation embedding, enhanced representations of the head and tail entities are used to obtain relation embeddings. For attribute information, attributes are encoded into a multi-hot vector to generate attribute embeddings, while the obtained visual features influence the attribute weight allocation. The numerical embedding component extracts numerical information from the entities and obtains numerical embeddings through a high-dimensional space mapping operation. Finally, high-confidence modality embeddings are obtained by continuously comparing negative and positive sample sets through contrastive learning. The second module, the multi-modal knowledge fusion module, employs a novel method of multi-modal knowledge fusion. This method utilizes contrastive learning to minimize the distance between cross-modal knowledge in the shared space, while attention layers dynamically allocate weights to each modality, forming a holistic embedding representation. This method can improve the accuracy and efficiency of entity alignment, making the fusion of multi-modal knowledge more effective.

Structure Embedding
Due to the similarity of the structures of aligned entities in multi-modal knowledge graphs, graph structure information is utilized for entity alignment tasks. This paper uses graph attention networks [13,15] to model the structural information of $G_M$ and $G_N$ directly. Entity $e_i$ aggregates the hidden states of its neighbors $N_i$ (including a self-loop) as

$h_i = \sigma\Big(\sum_{j \in N_i} a_{ij} W h_j\Big) \quad (1)$

where $h_j$ is the hidden state of entity $e_j$, $\sigma(\cdot)$ denotes the ReLU non-linearity, and $a_{ij}$ represents the importance of entity $e_j$ to $e_i$, calculated through self-attention:

$a_{ij} = \dfrac{\exp\big(\eta\big(a^{\top}[W h_i \oplus W h_j]\big)\big)}{\sum_{k \in N_i} \exp\big(\eta\big(a^{\top}[W h_i \oplus W h_k]\big)\big)} \quad (2)$

where $W \in \mathbb{R}^{d \times d}$ is the weight matrix, $a$ is a learnable parameter vector, $\oplus$ represents the concatenation operation, and $\eta$ represents the ReLU non-linearity. This paper applies Equation (1) independently to each of the $K$ attention heads in parallel and then concatenates the resulting features to obtain the structure embedding $e_i^g$ for entity $e_i$:

$e_i^g = \big\Vert_{k=1}^{K}\, \sigma\Big(\sum_{j \in N_i} a_{ij}^k W^k h_j\Big) \quad (3)$

where $a_{ij}^k$ is the normalized attention coefficient obtained from the $k$-th attention head and $\Vert$ denotes the concatenation (splicing) operation.
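The multi-head graph attention aggregation described above can be sketched in numpy as follows. This is a minimal illustration with a dense loop rather than an efficient implementation; the graph, dimensions, and parameter initializations are illustrative assumptions, and the eta non-linearity follows the ReLU stated in the text.

```python
import numpy as np

def gat_layer(H, adj, W, a):
    """Sketch of multi-head graph attention: W has shape (K, d, d), a has
    shape (K, 2*d); the K head outputs are concatenated to form e_i^g."""
    n, d = H.shape
    K = W.shape[0]
    heads = []
    for k in range(K):
        Wh = H @ W[k]                                   # projected hidden states
        logits = np.full((n, n), -np.inf)
        for i in range(n):
            for j in range(n):
                if adj[i, j]:
                    # eta(a^T [W h_i (+) W h_j]) with eta = ReLU
                    logits[i, j] = max(a[k] @ np.concatenate([Wh[i], Wh[j]]), 0.0)
        att = np.exp(logits - logits.max(axis=1, keepdims=True))
        att = att / att.sum(axis=1, keepdims=True)      # softmax over neighbours
        heads.append(np.maximum(att @ Wh, 0.0))         # sigma = ReLU aggregation
    return np.concatenate(heads, axis=1)

rng = np.random.default_rng(0)
n, d, K = 4, 3, 2
H = rng.normal(size=(n, d))
adj = np.ones((n, n))          # fully connected toy graph with self-loops
W = rng.normal(size=(K, d, d))
a = rng.normal(size=(K, 2 * d))
e_g = gat_layer(H, adj, W, a)  # structure embeddings, shape (n, K*d)
```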

Visual Embedding
The visual features of the multi-modal knowledge graph carry intuitive, visualized knowledge that can help the model better perform entity alignment tasks. Since convolutional neural networks perform well in image recognition and classification, they can effectively extract semantic information from images. Each entity image is fed into a pre-trained deep convolutional neural network for feature extraction, with the last fully connected layer and softmax layer removed, to obtain the entity's image embedding $e_i^v$:

$e_i^v = W_v \cdot PVM(img_i) + b_i \quad (4)$

where $e_i^v$ represents the visual feature of entity $e_i$, $img_i$ denotes its image, $W_v$ and $b_i$ represent a trainable matrix and bias term, and $PVM(\cdot)$ represents the pre-trained visual model.
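A minimal sketch of this projection step follows. The pre-trained visual model is replaced by a stub (in the paper it is a deep CNN with its classifier head removed); the flattening stub and all dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def pvm(image):
    """Stand-in for the pre-trained visual model PVM(.). A real implementation
    would run a CNN with the final fully connected and softmax layers removed;
    here we simply flatten the tensor to a fixed-length feature vector."""
    return image.reshape(-1)[:512]

image = rng.normal(size=(8, 8, 8))         # toy "image" tensor (512 values)
feat = pvm(image)                          # CNN feature vector, 512-d here
W_v = rng.normal(size=(feat.size, 100)) * 0.01
b_v = np.zeros(100)
e_v = feat @ W_v + b_v                     # e_i^v = W_v * PVM(img_i) + b_i
```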

Attribute Embedding
In our work, attribute feature embedding is essential because attribute knowledge provides the names and values of an entity's attributes. First, this paper extracts all attributes in the knowledge graph into a separate data file and then performs two sets of work. On the one hand, entity attributes are treated analogously to the entity structure representation: the attribute values are ignored and each attribute contained in the entity is extracted. When aligning entities from two different knowledge graphs, the two entities to be aligned may have similar attribute structures, so this work models the attribute structure for representation. On the other hand, entity attributes are represented as a multi-hot vector, and the entity's attributes are separately encoded [22], for example, $e_i^{av} = [a_1 : v_1, \ldots, a_i : v_i, \ldots, a_j : v_j]$, where $e_i^{av}$ represents the attribute features of entity $e_i$, including attributes $a_i$ and values $v_i$. Subsequent entity attribute embedding generates attribute and value embeddings, adds a linear layer to average their embeddings, and maps them to a low-dimensional space:

$e_i^{av} = W_a\big[\overline{A} \oplus \overline{V}\big] + b_a \quad (5)$

where $e_i^{av}$ represents the attribute embedding of entity $e_i$, $W_a$ represents the trainable weight matrix, $\overline{A}$ and $\overline{V}$ represent the averaged embeddings of the attributes and values of entity $e_i$, respectively, and $b_a$ represents the bias term.
To improve the inter-modal effects, the obtained visual features guide the weight allocation of entity attributes. Since entity attributes are usually sparse and heterogeneous, introducing attribute knowledge into the entity alignment task while treating weak attributes as equally influential as vital attributes can contaminate the entity representation; it is therefore unreasonable to assign the same weight to all attributes. Using the visual representation as a guiding feature to allocate weights to attributes, the entity attribute feature is represented as the sum of all weighted attribute embeddings corresponding to the entity:

$e_i^a = \sum_{j} w_j\, e_j^{av} \quad (6)$

where $w_j$ represents the attention weight assigned to the $j$-th attribute embedding and $e_i^a$ represents the enhanced attribute feature embedding of entity $e_i$.
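The visually guided attribute weighting can be sketched as follows. The bilinear scoring function is our assumption for the attention, whose exact form is not specified in the text, and all names and dimensions are illustrative.

```python
import numpy as np

def visual_attribute_attention(e_v, attr_embs, W):
    """Score each attribute embedding against the visual feature e_v through a
    bilinear form (our assumption), then softmax the scores into weights w_j
    and return the weighted sum e_i^a = sum_j w_j * attr_embs[j]."""
    scores = attr_embs @ W @ e_v            # one scalar score per attribute
    w = np.exp(scores - scores.max())
    w = w / w.sum()                         # attention weights, sum to 1
    return w, w @ attr_embs

rng = np.random.default_rng(2)
d_v, d_a, n_attr = 6, 4, 5
e_v = rng.normal(size=d_v)                  # visual feature of the entity
attr_embs = rng.normal(size=(n_attr, d_a))  # one embedding per attribute
W = rng.normal(size=(d_a, d_v))             # bilinear attention parameters
w, e_a = visual_attribute_attention(e_v, attr_embs, W)
```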

Relation Embedding
As an essential component of the multi-modal knowledge graph, relations are crucial in multi-modal entity alignment tasks. Two entities that exhibit similar relations to other entities are likely to be similar. Since the structural embedding already uses a graph attention network to form the graph embedding, for simplicity and consistency the modeling of relation triples takes a translational view: the embedding of the tail entity should be close to the embedding of the head entity plus the embedding of the relation. Additionally, to increase cross-modal effects, entity features are enhanced with visual features to improve relation learning:

$\hat{e}_x = W_i\big[e_x \oplus e_x^v\big] + b_x, \quad x \in \{h, t\} \quad (7)$

where $e_x$ ($x \in \{h, t\}$) represents the head and tail entity vectors, $W_i$ and $b_x$ represent the weight matrix and bias terms, and $e_x^v$ represents the image feature. The visual information is thus fused with the semantic information to enhance the semantic representation, and the corresponding loss is built on the translational score

$f(h, r, t) = \big\|\hat{e}_h + e_r - \hat{e}_t\big\|_2^2 \quad (8)$

where $e_r$ is the relational feature and the final relation embedding is denoted $e^r$. By extracting relation triples and continually forming positive and negative samples for contrastive learning, minimizing the score on positive triples and enlarging it on negatives, the entity relation representation is enhanced, forming the relation feature representations.
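A hedged numpy sketch of this translational relation modeling with visually enhanced entities follows. The margin-based form of the contrastive objective, the margin value, and all shapes are illustrative assumptions.

```python
import numpy as np

def enhance(e_x, e_v_x, W, b):
    """Visually enhanced entity vector: concatenate the entity and image
    features, then project (a sketch of e_x <- W [e_x || e_x^v] + b_x)."""
    return np.concatenate([e_x, e_v_x]) @ W + b

def transe_margin_loss(e_h, e_r, e_t, e_t_neg, margin=1.0):
    """Margin loss pushing ||e_h + e_r - e_t|| below the negative score."""
    pos = np.linalg.norm(e_h + e_r - e_t)
    neg = np.linalg.norm(e_h + e_r - e_t_neg)
    return max(0.0, margin + pos - neg)

rng = np.random.default_rng(3)
d, dv = 8, 4
W = rng.normal(size=(d + dv, d)) * 0.1
b = np.zeros(d)
e_h = enhance(rng.normal(size=d), rng.normal(size=dv), W, b)
e_t = enhance(rng.normal(size=d), rng.normal(size=dv), W, b)
e_t_neg = enhance(rng.normal(size=d), rng.normal(size=dv), W, b)
e_r = rng.normal(size=d)
loss = transe_margin_loss(e_h, e_r, e_t, e_t_neg)
```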

Numerical Embedding
This paper extracts numerical attribute triplets separately to form numerical knowledge embeddings. Numerical features can supplement cases where entities across knowledge graphs cannot otherwise be matched to their equivalent entities. For example, for the entity DB:Johnson_County,_Iowa in KG1, the goal is to find the equivalent entity FB:Johnson_County,_Iowa in KG2. This paper first performs embedding based on structure, attribute, and image knowledge, forms a joint embedding from the feature embeddings of each kind of knowledge, and calculates the similarity scores over all candidate entities for this entity. It then finds two candidate entities in KG2 with very close similarity scores, FB:Iowa_City,_Iowa and FB:Johnson_County,_Iowa, with scores of 0.695 and 0.689, respectively. This means the specified entity of KG1 does not preferentially match its consistent equivalent entity in KG2. However, if the numerical modality is added, the numerical information provided by the numerical triples (populationDensity, 82.2397597695) and (areaLand, 1590252699.75) can help quickly and correctly match to FB:Johnson_County,_Iowa. Therefore, the numerical modality, as a powerful auxiliary modality, can help identify equivalent entities accurately.
For numerical feature processing, since numerical information is always sparse in knowledge graphs, a radial basis function is used to process the numerical information of entities. A radial basis function neural network can approximate any nonlinear function, can handle regularities that are difficult to analyze in the system, has good generalization ability, and can convert numerical information into embeddings in a high-dimensional space:

$\phi\big(n_{(e_g, a_i)}\big) = \exp\!\Big(-\dfrac{\big(n_{(e_g, a_i)} - c_i\big)^2}{2\sigma_i^2}\Big) \quad (10)$

where $n_{(e_g, a_i)}$ denotes the numerical value corresponding to the numerical triple, $a_i$ denotes the attribute key, $c_i$ represents the center of the radial kernel, and $\sigma_i^2$ represents the variance. First, the numerical values of each entity's numerical triplets are normalized, and then training is conducted in the radial basis function neural network. After training, this paper extracts the embedding of the attribute key of the numerical triplet and concatenates it with the numerical vector obtained from the radial basis function network. The credibility of the numerical embedding is measured by the scoring function defined in Formula (11):

$f_{num}(e_g, a, n) = -\big\|e_g - \tanh\big(\mathrm{vec}(\mathrm{CNN}(e_{an}))W\big)\big\|_2^2 \quad (11)$

where $e_{an}$ denotes the embedding of the entity attribute key combined with the numerical embedding generated by the corresponding radial basis network, CNN denotes the convolutional layer, and $W$ indicates the fully connected layer. The features are then mapped as a vector into the embedding space, denoted $e_n$. The loss function maximizes this credibility score over the numerical dataset:

$L_{num} = \sum_{(e_g, a, n) \in Z} -f_{num}(e_g, a, n) \quad (12)$

where Z represents the set of numerical triples in the numerical dataset. Since the aligned entities in the relevant numerical triplets represent the same objects in the real world, they have the same numerical features. This property is leveraged to promote the representation of numerical information during contrastive learning training.
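The radial basis encoding of a single normalized numerical value can be sketched as follows. The number of kernels, their centers, and their widths are illustrative assumptions; in the paper these are parameters of a trained radial basis function network.

```python
import numpy as np

def rbf_encode(n_val, centers, sigmas):
    """Radial basis encoding of one normalized numerical value:
    phi_i(n) = exp(-(n - c_i)^2 / (2 * sigma_i^2)), one output dimension per
    kernel, turning a sparse scalar into a dense high-dimensional feature."""
    return np.exp(-((n_val - centers) ** 2) / (2 * sigmas ** 2))

centers = np.linspace(0.0, 1.0, 8)        # kernel centers c_i over [0, 1]
sigmas = np.full(8, 0.15)                 # kernel widths sigma_i
phi = rbf_encode(0.4, centers, sigmas)    # encode a normalized value
```

The resulting vector peaks at the kernel nearest the input value, so nearby numerical values receive similar encodings, which is what lets numerically close attributes support alignment.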

Feature Processing Fusion Module
Two aspects of work have been done for the feature processing and fusion module. On one hand, contrastive learning is applied to the intra-modality representations to enhance the feature representation within each modality and learn the intra-modality dynamics, providing discriminative boundaries for each modality in the embedding space. However, the modality embeddings obtained from the intra-modality contrastive loss alone are not sufficiently consistent, making it difficult to model the interactions between modalities during feature fusion. Therefore, the knowledge of the joint embedding is reinserted into the single-modality embeddings so that each modality can better utilize the embedding information of the other modalities. A multi-modal contrastive learning module is set up in the feature processing fusion module, introducing a contrastive loss: positive and negative samples are constructed for each modality, and the loss encodes similar representations for positive entity pairs and dissimilar representations for negative entity pairs,

$L_{cl}^m = \frac{1}{N}\sum_{i=1}^{N}\Big[Y_i\,(1 - d_i)^2 + (1 - Y_i)\,\max(d_i - \delta_{cl},\, 0)^2\Big] \quad (13)$

where $Y$ represents the label of an entity pair, $d$ denotes the cosine similarity of the entity embeddings, $N$ indicates the number of batch samples, $\delta_{cl}$ represents the margin hyper-parameter, $e_x \in E_M$ and $e_{x'} \in E_N$ represent the corresponding entities in the two knowledge graphs $G_M$ and $G_N$, and $m \in \{g, r, a, v\}$ indexes structure, relation, attribute, and image, respectively. The overall loss is defined as

$L_{cl} = \sum_{m \in \{g, r, a, v\}} L_{cl}^m \quad (14)$

On the other hand, after the above operations are completed, feature embedding fusion is needed for each modality. Previous models concatenated the feature embeddings of each modality, giving every modality the same weight in the joint embedding, which may lead to poor entity alignment results due to the excessive influence of weak modalities [13].
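A margin-based contrastive loss consistent with the variables described above (pair labels Y, cosine similarities d, batch size N, margin delta_cl) can be sketched as follows; the exact functional form is our assumption.

```python
import numpy as np

def contrastive_loss(d, y, margin=0.5):
    """Margin-based contrastive loss sketch over cosine similarities d and pair
    labels y (1 = aligned, 0 = not aligned): aligned pairs are pulled toward
    similarity 1, unaligned pairs are pushed below the margin delta_cl."""
    pos = y * (1.0 - d) ** 2
    neg = (1.0 - y) * np.maximum(d - margin, 0.0) ** 2
    return (pos + neg).mean()

d = np.array([0.9, 0.2, 0.8, 0.1])   # cosine similarities of entity pairs
y = np.array([1.0, 0.0, 0.0, 1.0])   # labels: aligned or not
loss = contrastive_loss(d, y)        # = (0.01 + 0 + 0.09 + 0.81) / 4
```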
To address this issue, this paper adds self-attention layers to dynamically allocate the weights of each modality during joint embedding, thereby avoiding the overwhelming influence of weak modalities. First, this paper generates the overall representation:

$e_{all} = e_g \oplus e_r \oplus e_a \oplus e_n \oplus e_v \quad (15)$

where $e_{all}$ denotes the overall representation; $e_g$, $e_r$, $e_a$, $e_n$, and $e_v$ denote the structure, relation, attribute, numerical, and visual representations, respectively; and $\oplus$ denotes the concatenation (splicing) operation.
After generating the overall embedding, the joint embedding is fed into the transformer module, and each attention head operates according to

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\dfrac{QK^{\top}}{\sqrt{d_k}}\Big)V, \qquad Q = e_{all}W^Q,\; K = e_{all}W^K,\; V = e_{all}W^V \quad (16)$

where $W^Q$, $W^K$, and $W^V$ denote the respective weight matrices. The final joint embedding $e$ is obtained after the dynamic weight update and combination across the attention heads.
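The scaled dot-product self-attention over the stacked modality embeddings can be sketched as follows. Stacking one row per modality (rather than one long concatenated vector) is our illustrative arrangement, and all dimensions are assumptions.

```python
import numpy as np

def self_attention(E, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the modality rows:
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    att = np.exp(scores - scores.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # each row sums to 1
    return att, att @ V

rng = np.random.default_rng(4)
m, d, dk = 5, 16, 8                     # 5 modalities: g, r, a, n, v
E = rng.normal(size=(m, d))             # one row per modality embedding
Wq, Wk, Wv = (rng.normal(size=(d, dk)) for _ in range(3))
att, fused = self_attention(E, Wq, Wk, Wv)
```

The attention matrix `att` plays the role of the dynamically allocated modality weights: a weak modality receives small attention weights rather than a fixed equal share.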

Experimental Settings
Datasets. Experiments in this paper utilize three public knowledge graph datasets: FB15K, DB15K, and YG15K. To ensure the effectiveness of the entity alignment task, the preparation stage of the experiment combines these three public datasets pairwise to form a diverse set of example datasets. These example datasets aim to cover various attributes, relations, and image information to provide sufficient diversity. They are used to measure the effectiveness of entity alignment, and their statistics are shown in Table 1. Evaluation Metrics. This paper evaluates all models using cosine similarity to compute the similarity between two entities, with Hits@n, MRR, and MR as evaluation metrics. Hits@n represents the proportion of correct entities ranked in the top n by cosine similarity, MR is the average rank of the correct entities, and MRR is the average reciprocal rank of the correct entities. Formulas for the three metrics are shown in Equations (20)-(22):

$\mathrm{Hits@}n = \dfrac{1}{|S|}\sum_{i=1}^{|S|} I(\mathrm{rank}_i \le n) \quad (20)$

$\mathrm{MR} = \dfrac{1}{|S|}\sum_{i=1}^{|S|} \mathrm{rank}_i \quad (21)$

$\mathrm{MRR} = \dfrac{1}{|S|}\sum_{i=1}^{|S|} \dfrac{1}{\mathrm{rank}_i} \quad (22)$

where $S$ denotes the set of triples, $I(\cdot)$ denotes the indicator function (with value 1 if the condition is true and 0 otherwise), and $\mathrm{rank}_i$ is the link prediction ranking of the $i$-th triple.
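The three metrics are straightforward to compute from the rank of each correct entity; the toy rank list below is illustrative.

```python
def hits_at_n(ranks, n):
    """Hits@n: fraction of test pairs whose correct entity ranks in the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

def mean_rank(ranks):
    """MR: average rank of the correct entity (lower is better)."""
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR: average reciprocal rank of the correct entity (higher is better)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 3, 2, 10]                  # toy ranks of the correct entities
h1 = hits_at_n(ranks, 1)               # -> 0.25
mr = mean_rank(ranks)                  # -> 4.0
mrr = mean_reciprocal_rank(ranks)      # (1 + 1/3 + 1/2 + 1/10) / 4
```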
Higher values of Hits@n and MRR indicate better entity alignment performance, while a lower value of MR indicates the same. Implementation Details. The experiments began with data preprocessing. We normalized the image data in the dataset using Z-score normalization: the mean and standard deviation of each pixel position are computed, and the values are transformed to a distribution with mean 0 and standard deviation 1, which makes the pixel values of the images more comparable and improves training stability and convergence. In addition, the numerical information in the dataset was normalized so that the values fall within [0, 1], and duplicate and missing records were screened out and removed to ensure the accuracy of the experiments.
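The per-pixel Z-score step described above can be sketched as follows; the batch size and image shape are illustrative assumptions.

```python
import numpy as np

def zscore_per_pixel(images):
    """Z-score normalization as described for preprocessing: compute the mean
    and standard deviation of each pixel position across the image set and
    transform to mean 0 and standard deviation 1 per pixel."""
    mu = images.mean(axis=0)
    sigma = images.std(axis=0)
    return (images - mu) / sigma

rng = np.random.default_rng(5)
imgs = rng.uniform(0, 255, size=(32, 8, 8))   # toy batch of 32 8x8 images
norm = zscore_per_pixel(imgs)
```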
This paper conducted all experiments on the two datasets with relevant parameter settings. First, this paper initialized the knowledge embeddings in the knowledge graph to limit the scope of subsequent operations. This paper set the embedding size for all models to 100 and used a mini-batch method with a batch size of 512. For each experiment, this paper trained the model for 1000 epochs and set the corresponding learning rates for learning. Additional experimental model parameters are shown in Table 2.

Existing Methods
To validate the effectiveness and advancement of our method, this paper needs to compare it with state-of-the-art entity alignment methods, which can be classified into traditional entity alignment methods and multi-modal entity alignment methods.
Traditional entity alignment methods include: • MTransE: Embeds different knowledge graphs into separate embedding spaces and learns a transformation between them for aligning entities.
• GCN-Align: Performs entity alignment by combining structure and entity attribute information through graph convolutional neural networks. • SEA [20]: Proposes a semi-supervised entity alignment method that aligns labeled entities and rich unlabeled entity information and improves knowledge graph embedding through adversarial training.
Multi-modal entity alignment methods include: • MMEA: Generates entity representations of relation knowledge, visual knowledge, and numerical knowledge and then maps the multi-modal knowledge embeddings from their respective embedding spaces to a common area for entity alignment. • EVA: Proposes the importance of visual knowledge and combines it with multi-modal information to form a joint embedding for entity alignment. • MultiJAF [31]: Introduces a separate numerical processing module and predicts entity similarity based on the similarity matrix formed by the numerical module, combined with knowledge embedding fused with structural attributes and visual knowledge. • MSNEA: Considers the importance of visual knowledge and uses it to influence the embeddings of other modalities and proposes a contrastive learning optimization model to improve the alignment effect. • MCLEA: Introduces separate encoders for each modality to form knowledge embeddings and proposes a contrastive learning scheme to establish interactions within and between modalities to improve entity alignment.
This paper obtained the results of these baselines by running their publicly available GitHub code under default configurations.

Overall Results
Our MEAIE was compared with several state-of-the-art entity alignment methods to demonstrate the proposed model's effectiveness and superiority. Tables 3 and 4 show the performance of all methods trained with 20% alignment seeds on the combined datasets FB15K-DB15K and FB15K-YG15K. Table 3 shows that MEAIE achieves remarkable results on the entity alignment task by enhancing entity representations through cross-modal effects and adding dynamic modal weights: it leads on all evaluation metrics except MR, which is excluded from the comparison. MR only considers the average rank of the matching entity and does not evaluate how accurately the model places correct matches at the top, so a model can obtain a seemingly acceptable MR while still ranking many correct counterparts poorly; its actual matching performance would then not be good. In contrast, MRR focuses on how highly the correct matches are ranked and therefore reflects the model's actual performance more accurately. MEAIE achieves good results on the FB15K-DB15K dataset. Compared with traditional entity alignment methods, MEAIE outperforms the state-of-the-art method SEA by 50%, 45%, 43%, and 49% on Hit@1, Hit@5, Hit@10, and MRR, respectively, demonstrating the significant improvement of multi-modal entity alignment over traditional entity alignment: exploiting auxiliary modalities in multi-modal knowledge graphs enhances alignment performance, validating the importance of developing auxiliary modalities for this task. Compared with other multi-modal entity alignment methods, such as EVA, MSNEA, and MCLEA, the proposed MEAIE model performs best.
When provided with 20% of the training seeds, MEAIE outperforms the strongest baseline methods MCLEA and MSNEA by at least 1.5% on Hit@1, at least 1.6% on Hit@5, at least 2.9% on Hit@10, and at least 3.2% on MRR, validating the novelty and effectiveness of the proposed MEAIE model. Among the baselines, MMEA and MultiJAF also process the numerical modality; however, both ignore the cross-modal effects and the impact of weak modalities, which this paper addresses. The final experimental results show an improvement over them of at least 14% in Hit@1, 15% in Hit@5, and 17% in Hit@10, as well as at least 6% in MRR, demonstrating the necessity of the cross-modal enhancement mechanism and the added attention layers, and the rationality of the selected modal knowledge and fusion methods. However, during the experiments it was discovered that some entity images are missing from the knowledge graphs, so those entities lack visual knowledge, and the absence of visual features affects final alignment performance. This paper replaces the missing visual features with zero vectors; since a zero vector can neither enhance entity relation representations nor correctly assign attribute weights, this strategy yields only a slight improvement in the experimental results.
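The zero-vector fallback for entities without images can be sketched as follows; the entity IDs, feature dimension, and feature values are hypothetical, and a real image encoder would produce far higher-dimensional features:

```python
import numpy as np

VISUAL_DIM = 4  # toy dimension; real image encoders output e.g. 2048-d

# Hypothetical visual features per entity; None marks a missing image.
image_features = {
    "e1": np.array([0.2, 0.1, 0.5, 0.7]),
    "e2": None,  # entity with no image in the knowledge graph
}

def visual_embedding(entity_id):
    """Return the entity's visual feature, falling back to a zero
    vector when no image exists. The zero vector carries no signal,
    so it cannot enhance relations or reweight attributes."""
    feat = image_features.get(entity_id)
    return feat if feat is not None else np.zeros(VISUAL_DIM)

print(visual_embedding("e2"))  # → [0. 0. 0. 0.]
```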
Table 4 shows that the proposed MEAIE also achieves good experimental results on the FB15K-YG15K dataset, with Hit@1, Hit@5, Hit@10, and MRR of 46%, 63%, 69%, and 0.534, respectively. Compared with FB15K-DB15K, the entity alignment performance of all methods on FB15K-YG15K is generally lower, owing to the structural heterogeneity of the two source graphs, among other factors. Nevertheless, MEAIE still achieves state-of-the-art performance by a significant margin, demonstrating good generalization and robustness in handling heterogeneous data for multi-modal knowledge graph entity alignment. Additionally, EVA's performance declines significantly on FB15K-YG15K, because its multi-modal fusion approach transfers poorly to this dataset. The MEAIE model, in contrast, fuses modal knowledge effectively through contrastive learning and the added attention layers, improving alignment performance.

Ablation Study
To investigate the impact of each component of the proposed MEAIE model on entity alignment, this section designs two sets of ablation variants: (1) MEAIE with one modality removed at a time (relation, attribute, visual, or numerical), denoted w/o R, w/o A, w/o V, and w/o N; (2) MEAIE without the attention mechanism, i.e., joint embeddings formed by simple concatenation without dynamic modal weights, denoted w/o DW. Figure 2 shows the experimental results. The first set of variants reveals that every modality contributes to entity alignment. Notably, visual knowledge has the greatest impact, as evidenced by the substantial decrease in Hit@1, Hit@10, and MRR when it is removed. This is because this paper leverages visual knowledge to enhance entity relations and allocate attribute weights, thereby introducing inter-modal effects, so its removal affects the model most, which is consistent with the model's design. As for the additional numerical modality introduced in this paper, Hit@1, Hit@10, and MRR decrease slightly when it is removed, further demonstrating the value of adding a numerical modality.
The second set of variants demonstrates that introducing an attention layer is beneficial for the entity alignment task. Its main purpose is to avoid excessive influence from weak modalities: strong modalities receive a higher weight and weak modalities a relatively smaller share, further improving the effectiveness of entity alignment after the joint embedding is formed. Similar effects were observed in the same ablation experiments on the FB15K-YG15K dataset, so they are not detailed here.
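The dynamic modal weighting evaluated above can be illustrated with a minimal sketch, assuming one learned score per modality whose softmax yields the weights; all embeddings and scores below are toy values, not the model's actual parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy single-modality embeddings for one entity (hypothetical values):
modal_embs = {
    "relation":  np.array([0.9, 0.1, 0.3]),
    "attribute": np.array([0.4, 0.8, 0.2]),
    "visual":    np.array([0.7, 0.6, 0.9]),
    "numerical": np.array([0.1, 0.1, 0.1]),  # a comparatively weak modality
}

# Fixed scores stand in for the attention layer's learned output,
# in the same order as modal_embs. Softmax turns them into dynamic
# weights, so a weak modality receives a smaller share.
scores = np.array([1.2, 0.8, 1.5, -0.5])
weights = softmax(scores)

# Each modality's embedding is scaled by its weight before being
# concatenated into the joint embedding.
joint = np.concatenate([w * e for w, e in zip(weights, modal_embs.values())])
```

With simple concatenation (the w/o DW variant) all four weights would implicitly be equal, letting the weak numerical modality dilute the joint embedding.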

Seed Sensitivity
To evaluate the sensitivity of the MEAIE model to pre-aligned entities, following existing research, this paper uses 20%, 50%, and 80% of the alignment seeds as training sets for the entity alignment task. Figure 3 displays the model's results for the different alignment seed proportions on the FB15K-DB15K dataset. The experimental results show that MEAIE achieves excellent results on almost all metrics and ratios. In the experimental preparation phase, seed sensitivity experiments were also conducted for the baseline multi-modal entity alignment methods. These experiments show that MMEA benefits relatively little from additional pre-aligned seeds, because its network structure is fairly simple and has limited fitting ability, resulting in weak dependence on pre-aligned entities. MEAIE shows a significant improvement in Hit@1, Hit@10, and MRR over the MCLEA model, and its alignment performance gradually improves as the training seed ratio increases. Furthermore, the figure shows that MSNEA achieves its most outstanding results when the seed ratio reaches 80%, with Hit@10 and MRR even higher than MEAIE, indicating that MSNEA reaches a high level only with a high proportion of seed pairs, whereas MEAIE performs well even with a limited number of pre-aligned entities.
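The 20%/50%/80% seed settings amount to a simple ratio-based split of the pre-aligned entity pairs; a minimal sketch, with hypothetical entity IDs:

```python
import random

def split_seeds(aligned_pairs, train_ratio, seed=42):
    """Shuffle the pre-aligned entity pairs and split them into
    train/test sets by ratio, mirroring the 20%/50%/80% settings."""
    pairs = list(aligned_pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed for reproducibility
    cut = int(len(pairs) * train_ratio)
    return pairs[:cut], pairs[cut:]

# Toy alignment seeds between two graphs (IDs are illustrative only).
pairs = [(f"fb:{i}", f"db:{i}") for i in range(100)]
train, test = split_seeds(pairs, 0.2)  # the 20% setting
```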

Conclusions
Our work proposes a new attention-based multi-modal entity alignment model. The model utilizes the information from each modality of a multi-modal knowledge graph and encodes each modality with a dedicated encoder to form a single-modality embedding. To exploit inter-modal effects, the model enhances entity relations with visual knowledge, guides the attention allocation of attributes, strengthens the features of each modality through contrastive learning, and finally forms a joint embedding by concatenating the embeddings of each modality. Self-attention layers dynamically assign attention weights to each modality in the joint embedding, avoiding excessive influence from weak modalities. The proposed model, called MEAIE, then uses the joint embedding to perform the entity alignment task, and experimental results demonstrate its effectiveness and superiority.
While this work provides valuable results, it solves only part of the problem. Specifically, when a dataset lacks sufficient visual information for its entities, the conclusions drawn here may be less accurate and reliable, since this work depends on a rich collection of entity images. Broader datasets and further experiments are therefore needed to verify the conclusions and ensure their effectiveness and reliability. In future work, we plan to further improve the model by analyzing the dataset to identify entities lacking visual information and selecting high-quality images for them, helping address the cross-modal problem when converting images into visual features. Regarding data processing, we performed data cleaning and normalization on the experimental datasets but did not explore whether different pre-processing techniques would produce different results; we will also improve on this aspect in the future.