Fusion-Optimized Multimodal Entity Alignment with Textual Descriptions

Wang, Chenchen; Chaomurilige,; Weng, Yu; Liu, Xuan; Liu, Zheng

doi:10.3390/info16070534

Open AccessArticle

Fusion-Optimized Multimodal Entity Alignment with Textual Descriptions

by

Chenchen Wang

^1,2

,

Chaomurilige

^1,3,*,

Yu Weng

^1,2,*,

Xuan Liu

^1,2 and

Zheng Liu

^1,3

¹

Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance, Ministry of Education, Minzu University of China, Beijing 100081, China

²

School of Information Engineering, Minzu University of China, Beijing 100081, China

³

School of Chinese Ethnic Minority Languages and Literatures, Minzu University of China, Beijing 100081, China

^*

Authors to whom correspondence should be addressed.

Information 2025, 16(7), 534; https://doi.org/10.3390/info16070534

Submission received: 19 May 2025 / Revised: 10 June 2025 / Accepted: 20 June 2025 / Published: 24 June 2025

(This article belongs to the Section Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

Multimodal knowledge graph entity alignment is a key basic task of knowledge fusion and integration, which is used to identify entities with semantic equivalent but different representation forms in different knowledge graphs. Previous entity alignment research has mostly focused on encoding and utilizing basic features such as entity names and attributes; however, it is difficult to comprehensively capture the rich semantic information of entities by solely relying on these basic features. To effectively overcome this limitation, this paper proposes a fusion-optimized multimodal entity alignment method, FMEA-TD. Compared with previous work, this method makes full use of the textual description information in the knowledge graph to provide rich supplements for entity features, thereby better capturing the entity semantics and solving the problems faced by relying solely on the entity’s own features. FMEA-TD is able to effectively fuse the entity’s own information and text description information through multimodal cooperation confidence, establish the interaction mechanism between them, and thus promote mutual collaboration between different modalities, which enhances the model’s ability to understand the semantic text. Experimentally validated, FMEA-TD outperforms current state-of-the-art baseline methods on public knowledge graph datasets.

Keywords:

entity alignment; textual descriptive; multimodal fusion

1. Introduction

With the ongoing development of knowledge graph research, an increasing number of multimodal knowledge graphs, such as DBpedia [1] and YAGO [2], have been constructed and applied. These graphs integrate information from diverse modalities and languages, thereby establishing a crucial foundation for cross-source information sharing and advanced applications including knowledge retrieval, question answering, and machine translation [3,4,5]. Recent advances in incorporating multimodal data (e.g., images and audio) into knowledge graphs have spurred innovative approaches [6,7,8] in cross-modal knowledge representation and reasoning [9,10], ultimately demonstrating that multimodal knowledge graphs can significantly enhance the performance of various applications.

Since real-world knowledge is complex and extensive, a single knowledge graph often has missing information, making it difficult to comprehensively cover all types of knowledge. Indeed, we take advantage of the complementary nature of knowledge graphs from different sources in terms of their content to align entities in different knowledge graphs, thus enriching the coverage of knowledge graphs and optimizing the quality of knowledge graphs.

In previous studies, numerous multimodal entity alignment [11,12,13] approaches have been proposed, leveraging information from different modalities to construct more comprehensive and enriched knowledge graphs. For instance, EVA [11] presents a fully unsupervised framework that integrates visual cues with auxiliary information, providing strong visual signals for cross-knowledge graph entity alignment. Similarly, MCLEA [12] employs intra-modal contrastive learning and inter-modal alignment learning to encode entity representations using multiple modalities, thereby improving alignment performance. However, existing multimodal entity alignment methods still exhibit certain limitations. These methods mainly focus on the interactive relationship between different modalities, such as the association between visual and textual information, without making full use of the detailed textual description information provided by the text. Such descriptions often contain rich contextual information, which is essential for accurately understanding entities, their relationships, and their attributes. To further illustrate this point, we propose a multimodal entity alignment example, as shown in Figure 1. This example demonstrates the process of entity alignment of knowledge graphs in different languages of multimodal information. In this example, the text and visual information commonly seen in traditional alignment methods is included, and textual description information is also introduced to supplement the fine-grained semantics in the knowledge graph, thereby further improving the accuracy of entity alignment.

In order to make full use of and integrate textual description information, we propose a Fusion-Optimized Multimodal Entity Alignment with Textual Descriptions (FMEA-TD). While prior methods such as MCLEA and MEAFE have explored multimodal entity alignment through contrastive learning and cross-modal representation techniques, they primarily rely on high-level features from modalities like images and names. These models often overlook the fine-grained semantic cues embedded in textual descriptions, which are crucial for disambiguating entities—especially in cases with sparse structural or visual information. Our proposed model, FMEA-TD, addresses this limitation by fully integrating textual descriptions into the alignment process through a novel confidence-based fusion mechanism and a joint optimization framework, making it a substantial advancement over existing models. Some entities in the knowledge graph may lack sufficient structured information (such as name, attributes) for entity alignment. By introducing description information, the model can cover more knowledge about these entities, thereby improving the comprehensiveness and completeness of alignment. Since the attention mechanism used in previous multimodal feature fusion approaches focuses more on the correlation between modalities (such as the degree of match between text and image semantics), it assigns weights by calculating the similarity between different modal features. For example, an erroneous description may accidentally match a certain image, causing attention to mistakenly assign it a high weight, which will introduce noise. In contrast, we propose a new multimodal fusion strategy based on the confidence mechanism that pays more attention to the reliability of the modality itself (i.e., the probability of availability of the current modal information), thereby ensuring that the fusion result relies on more reliable information. For example, if the image modality is clear and complete, its confidence can be close to 1, which dominates the modal fusion. Our contributions are as follows:

The FMEA-TD method focuses on leveraging the rich semantic information embedded in the textual descriptions of entities, including their attributes, relationships, and contexts. This textual information often provides valuable contextual support for other modalities, such as visual and speech data, helping the model to more accurately understand and represent the entity’s overall semantic characteristics.
For information fusion in the multimodal knowledge graph, we employ a confidence predictor to calculate the confidence level of each modality. The collaborative confidence level across modalities not only reflects the quality of information from each individual modality but also considers the influence of all modalities, dynamically adjusting their weights to enhance the model’s robustness and generalization capabilities.
FMEA-TD established a joint optimization of loss function (JOF), which improved the Inter-modal Alignment Loss (IAL). It not only considers the correlation between different modalities but also makes full use of the correlation between the rich semantic features contained in the entity textual description information and other modalities. By optimizing the loss function, FMEA-TD can effectively learn the deep association between the entity textual description information and other modal information, which helps the model to more accurately understand and represent the overall semantic features of the entity.

2. Problem Definition

In this section, we will detail our work on multimodal knowledge representation learning and multimodal entity alignment. We will give a comprehensive account of the proposed approach and its innovations from the perspective of problem definition and notation description. In the multimodal entity alignment work, entities refer to nodes of specific things (people, places, organizations, etc.) in the knowledge graph, and multimodal entity alignment refers to matching and associating the information of the same entity in different modalities (text, images, etc.) to establish the correspondence between them.

The mathematical form of the multimodal knowledge graph is denoted as

G = (E, N, A, R, V, S)

, where E stands for the set of entities in the knowledge graph, and

N, A, R, V, S

are the name, attribute, relationship, visual, and description sets of entities, respectively. The different multimodal knowledge graphs are defined as

G_{i} = (E_{i}, N_{i}, A_{i}, R_{i}, V_{i}, S_{i})

. Let us consider a cross-language knowledge graph as an example. If there exist multimodal knowledge graphs of different languages (Chinese and English, for example)

G_{z h}

and

G_{e n}

, We map the information of different modalities into the same semantic space by extracting the embedding, such that

I = \{(e_{z h}^{i}, e_{e n}^{j}) | e_{z h}^{i} \equiv e_{e n}^{j}, e_{z h}^{i} \in E_{z h}, e_{e n}^{j} \in E_{e n}\}

, at which point

e_{z h}^{i}

and

e_{e n}^{j}

are a pair of aligned entities.

\vec{E^{g}}, \vec{E^{n}}, \vec{E^{a}}, \vec{E^{r}}, \vec{E^{i}}, \vec{E^{s}}

in Figure 2 represent graph structure embeddings, embeddings of names, attributes, and relationships, visual embeddings, and textual description embeddings, respectively, which adjust the weights between modalities through dynamic confidence to generate a fusion embedding

\vec{E^{f}}

of information from multiple modalities.

3. Related Work

Research on entity alignment can be categorized into unsupervised entity alignment and supervised entity alignment.

Unsupervised entity alignment does not require manually labeled seeds; SEU [14] is pre-mapped into a unified semantic space through a machine translation system or cross-linguistic word embeddings, and its textual characterization of entities retains the basic graph convolution operation for feature propagation while abandoning complex neural networks, significantly improving efficiency and interpretability. PRASE [15] learns KG embeddings through a probabilistic reasoning system in a probabilistic system called PARIS via entity mappings to learn KG embeddings, and it feeds the generated entity mappings and embeddings back to PARIS for augmentation. EVA [11] combines visual information with other semantic information to provide a completely unsupervised solution by utilizing the visual similarity of entities to create an initial seed dictionary (visual pivot). UDCEA [16] utilizes a deep learning multilingual encoder combined with a machine translator to encode the knowledge graph text to reduce the dependence on labeled data. It treats the alignment task as a dichotomous matching problem, then it employs the re-exchange idea to complete the alignment.

Compared with unsupervised methods, supervised entity alignment methods can utilize labeled samples to train the model, which improves the accuracy and reliability of the alignment. EAST [17] proposes an iterative representation learning process that can increase the size of the training set and generate higher-quality structural representations using confident EA pairs selected from each round. Furthermore, it points out that structural and textual representations are essentially two complementary viewpoints of EA, suggesting that they should be integrated in order to convey a more comprehensive signal. MMEA [18] is a multimodal knowledge embedding method used to generate entity representations of relational, visual, and numerical knowledge, respectively. After obtaining these knowledge representations, the multiple representations of the different types of knowledge are integrated. BERT-INT [19] is an interaction model that utilizes only side information. It does not aggregate neighbors but instead computes interactions between neighbors so that fine-grained matching of neighbors can be captured. MCLEA [12] not only aligns entities within modalities but also takes into account inter-modal interactions by designing two different losses that together model both intra-modal and inter-modal relationships. While previous approaches have considered many different types of data such as graph neighborhoods, text, and images, they have rarely considered textual description information. However, textual description information can often provide rich contextual information and complement this multimodal information. The FMEA-TD approach proposed in this paper makes full use of this textual description information and effectively fuses these descriptions with the underlying data information, enabling the model to better characterize the textual semantic features of entities.

4. Methodology

We propose a multimodal entity alignment method with textual descriptions, namely Fusion-Optimized Multimodal Entity Alignment (FMEA-TD). This method achieves alignment and matching of the same entity across different information sources by rationally utilizing the entity’s multimodal information, including text, images, and descriptions, and by employing a reasonable multimodal feature fusion system.

To help readers better understand the overall workflow of FMEA-TD, we provide a schematic summary of the methodology (Figure 2). The model operates in three main stages: (a) multimodal features—including graph structures, raw textual information, textual descriptions, and images—are extracted from multimodal knowledge graphs and encoded into vector representations; (b) these embeddings are fused using a collaborative confidence mechanism that dynamically weighs each modality based on its reliability; and (c) the fused representation is optimized using a joint loss function composed of cross-modal alignment loss and similarity loss, which guides the model to align semantically equivalent entities across graphs.

4.1. Knowledge Embedding

Knowledge information from different modalities in multiple languages is mapped to a shared semantic space, including graph structure information, underlying textual information of the knowledge graph, textual description information, and visual features of entities. This heterogeneous information is fused [20] to better capture semantic associations among entities and improve the accuracy of cross-language and cross-modality entity alignment.

4.1.1. Graph Structure Embedding

Node embedding techniques based on Graph Attention Networks (GATs) [21,22] are applied to knowledge graphs to better capture the complex topology of relationships between entities. Since the importance of information embedded in the neighboring nodes of an entity varies, different weights must be assigned to them. The core idea of GAT is to use an attention mechanism to adaptively compute the importance weights between a node and its neighboring nodes, thereby enhancing the encoding of graph structural information in the node representation.

Specifically, GAT calculates an attention weight

a_{i j}

for each node i to indicate the importance of the neighboring node j to the central node i.

a_{i j} = s o f t m a x_{j} (e_{i j})

(1)

s o f t m a x_{j} (e_{i j}) = \frac{e x p (e_{i j})}{\sum_{k \in N_{i}} e x p (e_{i k})}

(2)

where

e_{i j} = a (W \vec{h_{i}}, W \vec{h_{j}})

.

\vec{h_{i}}

and

\vec{h_{j}}

are the feature vectors of the master node and the neighbor nodes, respectively, W is the attention weight parameter to be learned, and a is a scoring function based on the attention mechanism of a single-layer forward-propagation neural network. The attention score

e_{i j}

is normalized to obtain the final attention weight

a_{i j}

.

After obtaining all the required attentional weights

a_{i j}

, a weighted average is taken to obtain the embedding representation of the next layer under the center node i, where

N_{i}

denotes the set of neighboring nodes of node i, which includes all nodes directly connected to node i, and

σ

denotes a nonlinear activation function such as

R e L U

:

\vec{h_{i}^{'}} = σ (\sum_{j \in N_{i}} a_{i j} W \vec{h_{j}})

(3)

In order to obtain richer and more accurate neighborhood structure information, we use two-layer GAT (K = 2) to capture the multi-level topological information in the graph structure:

\vec{E_{i}^{g}} = \sum_{k = 1}^{K} σ (\sum_{j \in N_{i}} a_{i j}^{k} W \vec{h_{j}})

(4)

While the first layer of GAT aggregates the information of the direct neighbor nodes, the second layer of GAT aggregates the neighborhood features in the two-hop range to obtain a wider range of structural information. Ultimately, the output of the second layer is used as a structural embedding of the neighborhood of the node.

4.1.2. Text Embedding

In addition to graph structure information, text information is also an important source of information. This cross-language textual information includes entity names, attributes, and relationships, which can be used to enhance the representation of entities since vanilla GAT operates on unlabeled graphs and the knowledge graph contains a large amount of textual information. We use the pre-trained GloVe [23] model to obtain textual features by averaging strings over entity names and attributes, which can provide important semantic information about the entities, compensating for the lack of using only graph structure information.

The text features obtained through the GloVe pre-trained model are further fed into a simple feed-forward neural network layer to obtain the final text embedding representation:

\vec{E_{i}^{t}} = W_{t} \cdot u_{i}^{t} + b_{t}, t \in \{n, a, r\}

(5)

where

\vec{E_{i}^{t}}

denotes the embedding vector of entity i under modality t;

W_{t}

is a weight matrix for mapping the input vector

u_{i}^{t}

to the output vector

\vec{E_{i}^{t}}

; and

b_{t}

is a bias vector that provides an additional adjustment term for each modality.

4.1.3. Visual Embedding

The pre-trained model ResNet-152 [24] is used as a visual coder to extract the visual feature vectors of an image. We input image

v_{i}

corresponding to entity

e_{i}

into the pre-trained model. The output of the penultimate layer of the model is used as the feature vectors of that image, then these vectors are inputted into a feed-forward neural network that completes the mapping from the original image to the visual embedding.

\vec{E_{i}^{v}} = W_{v} \cdot P V M (v_{i}) + b_{v}

(6)

4.1.4. Textual Description Embedding

In order to extract rich semantic information from the text description of entities, we use the Sentence-BERT model [25] to generate sentence-level semantic embeddings. The original BERT model is mainly oriented to word-level tasks, and its output is not directly suitable for calculating the similarity between sentences. SBERT introduces a pooling operation based on the pre-trained BERT, and it fine-tunes it on the sentence comparison task so that it can directly generate embedding representations suitable for sentence-level semantic comparison. In the entity alignment task, we input the text descriptions of entities in different languages into the pre-trained SBERT model, obtain their corresponding semantic embedding representations, and calculate the similarity between the descriptions to assist in determining whether the entities refer to the same real-world object, as shown in Figure 3.

Before being encoded by the SBERT model, the textual descriptions are normalized using a preprocessing pipeline. This includes lowercasing all characters, removing special symbols, and collapsing excessive whitespace. These normalized sentences are then tokenized by the WordPiece tokenizer embedded within SBERT to ensure consistent subword representations across languages. SBERT applies mean pooling on the output of BERT [26]; that is, taking the average of the representations of all tokens in the last layer and aggregating multiple token representations of a sentence into a vector, represented by

\vec{E_{i}^{s}}

. It can capture sentence information well, not just the [CLS] token. The sentence vector generated in this way can more accurately capture the overall characteristics of the sentence, which is expressed in the semantic space as follows: similar sentences correspond to closer vectors, while the distance between unrelated sentences is farther. The BERT representations (u and v) of the two input sentences (Sentence A and Sentence B) and the absolute difference between them (

|u - v|

) are concatenated together as the input of the Softmax classifier. The Softmax classifier is the final output layer, which obtains the probability matching output of each entity, and it selects the candidate entity pair with the highest probability as the output. The output of the Softmax classifier is probability

q_{s} (u, v)

, which represents the probability of entity alignment in the current output:

q_{s} (u, v) = S o f t m a x (W \cdot (u, v, |u - v|) + b)

(7)

4.2. Feature Dynamic Fusion

In the past, multimodal feature fusion often used the attention mechanism to weigh and sum the different modal embeddings and automatically learn the importance of each modality in the entity representation to obtain effective fusion embeddings. However, the attention mechanism focuses on inter-modal correlation, and it assigns the weights by calculating the feature similarity, which is prone to introducing noise due to some accidental matches, thus reducing the accuracy of modal fusion. On the other hand, the confidence mechanism focuses on the reliability of the modalities themselves, assigning higher weights to modalities with high confidence to ensure that the results of modal fusion can depend on reliable information. Therefore, we propose a multimodal confidence collaborator method that can dynamically adjust the weights of feature fusion (see Figure 4). The method not only considers the confidence of the current modality but also comprehensively captures the relationship between entities and contextual information, thus enhancing the effectiveness of feature fusion.

4.2.1. Uni-Confidence

To perform the uni-confidence calculation, we use a confidence predictor to obtain the uni-confidence, where we use the probabilities of the true class labels instead of the weights in the attention mechanism given the original uni-confidence

C_{m}

, which gives us the normalized confidence.

U n i - C o n f i d e n c e^{m} = \frac{e^{\frac{C_{m}}{T}}}{\sum_{j \in M}^{M} e^{\frac{C_{j}}{T}}}

(8)

Here, T is the temperature parameter and M is the set of different modes.

4.2.2. Mul-Confidence

Since different uni-modal confidence levels interact across modalities to generate multimodal confidence levels that are interconnected with other modalities, we use the sum of the weights of the other modalities to construct the weights, which enable the different modalities to collaborate with each other.

M u l - C o n f i d e n c e^{m} = \frac{\sum_{i \neq m}^{M} log U n i - C o n f i d e n c e^{i}}{\sum_{j \in M}^{M} log U n i - C o n f i d e n c e^{j}}

(9)

As a result, we obtain a multimodal confidence collaborator, which fuses uni-confidence and mul-confidence into a final weight. The fusion method is

F i n a l - C o n f^{m} = U n i - C o n f i d e n c e^{m} + M u l - C o n f i d e n c e^{m}

. Knowledge from different modalities often provides complementary information. A multimodal confidence collaborator can capture and utilize this complementary information more comprehensively by integrating the confidence of each modality, enhancing the interaction between different modalities. Ultimately, we obtain multimodal fusion embedding as follows:

\vec{E_{i}^{f}} = \oplus_{m \in M}^{M} (\frac{e x p (e_{m})}{\sum_{j \in M} e x p (e_{j})} \cdot F i n a l - C o n f_{i}^{m} \cdot x_{i}^{m})

(10)

where

x_{i}^{m}

is the original representation (embedding) of entity x in modality m.

4.3. Joint Optimization of Loss Functions

4.3.1. Multimodal Entity Contrastive Loss

We use Contrastive Loss (CL) [27] (see Figure 2) to perform similarity judgment of entity pairs in different knowledge graphs using the known 1-to-1 entity alignment constraints. Except for the aligned entity pairs

I^{i} = \{(e_{1}^{i}, e_{2}^{i}) | e_{1}^{i} \equiv e_{2}^{i}, e_{1}^{i} \in E_{1}, e_{2}^{i} \in E_{2}\}

, the negative pairs are classified into two types, which are the unaligned pairs in the same knowledge graph

N_{1}^{i} = \{e_{1}^{j} | \forall e_{1}^{j} \in E_{1}, j \neq i\}

and unaligned pairs

N_{2}^{i} = \{e_{2}^{j} | \forall e_{2}^{j} \in E_{2}, j \neq i\}

in different knowledge graphs. These are two pairs of unaligned entities constrained to the joint embedding space to keep their embedding representations away from each other, thus improving the similarity of the aligned entities.

By encoding the inputs u or v into feature vectors,

τ_{m}

is a temperature coefficient used to adjust the distribution of similarity.

δ_{m} (u, v)

denotes the similarity score between two different inputs under mode m, defined as follows:

δ_{m} (u, v) = S o f t m a x (f_{m} {(u)}^{T} f_{m} (v) / τ_{1})

(11)

The probability distribution of positive entity pairs in different modes is as follows:

q_{m} (I^{i}) = \frac{δ_{m} (I^{i})}{δ_{m} (I^{i}) + \sum_{e_{1}^{j} \in N_{1}^{i}} δ_{m} (e_{1}^{i}, e_{1}^{j}) + \sum_{e_{2}^{j} \in N_{2}^{i}} δ_{m} (e_{1}^{i}, e_{2}^{j})}

(12)

CL can be expressed as follows:

L_{m}^{C L} = - E_{i \in B} log [\frac{1}{2} (q_{m} (I_{1}^{i}) + q_{m} (I_{1}^{i}))]

(13)

where

I_{1}^{i} = \{(e_{1}^{i}, e_{2}^{i}) | e_{1}^{i} \equiv e_{2}^{i}, e_{1}^{i} \in E_{1}, e_{2}^{i} \in E_{2}\}

and

I_{2}^{i} = \{(e_{2}^{i}, e_{1}^{i}) | e_{2}^{i} \equiv e_{1}^{i}, e_{1}^{i} \in E_{1}, e_{2}^{i} \in E_{2}\}

.

Specifically, for a pair of already aligned entities, we include both orientations of the entity pair in the set of positive examples.

There are also contrast losses for text description embeddings and contrast losses for joint embeddings:

L_{s}^{C L} = - E_{i \in B} log [\frac{1}{2} (q_{s} (I_{1}^{i}) + q_{s} (I_{1}^{i}))]

(14)

L_{f}^{C L} = - E_{i \in B} log [\frac{1}{2} (q_{f} (I_{1}^{i}) + q_{f} (I_{1}^{i}))]

(15)

q_{s} (\cdot)

and

q_{f} (\cdot)

are positive entity pair probability distributions for textual descriptions and joint features, respectively. This multimodal entity comparison loss can effectively use the existing entity alignment constraints to improve the entity alignment performance.

4.3.2. Cross-Modal Similarity Loss

In order to make the similarity between modalities increase, the embedded representations of different modalities are compared in terms of similarity, and the associations and correspondences between different modalities are learned to bring the knowledge of each modality in the knowledge graph closer (see Figure 2). We use Minkowski’s loss. Different types of data, such as text and images, have very different features and representations. These heterogeneous data often have a significant ‘modal gap’, which makes it difficult to compare and align them directly. Moreover, Minkowski’s loss is a very general metric that can be used to compare the similarity between different modalities of the embedded representations. Minkowski’s loss is a very versatile metric that can be used to align semantic representations of entities in different modalities into a common potential space, which not only takes into account the directional information of vectors but also the magnitude of the vector’s paradigm, better capturing the semantic correlation between heterogeneous data. It can not only enhance the consistency of the cross-modal knowledge representations but also reduce the semantic gap between different modalities, effectively by minimizing the loss function. We define the Minkowski loss as follows:

L_{s}^{M L} = {(\sum_{i \in B} {| \frac{1}{2} [v_{s} (I_{1}^{i}) - v_{m} (I_{1}^{i})] + \frac{1}{2} [v_{s} (I_{2}^{i}) - v_{m} (I_{2}^{i})] |}^{p})}^{\frac{1}{p}}

(16)

L_{f}^{M L} = {(\sum_{i \in B} {|\frac{1}{2} [v_{f} (I_{1}^{i}) - v_{m} (I_{1}^{i})] + \frac{1}{2} [v_{f} (I_{2}^{i}) - v_{m} (I_{2}^{i})]|}^{p})}^{\frac{1}{p}}

(17)

L_{m}^{M L} = L_{s}^{M L} + L_{f}^{M L}

(18)

where

v_{s} (\cdot)

denotes the embedding vector of the textual description, and

v_{f} (\cdot)

denotes embedding vectors for multimodal fusion. These capture the semantic information of the entity, including the contextual knowledge of entity attributes and relationships. This approach provides important information for entity representation, reduces the gap between the textual description and other modal features, and improves the consistency of the representation of the entity in different modalities. p is the Minkowski distance of the order (

p = 1

corresponds to the Manhattan distance and

p = 2

corresponds to the Euclidean distance).

The overall loss of the model is:

L = L_{s}^{C L} + L_{f}^{C L} + \sum_{m \in M} α_{m} L_{m}^{C L} + \sum_{m \in M} β_{m} L_{m}^{M L}

(19)

where

M = \{g, n, a, r, v\}

,

α_{m}

and

β_{m}

are used to balance the loss importance.

5. Experiments

5.1. Datasets

We evaluated our method on two benchmark datasets: the multimodal cross-lingual dataset DBP15K and the multimodal dataset FB15K-DB15K. DBP15K consists of three subsets: zh-en (Chinese–English), ja-en (Japanese–English), and fr-en (French–English). FB15K in FB15K-DB15K is sourced from Freebase, while DB15K is derived from the English portion of the DBP15K dataset. To assess the effectiveness of our model, we compared it with several state-of-the-art models on these two datasets. Unlike previous datasets, we enhanced these two datasets by incorporating textual description information to improve the performance of entity alignment tasks. The text description data in FB15K-DB15K is constructed based on public knowledge graphs (Wikidata and DBpedia). For instance, Wikidata provides comprehensive entity information that is used to obtain textual descriptions for entities in FB15K. We retrieve entity descriptions by making API requests based on entity IDs in FB15K. However, due to the absence of original textual description data, not all entities can be fully described. As a result, during the entity alignment process, some entities may lack their corresponding textual descriptions. Since we use two different datasets to verify the experimental results, namely DBP15K and FB15K-DB15K, the data structures in these two datasets are different, so we use different proportions of entity pairs as alignment seeds. Specifically, DBP15K is a dataset for cross-language knowledge graph alignment, and its entity alignment task mainly relies on the alignment of cross-language entities. Therefore, we only use

30 %

of the entity pairs as seed data to ensure effective training under limited alignment information while avoiding excessive dependence of model training on aligned entity pairs due to too many seed pairs. For FB15K-DB15K, it is based on a multimodal monolingual knowledge graph, which contains entity pairs from different fields, so we use

20 %

,

50 %

, and

80 %

of the entity pairs as seed data to measure the performance of the model under different amounts of alignment information.

5.2. Evaluation Metrics

In this paper, the evaluation metric

H i t s @ 1

represents a measure of the proportion of target entities appearing in the first position in the retrieval results, i.e., the accuracy rate, and

H i t s @ 10

represents a measure of the proportion of target entities appearing in the top 10 positions in the retrieval results, which represent the model’s ability to correctly identify the target entities.

M R R

is the Mean Reverse Rank, calculated as the inverse rank of the first correct result for each retrieval query mean value. It reflects the speed at which the model returns the correct answer. The higher the

M R R

value, the more the model tends to return the correct answer in the first few positions. The

M R R

is calculated using the following formula:

M R R = \frac{1}{Q} \sum_{i = 1}^{Q} \frac{q}{r a n k_{i}}

(20)

where Q represents the total number of queries, and

r a n k_{i}

represents the ranking position of the first relevant result of query i. We set random seeds to ensure the reproducibility of the experiments. Our model consists of three hidden layers, each containing 300 units. We employ a two-layer Graph Attention Network (GAT) with two attention heads per layer to compute the CL and ML losses across different modalities. We adopt a confidence-based modality fusion strategy to automatically assign varying weight coefficients to the different modalities. To prevent overfitting, we incorporate an early stopping mechanism that halts the training process if the experimental results do not improve significantly over ten consecutive rounds. These hyperparameter settings are carefully selected to build a well-optimized model and ensure its generalization performance through regularization techniques.

5.3. Competing Methods

To demonstrate the advanced nature of the model, we conducted comparative experiments on our FMEA-TD model with other models on two different datasets. The experimental results show that our model has the best performance.

Table 1 compares the performance of our model with previous models on multilingual datasets. The improvement in performance achieved by our model with respect to the best baseline MEAFE validates the significance of multilingual text description information and cross-modal similarity loss in improving multimodal entity alignment.

The compared models include the use of a graph structure through TransE variants (MTransE [28], BootEA [29]), the use of a graph structure using GCN (KECG [30], HMAN [31], HGCN [32]), GCN combined with multimodal features (EVA [11]), GCN combined with auxiliary information (BERT-INT [19]), and GAT combined with multimodal features (MCLEA [12], MEAFE [33]). Based on the comparison, our model has the most advanced baseline. Based on the experimental results, it can be concluded that our model uses richer textual descriptions; compared to MCLEA, the textual description information contains features such as entity names, attributes, etc., and it provides more detailed descriptions of these basic features, which can help the model to better understand and represent the semantic information of entities. Compared with other uni-modal alignment models, such as HMAN, the visual information can respond to the visual features of the entity, and the fusion of the information for the two modes can describe the entity more accurately and overcome the limitations of single-modal information. Compared to the BERT-INT model, our model uses only are only 1/8 of the parameters, and our model is able to better learn the semantic representation of text by fusing multilingual text information with cross-modal similarity loss without relying on specific entity description data. In contrast, the BERT-INT model relies heavily on fine-tuning the BERT model using characteristic entity description information to enhance the entity representations.

We progressively incorporated

20 %

,

40 %

,

60 %

, and

80 %

of textual description information to supplement the entity alignment process. As shown in the entity alignment accuracy bar chart (see Figure 5), the results clearly indicate that increasing the amount of textual description information significantly enhances the alignment accuracy.

Table 2 presents the performance of three models—MCLEA, MMEA, and FMEA-TD (ours)—in the entity alignment task, with evaluation metrics including Hit@1, Hit@10, and MRR for four different percentages of textual description data (

20 %

,

40 %

,

60 %

,

80 %

). Taking the scenario with a

20 %

training seed ratio as an example, the best baseline model MCLEA achieves a Hit@1 value of

32.6 %

with

20 %

textual description data, while our model achieves a Hit@1 value of

35.9 %

under the same conditions, representing a

10.12 %

performance improvement. When the proportion of textual description data increases to

80 %

, the Hit@1 value of our model rises to

77.3 %

. Such substantial growth highlights the critical role of textual description data in enhancing the accuracy of entity alignment.

Compared with the performance on the DBP15K dataset, the accuracy improvement of entity alignment by adding textual description information is more significant on multimodal monolingual datasets such as FB15K-DB15K. The remarkable enhancement in model accuracy indicates that textual descriptions play a more prominent positive role in entity alignment tasks under multimodal monolingual environments.

The reasons for the differences in results between the two different experimental datasets are as follows. The differences in performance between the two datasets stem from their inherent characteristics. In DBP15K (cross-lingual), knowledge graphs have similar structures and clearly aligned entity names, which already provide strong cues for alignment. Thus, adding textual descriptions offers limited additional benefit, as the structural and name information is usually sufficient for accurate matching. In contrast, FB15K-DB15K (cross-knowledge source) contains graphs with less structural similarity and more ambiguous or varied entity names. Here, basic features alone are often insufficient, making textual descriptions crucial for capturing rich semantic details and improving alignment. This explains why removing descriptions leads to a larger drop in performance on this dataset, highlighting the need to adapt fusion strategies based on dataset characteristics.

The above experiments clearly demonstrate that the positive impact of textual description information on the entity alignment task is significant and cannot be overlooked. Particularly in scenarios where data is scarce and alignment is challenging, textual descriptions provide a valuable supplement and support for improving entity alignment.

5.4. Effects of Parameters

We investigated the number of layers K of the graph attention network (GAT), different graph structure embedding models (GCN, GAT, and GAT2V [34])), and the effects of different temperature parameters on the experimental results during the multimodal knowledge fusion process (see Figure 6). These parameters had a significant impact on FMEA-TD. We used the AdamW optimizer with a learning rate of

5 \times 10^{- 4}

and a batch size of 512 to update the model parameters. To prevent overfitting, we applied an early stopping strategy, where training was terminated if the validation performance (Hits@1) did not improve for 10 consecutive evaluations.

When choosing the number of layers of the GAT network, we considered both the accuracy of the model and the resource consumption. With similar experimental results, we chose

K = 2

. During the selection process of graph neural networks, we found that GAT significantly outperformed GCN and GAT2V in the multimodal entity alignment task. GAT adaptively assigned weights to different neighbors through the attentional mechanism and was able to dynamically adjust the information propagation according to the importance of the neighboring nodes, thus capturing the alignment information more efficiently. In contrast, GCN uses fixed neighbor weights, which may not fully utilize the graph structure information. In addition, GAT is more capable of handling heterogeneous graphs (different types of nodes and edges), while GAT2V is not as efficient as GAT when faced with diversity and complexity. The temperature parameter is typically used in multimodal collaboration to regulate the smoothness of model outputs. When the temperature value is low, it sharpens the output distribution and increases the probability of high-confidence categories. Conversely, higher temperature values flatten the output distribution, thereby reducing the probability of high-confidence categories. In the end, we chose a temperature coefficient of 0.5.

5.5. Ablation Experiments

We conducted an exhaustive analysis of the ablation experiments for the contributions of the different components for graph structural features, visual features, underlying textual information, textual description information, cross-modal similarity loss, and multimodal confidence collaborator, as shown in Table 3.

After removing the visual information, text description information, and cross-modal similarity loss, the model performance declined. However, this change had a limited impact on the experimental results. When the base text information is removed, the model performance drops significantly. This indicates that the base text makes a greater contribution to the model. Different components all play a role in the model’s performance, but the role of text information is particularly prominent. It provides the model with the most fundamental and crucial semantic representation. In future experiments, we will consider making more comprehensive use of description information and adopt knowledge graph completion methods to enhance the model’s ability to align multimodal knowledge graphs.

6. Conclusions

Most recent mainstream alignment studies typically adopt deep learning-based methods to embed knowledge for different modalities, but these methods often fail to fully consider richer descriptive information. Therefore, this paper proposes an entity alignment method in the cross-lingual multimodal knowledge graph FMEA-TD based on textual descriptions. The main contributions are as follows: (1) Textual descriptions are used to provide rich complementary features for multimodal data, thereby capturing more comprehensive semantic information. (2) A multimodal confidence collaborator is introduced to assign dynamic weights for multimodal feature fusion through inter-modal collaboration, making better use of complementary information. (3) A joint optimization loss function is established to simultaneously consider the associations between each modality embedding, text description embedding, and fusion embedding, enabling the model to more accurately understand and represent the overall semantic features of entities. The strong performance of FMEA-TD demonstrates its potential in improving multilingual knowledge retrieval by accurately aligning semantically equivalent entities across different languages and modalities. In applications such as multilingual search and question answering, correct alignment ensures that user queries are matched with the most relevant and complete information, regardless of the language or modality of the underlying data. Furthermore, by leveraging textual descriptions and a confidence-aware fusion strategy, FMEA-TD is able to handle incomplete or noisy data more effectively, making it a robust solution for real-world dynamic and heterogeneous knowledge environments. In future work, we plan to explore methods for supplementing unstructured data such as images and textual descriptions in knowledge graphs, further improving the content of knowledge graphs and enhancing the coverage and accuracy of the graphs. This is expected to further improve the alignment performance, especially in handling multilingual and multimodal alignment tasks to achieve better results.

Author Contributions

C.W. (Chenchen Wang): Methodology, Investigation, Formal analysis, Writing—original draft. C.: Validation, Resources, Investigation, Writing—review & editing. Y.W.: Supervision, Project administration. X.L.: Supervision. Z.L.: Software. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Funds for the Central Universities (2023QNYL24).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

FMEA-TD	Fusion-Optimized Multimodal Entity Alignment
SBERT	Sentence-BERT
JOF	Joint optimization of loss functions

References

Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P.N.; Hellmann, S.; Morsey, M.; VanKleef, P.; Auer, S.; et al. Dbpedia—A large-scale, multilingual knowledge base extracted from wikipedia. Semant. Web 2015, 6, 167–195. [Google Scholar] [CrossRef]
Suchanek, F.M.; Kasneci, G.; Weikum, G. Yago: A large ontology from wikipedia and wordnet. J. Web Semant. 2008, 6, 203–217. [Google Scholar] [CrossRef]
Moussallem, D.; Ngonga Ngomo, A.C.; Buitelaar, P.; Arcan, M. Utilizing knowledge graphs for neural machine translation augmentation. In Proceedings of the 10th International Conference on Knowledge Capture, Marina Del Rey, CA, USA, 19–21 November 2019; pp. 139–146. [Google Scholar]
Srivastava, S.; Patidar, M.; Chowdhury, S.; Agarwal, P.; Bhattacharya, I.; Shroff, G. Complex question answering on knowledge graphs using machine translation and multi-task learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Online, 19–23 April 2021; Main Volume. pp. 3428–3439. [Google Scholar]
Tao, W.; Zhu, H.; Tan, K.; Wang, J.; Liang, Y.; Jiang, H.; Yuan, P.; Lan, Y. Finqa: A training-free dynamic knowledge graph question answering system in finance with llm-based revision. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Vilnius, Lithuania, 8–12 September 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 418–423. [Google Scholar]
Li, M.; Zareian, A.; Lin, Y.; Pan, X.; Whitehead, S.; Chen, B.; Wu, B.; Ji, H.; Chang, S.F.; Voss, C.; et al. Gaia: A fine-grained multimedia knowledge extraction system. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online, 5–10 July 2020; pp. 77–86. [Google Scholar]
Wen, H.; Lin, Y.; Lai, T.; Pan, X.; Li, S.; Lin, X.; Zhou, B.; Li, M.; Wang, H.; Zhang, H.; et al. Resin: A dockerized schema-guided cross-document cross-lingual cross-media information extraction and event tracking system. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Online, 6–11 June 2021; pp. 133–143. [Google Scholar]
Zhu, X.; Li, Z.; Wang, X.; Jiang, X.; Sun, P.; Wang, X.; Xiao, Y.; Yuan, N.J. Multi-modal knowledge graph construction and application: A survey. IEEE Trans. Knowl. Data Eng. 2022, 36, 715–735. [Google Scholar] [CrossRef]
Li, J.; Luo, R.; Sun, J.; Xiao, J.; Yang, Y. Prior bilinear-based models for knowledge graph completion. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Vilnius, Lithuania, 8–12 September 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 317–334. [Google Scholar]
Yu, M.; Zuo, Y.; Zhang, W.; Zhao, M.; Xu, T.; Zhao, Y.; Guo, J.; Yu, J. Graph attention network with relational dynamic factual fusion for knowledge graph completion. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Vilnius, Lithuania, 8–12 September 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 89–106. [Google Scholar]
Liu, F.; Chen, M.; Roth, D.; Collier, N. Visual pivoting for (unsupervised) entity alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 19–21 May 2021; Volume 35, pp. 4257–4266. [Google Scholar]
Lin, Z.; Zhang, Z.; Wang, M.; Shi, Y.; Wu, X.; Zheng, Y. Multi-modal contrastive representation learning for entity alignment. arXiv 2022, arXiv:2209.00891. [Google Scholar]
Li, Y.; Chen, J.; Li, Y.; Xiang, Y.; Chen, X.; Zheng, H.T. Vision, deduction and alignment: An empirical study on multi-modal knowledge graph alignment. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
Mao, X.; Wang, W.; Wu, Y.; Lan, M. From alignment to assignment: Frustratingly simple unsupervised entity alignment. arXiv 2021, arXiv:2109.02363. [Google Scholar]
Qi, Z.; Zhang, Z.; Chen, J.; Chen, X.; Xiang, Y.; Zhang, N.; Zheng, Y. Unsupervised knowledge graph alignment by probabilistic reasoning and semantic embedding. arXiv 2021, arXiv:2105.05596. [Google Scholar]
Jiang, C.; Qian, Y.; Chen, L.; Gu, Y.; Xie, X. Unsupervised deep cross-language entity alignment. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Turin, Italy, 18–22 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 3–19. [Google Scholar]
Zeng, W.; Tang, J.; Zhao, X. Iterative representation learning for entity alignment leveraging textual information. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: International Workshops of ECML PKDD 2019, Würzburg, Germany, 16–20 September 2019; Springer: Berlin/Heidelberg, Germany, 2020. Part I. pp. 489–494. [Google Scholar]
Chen, L.; Li, Z.; Wang, Y.; Xu, T.; Wang, Z.; Chen, E. Mmea: Entity alignment for multi-modal knowledge graph. In Proceedings of the Knowledge Science, Engineering and Management: 13th International Conference, KSEM 2020, Hangzhou, China, 28–30 August 2020; Springer: Berlin/Heidelberg, Germany, 2020. Part I. Volume 13, pp. 134–147. [Google Scholar]
Tang, X.; Zhang, J.; Chen, B.; Yang, Y.; Chen, H.; Li, C. Bert-int: A bert-based interaction model for knowledge graph alignment. Interactions 2020, 100, e1. [Google Scholar]
Cao, B.; Xia, Y.; Ding, Y.; Zhang, C.; Hu, Q. Predictive dynamic fusion. arXiv 2024, arXiv:2406.04802. [Google Scholar]
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Reimers, N. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv 2019, arXiv:1908.10084. [Google Scholar]
Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1735–1742. [Google Scholar]
Chen, M.; Tian, Y.; Yang, M.; Zaniolo, C. Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. arXiv 2016, arXiv:1611.03954. [Google Scholar]
Sun, Z.; Hu, W.; Zhang, Q.; Qu, Y. Bootstrapping entity alignment with knowledge graph embedding. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; Volume 18. [Google Scholar]
Li, C.; Cao, Y.; Hou, L.; Shi, J.; Li, J.; Chua, T.S. Semi-Supervised Entity Alignment via Joint Knowledge Embedding Model and Cross-Graph Mode; Association for Computational Linguistics: Florence, Italy, 2019. [Google Scholar]
Yang, H.W.; Zou, Y.; Shi, P.; Lu, W.; Lin, J.; Sun, X. Aligning cross-lingual entities with multi-aspect information. arXiv 2019, arXiv:1910.06575. [Google Scholar]
Wu, Y.; Liu, X.; Feng, Y.; Wang, Z.; Zhao, D. Jointly learning entity and relation representations for entity alignment. arXiv 2019, arXiv:1909.09317. [Google Scholar]
Wang, H.; Liu, Q.; Huang, R.; Zhang, J. Multi-modal entity alignment method based on feature enhancement. Appl. Sci. 2023, 13, 6747. [Google Scholar] [CrossRef]
Brody, S.; Alon, U.; Yahav, E. How attentive are graph attention networks? arXiv 2021, arXiv:2105.14491. [Google Scholar]

Figure 1. An example of multimodal entity alignment, including traditional text information, visual information, and text description information. The double arrows represent the inter-language links (ILL) between knowledge graphs in two languages.

Figure 2. The FMEA-TD model data source is derived from multimodal knowledge graphs and includes a multimodal embedding module, a feature dynamic fusion module, and a joint optimization loss function. (a) Multimodal feature description and embedding, (b) the process of multimodal embedding fusion through multimodal collaborative confidence, (c) the optimization of fusion embedding through the cross-modal alignment loss (CL) and cross-modal similarity loss (ML).

Figure 3. The Sentence-BERT model is utilized to extract textual description embeddings and obtain probability distributions for positive entity pairs.

Figure 4. A multimodal confidence collaborator is used to calculate the weights corresponding to different modal embeddings during knowledge fusion, and the fusion embeddings are obtained by weighing and summing the embeddings of different modalities.

Figure 5. Alignment accuracy of FB15K-DB15K multimodal dataset at different text description information scales.

Figure 6. Effects of different parameters and models on experimental results.

Table 1. Comparison of performance with previous models on the multilingual dataset DBP15K.

Model	zh-en			ja-en			fr-en
Model	H@1	H@10	MRR	H@1	H@10	MRR	H@1	H@10	MRR
MTransE	0.308	0.614	0.364	0.279	0.575	0.349	0.244	0.996	0.335
BootEA	0.629	0.848	0.703	0.622	0.854	0.701	0.653	0.874	0.731
KEGC	0.478	0.835	0.598	0.490	0.844	0.610	0.486	0.851	0.610
HMAN	0.871	0.987	-	0.935	0.994	-	0.973	0.998	-
EVA	0.761	0.907	0.814	0.762	0.913	0.817	0.793	0.942	0.847
HGCN	0.720	0.857	0.768	0.766	0.897	0.813	0.892	0.961	0.917
BERT-INT	0.968	0.990	0.977	0.964	0.991	0.975	0.992	0.998	0.995
MCLEA	0.972	0.996	0.981	0.986	0.999	0.991	0.997	1.000	0.998
MEAFE	0.973	0.997	0.982	0.987	0.999	0.992	0.997	1.000	0.998
FMEA-TD (ours)	0.979	0.998	0.986	0.991	1.000	0.995	0.997	1.000	0.998

Table 2. Comparison of performance with previous models on the cross-knowledge graph multimodal dataset FB15K-DB15K.

Model		20%			50%			80%
		H@1	H@10	MRR	H@1	H@10	MRR	H@1	H@10	MRR
	20%	0.326	0.573	0.402	0.594	0.704	0.665	0.763	0.893	0.801
MCLEA	40%	0.458	0.649	0.502	0.647	0.766	0.701	0.818	0.912	0.842
	60%	0.595	0.735	0.634	0.738	0.84	0.783	0.863	0.937	0.902
	80%	0.728	0.874	0.787	0.864	0.913	0.887	0.918	0.956	0.93
	20%	0.297	0.552	0.388	0.436	0.654	0.585	0.625	0.864	0.766
MMEA	40%	0.412	0.636	0.498	0.59	0.754	0.634	0.766	0.886	0.803
	60%	0.547	0.628	0.586	0.678	0.831	0.747	0.843	0.898	0.855
	80%	0.687	0.814	0.756	0.802	0.869	0.825	0.882	0.923	0.902
	20%	0.359	0.627	0.451	0.641	0.801	0.713	0.823	0.933	0.865
FMEA-TD	40%	0.507	0.727	0.582	0.722	0.87	0.769	0.875	0.954	0.899
	60%	0.648	0.822	0.708	0.812	0.918	0.851	0.905	0.966	0.927
	80%	0.773	0.903	0.82	0.906	0.966	0.927	0.948	0.984	0.963

Table 3. Experimental results of ablation on multilingual dataset in different directions.

	zh-en		en-zh		ja-en		en-ja		fr-en		en-fr
	H@1	MRR	H@1	MRR	H@1	MRR	H@1	MRR	H@1	MRR	H@1	MRR
w/o Gph	0.920	0.943	0.916	0.941	0.960	0.974	0.958	0.971	0.991	0.999	0.992	0.995
w/o Img	0.970	0.983	0.969	0.993	0.981	0.991	0.983	0.991	0.994	0.999	0.994	0.998
w/o Text	0.864	0.891	0.861	0.888	0.903	0.914	0.907	0.918	0.962	0.978	0.958	0.973
w/o Desc	0.971	0.980	0.970	0.979	0.981	0.990	0.980	0.988	0.994	0.993	0.991	0.995
w/o ML	0.976	0.987	0.972	0.983	0.988	0.993	0.988	0.993	0.997	0.999	0.997	0.999
w/o Fus	0.973	0.982	0.971	0.981	0.983	0.991	0.984	0.991	0.994	0.996	0.993	0.997
ours	0.979	0.986	0.976	0.985	0.991	0.995	0.989	0.994	0.997	0.999	0.997	0.999

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, C.; Chaomurilige; Weng, Y.; Liu, X.; Liu, Z. Fusion-Optimized Multimodal Entity Alignment with Textual Descriptions. Information 2025, 16, 534. https://doi.org/10.3390/info16070534

AMA Style

Wang C, Chaomurilige, Weng Y, Liu X, Liu Z. Fusion-Optimized Multimodal Entity Alignment with Textual Descriptions. Information. 2025; 16(7):534. https://doi.org/10.3390/info16070534

Chicago/Turabian Style

Wang, Chenchen, Chaomurilige, Yu Weng, Xuan Liu, and Zheng Liu. 2025. "Fusion-Optimized Multimodal Entity Alignment with Textual Descriptions" Information 16, no. 7: 534. https://doi.org/10.3390/info16070534

APA Style

Wang, C., Chaomurilige, Weng, Y., Liu, X., & Liu, Z. (2025). Fusion-Optimized Multimodal Entity Alignment with Textual Descriptions. Information, 16(7), 534. https://doi.org/10.3390/info16070534

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Fusion-Optimized Multimodal Entity Alignment with Textual Descriptions

Abstract

1. Introduction

2. Problem Definition

3. Related Work

4. Methodology

4.1. Knowledge Embedding

4.1.1. Graph Structure Embedding

4.1.2. Text Embedding

4.1.3. Visual Embedding

4.1.4. Textual Description Embedding

4.2. Feature Dynamic Fusion

4.2.1. Uni-Confidence

4.2.2. Mul-Confidence

4.3. Joint Optimization of Loss Functions

4.3.1. Multimodal Entity Contrastive Loss

4.3.2. Cross-Modal Similarity Loss

5. Experiments

5.1. Datasets

5.2. Evaluation Metrics

5.3. Competing Methods

5.4. Effects of Parameters

5.5. Ablation Experiments

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI