Article

Modality Information Aggregation Graph Attention Network with Adversarial Training for Multi-Modal Knowledge Graph Completion

1 School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China
2 Xinjiang Key Laboratory of Multilingual Information Technology, Urumqi 830017, China
3 School of National Security Studies, Xinjiang University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Information 2025, 16(10), 907; https://doi.org/10.3390/info16100907
Submission received: 12 August 2025 / Revised: 1 October 2025 / Accepted: 14 October 2025 / Published: 16 October 2025

Abstract

Multi-modal knowledge graph completion (MMKGC) aims to complete knowledge graphs by integrating structural information with multi-modal (e.g., visual, textual, and numerical) features and by leveraging cross-modal reasoning within a unified semantic space to infer and supplement missing factual knowledge. Current MMKGC methods have made progress in integrating multi-modal information but overlook the imbalance in modality importance for target entities. Treating all modalities equally dilutes critical semantics and amplifies irrelevant information, which in turn limits the semantic understanding and predictive performance of the model. To address these limitations, we propose a modality information aggregation graph attention network with adversarial training for multi-modal knowledge graph completion (MIAGAT-AT). MIAGAT-AT focuses on hierarchically modeling complex cross-modal interactions. By combining a multi-head attention mechanism with modality-specific projections, it precisely captures global semantic dependencies and dynamically adjusts the weight of each modality embedding according to its importance, thereby improving cross-modal information fusion. Moreover, through the use of random noise and multi-layer residual blocks, adversarial training generates high-quality multi-modal feature representations, effectively enriching the information from imbalanced modalities. Experimental results demonstrate that our approach significantly outperforms 18 existing baselines across three distinct datasets, establishing a strong new performance baseline.

1. Introduction

Knowledge graphs (KGs) represent real-world knowledge in the form of structured fact triples (head entity, relation, and tail entity) and have profound research significance and broad application potential in fields such as question answering systems [1], recommendation systems [2], and intelligent search [3]. However, with the rapid development of artificial intelligence, the limitations of traditional uni-modal knowledge graphs [4] are becoming increasingly apparent when confronted with the growing complexity of multi-modal data and application scenarios in the modern digital landscape. Uni-modal knowledge graphs struggle to comprehensively represent information that includes various modalities such as textual and visual content. These multi-modal data often provide richer contextual cues and cognitive foundations for artificial intelligence systems. This challenge has prompted both academia and industry to actively explore and develop multi-modal knowledge graphs (MMKGs) [5]. MMKGs are an innovative knowledge representation tool that combines the structured knowledge of standard KGs with multi-modal data. By integrating diverse perceptual data, MMKGs provide a more comprehensive and multi-dimensional cognitive foundation for artificial intelligence systems, showcasing significant potential in multi-modal applications.
However, despite the considerable scale achieved by many publicly available MMKGs, the incompleteness of MMKGs remains a prominent challenge due to the slow accumulation of multi-modal corpora and the continuous emergence of complex relations and entities. This incompleteness directly limits the widespread application of MMKGs in intelligent systems, making it a critical issue that requires urgent attention. To address this problem, researchers have proposed the task of MMKGC [6]. This task incorporates multi-modal information into knowledge graph completion (KGC) [7] models to more comprehensively address the missing information in MMKGC. Specifically, multi-modal information can serve as supplementary data to enrich the representation of entity embeddings in the MMKGC task, thereby improving completion performance and resulting in more accurate predictions.
Current mainstream MMKGC methods, such as those in [8,9,10,11], typically embed multi-modal entity information into separate vector spaces and integrate these embeddings through operations like concatenation or averaging to enhance entity representation. However, despite performance improvements, these methods have significant limitations. First, they ignore the imbalance of modality contributions, as different modalities offer varying degrees of relevance and information density. Fixed fusion strategies, like concatenation or averaging, assume equal importance for all modalities, leading to suboptimal utilization of key modalities. Second, there is an underutilization of multi-modal knowledge, as quality disparities among data sources result in incomplete modal information, limiting the representation of multidimensional entity attributes. Furthermore, these methods primarily emphasize textual and visual modalities, neglecting the latent value of numerical data embedded within text, which is essential for improving reasoning and enhancing semantic representation.
To overcome the limitations discussed above, this paper proposes the modality information aggregation graph attention network with adversarial training for multi-modal knowledge graph completion (MIAGAT-AT). MIAGAT-AT consists of two core modules: the modality information aggregation (MIAGAT) module and the adversarial training (AT) module. The MIAGAT module dynamically adjusts the weight of each modality based on the correlation between modality embeddings and the target entity. The AT module enhances the representational capacity of modality embeddings by introducing a generative adversarial network (GAN) training strategy. Our main contributions are summarized as follows:
  • We propose a novel modality information aggregation graph attention network (MIAGAT) designed to dynamically integrate four modalities (structural, textual, visual, and numerical features). By assessing the importance of each modality relative to the target entity, MIAGAT employs an adaptive attention mechanism to optimize weight allocation, producing integrated entity embeddings with deeply enriched multi-modal features.
  • We propose an adversarial training (AT) module that generates synthetic adversarial entities as training examples, substantially enhancing the model robustness and generalization capability in complex multi-modal scenarios and concurrently strengthening its ability to comprehend and represent multi-modal knowledge with greater precision and effectiveness.
  • We evaluated the performance of MIAGAT-AT through comprehensive experiments and analyses on link prediction tasks across three public benchmarks. The experimental results show that MIAGAT-AT significantly outperforms 18 existing baseline models, demonstrating its effectiveness.

2. Related Work

2.1. Multi-Modal Knowledge Graph Completion

Current research in MMKGC aims to improve the representation of entities by integrating various data modalities, such as visual, textual, and structural features. Unlike traditional knowledge graph embedding (KGE) methods, which rely solely on structural triples, MMKGC approaches seek to leverage the complementary strengths of multiple modalities to enhance predictive accuracy and the richness of entity representations. Over the years, several methods have been proposed to address these challenges, each introducing innovative strategies to better integrate and utilize multi-modal information. One of the pioneering works in this field is IKRL [11], which incorporated the visual modality into knowledge graph embeddings by using attention mechanisms to integrate entity visual features. TBKGC [9] further advanced this by combining visual and linguistic modalities within a translation-based framework, allowing for more comprehensive multi-modal representations. TransAE [12] unified structural and multi-modal information through a multi-modal autoencoder, improving the fusion of inter-modal features. RSME [10] introduced a relation-sensitive mechanism that dynamically filters irrelevant visual information, achieving a balance between structural and multi-modal contributions. OTKGE [13] applied optimal transport theory, modeling multi-modal embedding alignment as a Wasserstein distance minimization problem, thereby preserving spatial consistency across modalities. VBKGC [14] integrated the pre-trained Visual-BERT model for deep multi-modal fusion and introduced twin negative sampling, a strategy specifically designed for multi-modal contexts, significantly enhancing multi-modal embedding alignment and representation. To further boost the performance of MMKGC, several enhanced negative sampling strategies have been proposed. MMRNS [15] leverages relation-aware multi-modal attention to generate diverse and challenging negative samples, thereby improving embedding quality and representation robustness. MANS [16] introduced a modality-aware negative sampling technique, which overcomes the limitations of traditional methods in multi-modal knowledge graph embeddings by effectively aligning structural and multi-modal embeddings.

2.2. Adversarial Training in Knowledge Graph Completion

Adversarial training is a key technique for enhancing the robustness of machine learning models by incorporating adversarial examples during training. Widely applied across various domains, it effectively mitigates adversarial attacks and improves model reliability under perturbations. In MMKGC, adversarial training addresses challenges such as modality imbalance and noisy data. By generating adversarial samples to refine cross-modal embeddings, it strengthens the alignment and integration of structural, visual, and textual information, thereby improving prediction accuracy and representation learning. For instance, PUDA [17] applies adversarial training within a positive-unlabeled mini-max framework to tackle data sparsity, generating informative adversarial samples to boost model robustness. AdaMF-MAT [18] extends this approach to an adaptive multi-modal fusion framework, addressing modality imbalance by generating adversarial samples to enhance modality integration. NATIVE [19] advances the concept by employing collaborative adversarial training within a relation-guided, dual-adaptive-fusion framework, enabling efficient and flexible multi-modal integration for MMKGC. These methods underscore the crucial role of adversarial training in overcoming key MMKGC challenges, offering robust solutions for modality imbalance, noisy data, and data sparsity across diverse scenarios.

3. Methodology

In this section, we provide a comprehensive introduction to the proposed MIAGAT-AT model, focusing on three key components: modal feature extraction, the modality information aggregation (MIAGAT) module, and the adversarial training (AT) module. The complete framework of the model is illustrated in Figure 1.

3.1. Problem Definition

A knowledge graph is rigorously defined as a directed graph $G = \{E, R, T\}$, where $E$ and $R$ represent the sets of entities and relations, respectively. The set $T = \{(h, r, t) \mid h, t \in E, r \in R\}$ consists of the triples in the graph, where $r \in R$ denotes the relation between the head entity $h$ and the tail entity $t$. In the context of an MMKG, let $M$ represent the set of modalities, encompassing various forms of modal information such as texts, images, and numerical data. For each entity $e \in E$, its corresponding modality embedding is denoted as $e_m$, where $m \in M = \{o, i, n\}$. Consequently, the embeddings for the text, image, and numerical modalities are represented by $e_o$, $e_i$, and $e_n$, respectively. Furthermore, the graph structure itself is considered an intrinsic modality for each entity, with structural information embedded in the triples $T$. The structural embedding of an entity refers to its representation derived through an embedding layer and is denoted as $e_c$. MMKGC methods seek to learn a fused representation of entities across modalities, which is subsequently embedded into a continuous, low-dimensional vector space. These embeddings are then leveraged in link prediction tasks, where the objective is to predict the missing head entity $h$ or tail entity $t$ in a given query ($(h, r, ?)$ or $(?, r, t)$). During the prediction phase, the entity set $E$ serves as the candidate set of possible entities.
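To make the notation concrete, the short Python sketch below stores a toy knowledge graph as a set of triples and answers a tail query $(h, r, ?)$ by ranking every candidate entity in $E$ with an arbitrary scoring function; the entity and relation names are purely illustrative and not taken from the benchmark datasets.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity h, relation r, tail entity t)

# Toy triple set T; entity and relation names are illustrative only.
triples: List[Triple] = [
    ("Matt_Damon", "actedIn", "Rounders"),
    ("Rounders", "isLocatedIn", "New_York_City"),
]
entities = {e for h, _, t in triples for e in (h, t)}  # candidate set E

def rank_tails(h: str, r: str, score: Callable[[str, str, str], float]) -> List[str]:
    """Answer a query (h, r, ?) by scoring every candidate tail entity and sorting."""
    return sorted(entities, key=lambda t: score(h, r, t), reverse=True)
```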

3.2. Modal Feature Extraction

To effectively leverage multi-modal information in MMKGC, encoding each modality and extracting its relevant features are essential. For the textual, visual, and numerical modalities, we employ pre-trained models as modality encoders to capture feature representations from each modality, which are then utilized for the fusion of multi-modal information. In the implementation of modality encoding, we adopt BEiT, which is based on the Vision Transformer (ViT) architecture, as the visual encoder. For the textual modality, we utilize SBERT (Sentence-BERT), which is built upon the BERT architecture, as the textual encoder. For the numerical modality, we apply BERT, leveraging its powerful contextualization capabilities, as the numerical encoder. The extraction of the raw feature f m for entity e in modality m can be formally expressed by the following general equation:
$f_m = \frac{1}{|S_m(e)|} \sum_{s_{m,i} \in S_m(e)} \mathrm{ME}_m(s_{m,i})$  (1)
where $S_m(e)$ denotes the set of modality-$m$ data associated with entity $e$, $s_{m,i}$ is a specific element of $S_m(e)$, and $\mathrm{ME}_m$ is the modality encoder for modality $m$. The general formulation of Equation (1) was adopted from NATIVE [19], since its pre-trained encoders play the same role as in our work, namely extracting raw modal features.
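As an illustration of Equation (1), the following sketch averages frozen encoder outputs over all items of one modality for a single entity; `modality_encoder` is a placeholder standing in for BEiT, SBERT, or BERT feature extraction, not necessarily the exact pipeline used in the implementation.

```python
import torch

def extract_modal_feature(items, modality_encoder):
    """Equation (1): average the encoder features of all modality items S_m(e) of an entity.

    items            -- list of raw inputs for one modality (e.g., images or sentences)
    modality_encoder -- callable mapping one raw input to a 1-D feature tensor
    """
    feats = torch.stack([modality_encoder(x) for x in items], dim=0)  # |S_m(e)| x d_m
    return feats.mean(dim=0)                                          # d_m
```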

3.3. Modality Information Aggregation Module

In MMKGC tasks, the imbalance of modality information necessitates a careful analysis of each modality. Different modalities carry different informational value, making it essential to accurately evaluate their contributions to the target entity. Current methods often use fixed fusion strategies that treat all modalities equally, neglecting the differences in feature representations and their actual relevance to the target entity. To address these issues, we propose the modality information aggregation graph attention network. Serving as the modality information aggregation module, this network employs a multi-head attention mechanism to dynamically assign weights to each modality based on the correlation between the modalities and the target entity.
In this module, the initial embedding matrix of the entities is represented as $H \in \mathbb{R}^{N_e \times d}$, where the $i$-th row corresponds to the embedding of entity $e_i$, $N_e$ denotes the number of entities, and $d$ is the feature dimension of the entity embeddings. The multi-head attention mechanism of MIAGAT processes this matrix, computing the attention weights of each modality with respect to the target entity. After the weighted aggregation, the updated entity embedding matrix $H' \in \mathbb{R}^{N_e \times d'}$ is obtained.
First, as proposed in GAT [20], a learnable weight matrix $W$ is used to linearly map the target entity embedding $e_c$ and each modality embedding $e_m$ into a high-dimensional space. The high-dimensional representations of the target entity and the modalities are then concatenated to form a joint feature vector:
$e_{cm} = \big[\, W e_c \,\|\, W e_m \,\big]$  (2)
where $e_{cm}$ captures the importance of the modality-$m$ features with respect to the target entity $c$, $W \in \mathbb{R}^{d \times d'}$, $d$ is the dimension of the original features, and $d'$ is the dimension of the transformed features.
The attention mechanism employed in our approach is a single-layer feedforward neural network parameterized by a linear transformation vector $\mathbf{a} \in \mathbb{R}^{2d'}$, with a LeakyReLU activation applied to introduce non-linearity:
$\beta_{cm} = \alpha_m \cdot \mathrm{LeakyReLU}\big(\mathbf{a}^{\top} e_{cm}\big)$  (3)
where $\beta_{cm}$ is the absolute attention coefficient of modality $m$, and $\alpha_m$ is a learnable scaling factor used to dynamically adjust the contribution of each modality to the target entity.
To obtain the relative attention value of each modality, $\beta_{cm}$ is normalized with the Softmax function:
$\alpha_{cm} = \dfrac{\exp\!\big(\alpha_m \cdot \mathrm{LeakyReLU}(\mathbf{a}^{\top} [\, W e_c \,\|\, W e_m \,])\big)}{\sum_{z \in M} \exp\!\big(\alpha_z \cdot \mathrm{LeakyReLU}(\mathbf{a}^{\top} [\, W e_c \,\|\, W e_z \,])\big)}$  (4)
where $\top$ denotes the transpose operation, $\|$ denotes feature concatenation, and $z$ indexes the modalities in the set $M$.
After normalizing the modality attention coefficients, we weight and combine the modality features through a linear transformation to produce the final output for each node:
$\tilde{e} = \sigma\!\left( \sum_{m \in M} \alpha_{cm} W e_m \right)$  (5)
To enhance the learning stability of the self-attention mechanism, a multi-head attention mechanism is introduced. The joint modality embedding of an entity can therefore be represented as
$\tilde{e} = \sigma\!\left( \frac{1}{K} \sum_{k=1}^{K} \sum_{m \in M} \alpha_{cm}^{k} W^{k} e_m \right)$  (6)
where $\sigma$ denotes the non-linear activation function, $\frac{1}{K}$ averages over the $K$ attention heads, $\alpha_{cm}^{k}$ is the attention distribution computed by the $k$-th attention head, and $W^{k}$ is the linear transformation matrix of the $k$-th attention head.
To strengthen the target entity’s representation and retain structural information, its structural embedding is combined with the multi-head attention output through a residual connection:
$H' = W_T H + \tilde{H}$  (7)
where $H$ is the initial structural modality embedding, $\tilde{H}$ is the output of the multi-head attention, $W_T \in \mathbb{R}^{E_i \times E_f}$ is a weight matrix, $E_i$ is the dimension of the initial entity embedding, and $E_f$ is the dimension of the final entity embedding.
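A PyTorch sketch of the aggregation in Equations (2)–(7) is given below. It reflects one possible reading of the formulas (a shared projection $W$ per head, a head-specific attention vector $\mathbf{a}$, learnable scaling factors $\alpha_m$, and sigmoid as the activation $\sigma$) rather than the released implementation; module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAttentionAggregation(nn.Module):
    """Sketch of Eqs. (2)-(7): multi-head attention over the modality embeddings of a batch of entities."""

    def __init__(self, d_in: int, d_out: int, n_modalities: int, n_heads: int = 2):
        super().__init__()
        self.W = nn.ModuleList([nn.Linear(d_in, d_out, bias=False) for _ in range(n_heads)])
        self.a = nn.Parameter(torch.randn(n_heads, 2 * d_out))        # attention vector a per head
        self.alpha = nn.Parameter(torch.ones(n_heads, n_modalities))  # learnable scaling factors alpha_m
        self.W_T = nn.Linear(d_in, d_out, bias=False)                 # residual projection, Eq. (7)

    def forward(self, e_c: torch.Tensor, e_mods: torch.Tensor) -> torch.Tensor:
        # e_c: (B, d_in) structural embeddings; e_mods: (B, M, d_in) modality embeddings
        head_outputs = []
        for k, W_k in enumerate(self.W):
            h_c = W_k(e_c).unsqueeze(1)                                 # (B, 1, d_out)
            h_m = W_k(e_mods)                                           # (B, M, d_out)
            joint = torch.cat([h_c.expand_as(h_m), h_m], dim=-1)        # Eq. (2), (B, M, 2*d_out)
            beta = self.alpha[k] * F.leaky_relu(joint @ self.a[k])      # Eq. (3), (B, M)
            att = torch.softmax(beta, dim=-1)                           # Eq. (4)
            head_outputs.append((att.unsqueeze(-1) * h_m).sum(dim=1))   # Eq. (5), (B, d_out)
        e_tilde = torch.sigmoid(torch.stack(head_outputs).mean(dim=0))  # Eq. (6)
        return self.W_T(e_c) + e_tilde                                  # residual connection, Eq. (7)
```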
To evaluate triplet validity, we use the RotatE model [21] as the scoring function, which rotates entity and relation embeddings in complex space to model symmetric, antisymmetric, inverse, and compositional relations:
$F(h, r, t) = -\,\lVert h \circ r - t \rVert$  (8)
where $\circ$ denotes the Hadamard (element-wise) product. During training, we use self-adversarial negative sampling to optimize the model parameters, which improves training efficiency and sharpens the separation between positive and negative samples:
$\mathcal{L}_{kgc} = \sum_{(h, r, t) \in T} \Big( -\log \sigma\big(\gamma + F(h, r, t)\big) - \sum_{i=1}^{N} p(h_i', r, t_i') \log \sigma\big(-F(h_i', r, t_i') - \gamma\big) \Big)$  (9)
where $\sigma$ denotes the sigmoid function, $\gamma$ is a fixed margin hyperparameter, and $(h_i', r, t_i') \notin T$ with $i = 1, 2, \ldots, N$ are the $N$ negative samples of $(h, r, t)$. $p(h_i', r, t_i')$ is the self-adversarial weight used in the RotatE model to assess the reliability of the sampled negative triplets, given by
$p(h_i', r, t_i') = \dfrac{\exp\!\big(\tau F(h_i', r, t_i')\big)}{\sum_{j=1}^{N} \exp\!\big(\tau F(h_j', r, t_j')\big)}$  (10)
where τ is a temperature parameter used to control the weight of negative triplets, balancing their impact on the training process.
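The scoring and loss computations around Equations (8)–(10) can be sketched as follows; the complex-embedding layout, the small numerical epsilon, and the mean reduction over the batch are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def rotate_score(h: torch.Tensor, r: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """RotatE-style score F(h, r, t) = -||h ∘ r - t|| with complex embeddings (Eq. 8).

    h, t: (B, 2d) tensors holding real/imaginary halves; r: (B, d) rotation phases.
    """
    h_re, h_im = h.chunk(2, dim=-1)
    t_re, t_im = t.chunk(2, dim=-1)
    r_re, r_im = torch.cos(r), torch.sin(r)
    diff_re = h_re * r_re - h_im * r_im - t_re   # complex rotation h ∘ r minus t (real part)
    diff_im = h_re * r_im + h_im * r_re - t_im   # imaginary part
    return -torch.sqrt(diff_re ** 2 + diff_im ** 2 + 1e-9).sum(dim=-1)

def kgc_loss(pos_score, neg_scores, gamma=12.0, tau=1.0):
    """Self-adversarial negative sampling loss around Eqs. (9)-(10).

    pos_score: (B,) scores of the positive triples; neg_scores: (B, N) scores of the N negatives.
    """
    p = torch.softmax(tau * neg_scores, dim=-1).detach()   # Eq. (10) weights, no gradient
    pos_term = -F.logsigmoid(gamma + pos_score)
    neg_term = -(p * F.logsigmoid(-neg_scores - gamma)).sum(dim=-1)
    return (pos_term + neg_term).mean()
```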

3.4. Adversarial Training Module

The previous modality aggregation module generates the final target entity embeddings by dynamically adjusting and fusing modal embeddings. However, it only captures basic modality features and fails to enrich their expressive power with deeper semantic information. Inspired by generative adversarial networks, we propose a novel adversarial training (AT) module for MMKGC to address this limitation. The AT module consists of two components: the generator (G) and the discriminator (D).

3.4.1. Generator Design

The proposed generator G architecture consists of stacked residual blocks, each with two linear layers and a non-linear activation function. This design alleviates the vanishing gradient problem through skip connections, enhancing information flow. The input includes a noise vector for randomness and real multi-modal structured features for semantic diversity:
$e_{gen} = G(z, e_{real}) = RB_n\big(RB_{n-1}(\cdots RB_1(z, e_{real}))\big)$  (11)
where $e_{gen}$ is the generated modality-specific entity embedding, which has the same feature shape as the real entity embedding $e_{real}$; $z \sim \mathcal{N}(0, 1)$ is random noise; $n$ is the number of residual blocks; and $RB_i$ denotes the operation of the $i$-th residual block. The computation of each residual block can be uniformly written as
$RB(x) = \mathrm{SiLU}\big(W_2 \cdot \mathrm{SiLU}(W_1 \cdot x + b_1) + b_2\big) + x$  (12)
where $x$ is the input features, $W_1$ and $W_2$ are the weight matrices of the linear transformations, $b_1$ and $b_2$ are the corresponding bias terms, and $\mathrm{SiLU}$ is the non-linear activation function applied to each layer.
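A minimal PyTorch sketch of the generator in Equations (11) and (12) follows; the input projection that merges the noise vector with the real modal feature and the default noise dimension are assumptions, since the text only specifies the residual-block stack.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Eq. (12): RB(x) = SiLU(W2 · SiLU(W1 · x + b1) + b2) + x."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.fc2(self.act(self.fc1(x)))) + x

class Generator(nn.Module):
    """Sketch of Eq. (11): stacked residual blocks over [noise z ; real modal feature]."""

    def __init__(self, feat_dim: int, noise_dim: int = 64, n_blocks: int = 4):
        super().__init__()
        self.noise_dim = noise_dim
        self.proj = nn.Linear(feat_dim + noise_dim, feat_dim)  # assumed merge of z and e_real
        self.blocks = nn.Sequential(*[ResidualBlock(feat_dim) for _ in range(n_blocks)])

    def forward(self, e_real: torch.Tensor) -> torch.Tensor:
        z = torch.randn(e_real.size(0), self.noise_dim, device=e_real.device)  # z ~ N(0, 1)
        x = self.proj(torch.cat([z, e_real], dim=-1))
        return self.blocks(x)  # e_gen, same shape as e_real
```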

3.4.2. Discriminator Choice

In multi-modal knowledge graph completion, approaches such as AdaMF-MAT [18] and NATIVE [19] utilize score functions as discriminators. In these methods, the score function not only serves as a tool for assessing the credibility of triples but also acts as the core mechanism for adversarial enhancement. Following this strategy, our model adopts the score function from the RotatE model as the discriminator.
In our model, the discriminator evaluates the authenticity of samples by computing the scores of generated triples and determining whether they align with the inherent relations in the knowledge graph. For a given triple $(h, r, t)$, the generator $G$ produces adversarial entity embeddings $h'$ and $t'$, which form the generated triple set $\mathcal{P}(h, r, t)$. In the adversarial training framework, the loss function is thus defined as
$\mathcal{L}_{at} = \sum_{(h, r, t) \in T} \Big( F(h, r, t) + \frac{1}{|\mathcal{P}(h, r, t)|} \sum_{(h', r, t') \in \mathcal{P}(h, r, t)} F(h', r, t') \Big)$  (13)

3.4.3. Optimization Objective

The optimization objective of the generator is to generate adversarial examples in such a way that the discriminator cannot accurately distinguish between the generated samples and the real samples. Specifically, the generator aims to minimize the discriminator’s probability of correctly classifying the generated samples while maximizing the probability that the discriminator incorrectly classifies the generated samples as real:
$\min_{G} \mathcal{L}_G = -\,\mathbb{E}_{z \sim p_Z(z)}\big[\log D(G(z))\big]$  (14)
where $G(z)$ denotes the sample generated by the generator $G$ given the input noise $z$, $D(\cdot)$ is the discriminator function, $p_Z(z)$ is the noise distribution of the generator input, $\mathbb{E}$ is the expectation operator, and $z$ is the noise vector.
The optimization objective of the discriminator is to continuously improve its ability to distinguish between real samples and generated samples, thereby providing effective feedback signals to the generator. Specifically, the discriminator aims to maximize the probability of correctly classifying real samples while minimizing the probability of misclassifying generated samples:
$\max_{D} \mathcal{L}_D = \mathbb{E}_{x \sim p_{real}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_Z(z)}\big[\log\big(1 - D(G(z))\big)\big]$  (15)
where $x$ denotes an entity sampled from the real data distribution $p_{real}(x)$, $D(G(z))$ is the probability the discriminator assigns to the generated sample $G(z)$ being real, and $z$ is the noise vector sampled from the latent distribution $p_Z(z)$.
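The two objectives can be combined into one alternating training step, as sketched below. The sketch assumes a discriminator that outputs a probability in (0, 1) (in MIAGAT-AT this role is played by the RotatE score, e.g., after a sigmoid), and both losses are negated so that they can be minimized with standard optimizers.

```python
import torch

def generator_loss(discriminator, generator, e_real):
    """Eq. (14): L_G = -E_z[log D(G(z))] -- the generator tries to make fakes look real."""
    fake = generator(e_real)                          # G(z, e_real); noise is drawn inside G
    return -torch.log(discriminator(fake) + 1e-9).mean()

def discriminator_loss(discriminator, generator, e_real):
    """Eq. (15), negated for minimization: -E_x[log D(x)] - E_z[log(1 - D(G(z)))]."""
    fake = generator(e_real).detach()                 # stop gradients into G on the D step
    real_term = torch.log(discriminator(e_real) + 1e-9).mean()
    fake_term = torch.log(1.0 - discriminator(fake) + 1e-9).mean()
    return -(real_term + fake_term)
```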

4. Experimental Setup

4.1. Datasets

To evaluate the model’s performance on the multi-modal knowledge graph completion task, three widely used benchmark datasets were selected for the experiment: DB15K [22], MKG-W [15], and MKG-Y [15]. DB15K is a multi-modal knowledge graph constructed based on DBpedia [23], which leverages the rich structured triples from DBpedia and integrates multi-modal data such as visual, textual, and numerical information, providing a more comprehensive and multi-dimensional representation of real-world entities. MKG-W and MKG-Y are derived from Wikidata [24] and YAGO [25], respectively. Detailed information about these datasets is provided in Table 1.

4.2. Baselines

To validate the effectiveness of the proposed model, a comparative analysis was conducted with current mainstream baseline models for KGC. The baseline methods include KGC models that focus solely on structural information, such as TransE [26], TransD [27], DistMult [28], ComplEx [29], and RotatE [21], as well as MMKGC models that integrate structural information with additional modality data, such as IKRL [11], TBKGC [9], TransAE [12], MMKRL [30], RSME [10], VBKGC [14], OTKGE [13], IMF [8], VISTA [31], and AdaMF-MAT [18]. Additionally, KGC models based on negative sampling, such as KBGAN [32], MANS [16], and MMRNS [15], were also considered. The experimental results for the uni-modal KGC models and MMRNS are adopted from the original MMRNS paper, while the data for IMF and VISTA in the MMKGC models are obtained from NATIVE [19]. For the remaining models, the data are sourced from AdaMF-MAT [18].

4.3. Implementation Details

For each MMKG dataset, hyperparameter optimization was performed using a grid search strategy to determine the optimal configuration. The search space covered the following ranges: batch size {128, 512, 1024}, number of training epochs {250, 500, 1000, 1500}, number of attention heads {1, 2, 3}, number of residual blocks {2, 4, 6, 8}, number of negative samples {32, 64, 128}, and embedding dimension {100, 150, 200, 250, 300}. In the final implementation, the embedding dimension was set to 250, the batch size to 1024, and the number of negative samples to 128. Both the learning rate and the regularization learning rate were fixed at 1 × 10⁻⁴, with a regularization coefficient of 0.0001. The model was trained for 1000 epochs with a missing modality rate of 0.8. The margin parameter was set to 12 for the DB15K dataset and to 4 for the MKG-W and MKG-Y datasets.
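For reference, the search space and the final configuration above can be written as a small grid-search script; the parameter names and the `evaluate` callback are illustrative placeholders rather than the actual training harness.

```python
from itertools import product

SEARCH_SPACE = {
    "batch_size": [128, 512, 1024],
    "epochs": [250, 500, 1000, 1500],
    "attention_heads": [1, 2, 3],
    "residual_blocks": [2, 4, 6, 8],
    "negative_samples": [32, 64, 128],
    "embedding_dim": [100, 150, 200, 250, 300],
}

# Final configuration reported above (the margin is dataset-dependent).
BEST_CONFIG = {
    "embedding_dim": 250, "batch_size": 1024, "negative_samples": 128,
    "learning_rate": 1e-4, "regularization": 1e-4, "epochs": 1000,
    "missing_modality_rate": 0.8, "margin": {"DB15K": 12, "MKG-W": 4, "MKG-Y": 4},
}

def grid_search(evaluate):
    """Exhaustively evaluate every combination; `evaluate` maps a config dict to validation MRR."""
    keys = list(SEARCH_SPACE)
    best = max(product(*SEARCH_SPACE.values()), key=lambda vals: evaluate(dict(zip(keys, vals))))
    return dict(zip(keys, best))
```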

5. Experimental Results

5.1. Comparison with Existing Methods

Table 2 summarizes the comparative results between the proposed MIAGAT-AT framework and existing KGC baselines in terms of MRR and Hits@N metrics, where MRR (mean reciprocal rank) denotes the average reciprocal rank of the correct entities and Hits@N represents the proportion of correct entities ranked among the top N predictions. Baseline results are sourced as detailed in Section 4.2. Experimental results demonstrate that MIAGAT-AT consistently outperforms all baseline methods across all evaluation metrics on the three datasets. Compared to the conventional structure-based KGC model RotatE [21], MIAGAT-AT achieves the most significant improvements on the DB15K dataset, with gains of 7.7% in MRR, 10.6% in Hits@1, 5.8% in Hits@3, and 3.3% in Hits@10. The experimental results demonstrate that incorporating multi-modal information with conventional structural representations significantly enhances the model’s semantic comprehension of knowledge, leading to substantial performance improvements in KGC tasks.
Against the state-of-the-art multi-modal model AdaMF-MAT [18], MIAGAT-AT achieves further improvements on DB15K, especially in MRR and Hits@1, with improvements of 1.9% and 3.2%, respectively. These results demonstrate MIAGAT-AT’s superior capability in accurate link prediction. Even when processing the commonly employed modalities (structure, visual, and text modalities), MIAGAT-AT consistently outperforms all baseline models, further underscoring its robust capability in multi-modal integration. Furthermore, comparisons with adversarial training models such as MMKRL [30], KBGAN [32], and MMRNS [15] validate the effectiveness of the proposed adversarial module in enhancing cross-modal fusion and representation learning.
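Because all comparisons in Table 2 rely on MRR and Hits@N, a short reference implementation of these ranking metrics is given below; it assumes that the filtered rank of the correct entity has already been computed for each test query.

```python
def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n):
    """Fraction of queries whose correct entity is ranked within the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

# Example: ranks of the correct entity for five hypothetical queries.
ranks = [1, 3, 2, 10, 1]
print(mrr(ranks), hits_at_n(ranks, 1), hits_at_n(ranks, 10))
```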

5.2. Generalization Experiment

We introduce an adversarial training (AT) module to enhance the expressiveness of modality embeddings. To evaluate the generalization capability of this module, we apply it to existing methods such as TBKGC [9], AdaMF [18], and NATIVE [19]. In the configuration without adversarial training, the model proceeds directly to the decoder for scoring and ranking to accomplish knowledge graph completion. The initial experimental data for these methods, prior to the incorporation of the adversarial training module, originates from our reproduction results. The experimental results are presented in Figure 2.
The experimental results demonstrate that incorporating the adversarial training module leads to significant performance improvements across all metrics for the three evaluated multi-modal knowledge graph completion models, with particularly notable gains in Hits@1 and Hits@3. The adversarial training module enhances model performance through a generator–discriminator adversarial game mechanism: the generator continuously produces high-quality adversarial samples, while the discriminator improves its ability to distinguish real samples from generated samples. This iterative adversarial process enables the generated samples to progressively approximate the real data distribution while substantially improving the model’s robustness against perturbations.

5.3. Case Study

To evaluate the performance of MIAGAT-AT in multi-modal information fusion tasks, we conducted missing triple prediction experiments on the MKG-Y dataset and compared it with mean modal aggregation methods. The experimental results demonstrate that MIAGAT-AT achieves superior prediction performance, generating higher triple prediction scores and effectively prioritizing semantically relevant candidate entities. Specifically, the top three predictions of MIAGAT-AT show strong alignment with the head entities and relations in the missing triples.
As shown in Table 3, all candidate entities in Case 1 are of the type “movie,” while in Table 4, Case 2 consists entirely of “geographical location” entities. In contrast, the mean modal aggregation method exhibits comparatively weaker prediction performance. For the same set of candidate entities, this method yields lower triple prediction scores than MIAGAT-AT and includes irrelevant entities in its top-three predictions, such as “J.K. Simmons” (a person) in Case 1. Although both methods successfully predict the correct answers for the missing triples in Cases 1 and 2, MIAGAT-AT achieves higher prediction scores, which represent distance metrics between positive and negative samples. This enhanced discriminative capability enables more accurate identification of missing triple entities.

5.4. Ablation Study

To evaluate the contribution of each component in the multi-modal information aggregation module, we conducted a comprehensive ablation study by systematically removing (1) individual modality inputs (visual, textual, and numerical) and (2) the modality scaling coefficients (MS). Our experiments revealed that removing any of these components led to a consistent degradation in link prediction performance. The experimental results are shown in Table 5. The introduction of the numerical modality provides additional constraints that significantly improve the model’s prediction accuracy. Experimental results demonstrate that removing this modality leads to noticeable performance degradation, with MRR decreasing by 1.4% and Hits@1 dropping by 1.8%. These findings confirm that the numerical modality effectively narrows the prediction space, guiding the model to make more precise inferences within reasonable bounds and consequently enhancing its link prediction capability.
In the adversarial training module, we evaluated the impact of different generator architectures on link prediction performance by (1) replacing standard residual blocks with dense residual connections and (2) substituting multi-layer perceptrons (MLPs) with convolutional neural networks (CNNs). The experimental results demonstrated that the combination of residual blocks and MLPs achieved the best performance. Specifically, the multi-hop connections in dense residual structures introduced additional computational complexity and redundant information, making it difficult for the model to maintain efficient gradient propagation and parameter optimization during training, thereby limiting performance improvements. While CNNs excel at capturing local features and spatial relationships, their ability to model cross-modal feature interactions as generators was suboptimal, as they failed to fully leverage their potential in extracting global information from multiple modalities.

6. Conclusions

In this paper, we propose an improved MMKGC model called MIAGAT-AT, which addresses two key limitations in existing multi-modal knowledge graph completion approaches: (1) the neglect of modality imbalance among entities and (2) insufficient utilization of multi-modal information. The MIAGAT-AT framework comprises two core components: the modality information aggregation (MIAGAT) module and the adversarial training (AT) module. The MIAGAT module dynamically allocates attention weights across different modalities and integrates multi-modal information to generate joint embeddings. The AT module enhances modality representation learning through a generative adversarial network-based training strategy. To comprehensively evaluate the proposed model, we conducted experiments on three distinct MMKGC datasets: MKG-Y, MKG-W, and DB15K. The results demonstrate that our model outperforms a range of baseline methods, including traditional structured knowledge graph completion models and multi-modal knowledge graph completion models, thereby validating its effectiveness and robustness.

Author Contributions

H.Y.: Conceptualization, data curation, and formal analysis. E.A.: Visualization, writing (original draft), and validation. S.I.: Supervision. A.H.: Funding acquisition, supervision, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Tianshan Talents Cultivation Program—Leading Talents for Scientific and Technological Innovation (No. 2024TSYCLJ0002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available upon reasonable request.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT-4 (OpenAI, the version as of June 2024) for language polishing and improving text fluency while maintaining academic rigor. All AI-generated content was rigorously reviewed and modified by the authors, who take full responsibility for the final publication.

Conflicts of Interest

The authors have no competing interests to declare that are relevant to the content of this article.

References

  1. Xu, Z.; Cruz, M.J.; Guevara, M.; Wang, T.; Deshpande, M.; Wang, X.; Li, Z. Retrieval-augmented generation with knowledge graphs for customer service question answering. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 2905–2909. [Google Scholar]
  2. Li, C.; Cao, Y.; Zhu, Y.; Cheng, D.; Li, C.; Morimoto, Y. Ripple knowledge graph convolutional networks for recommendation systems. Mach. Intell. Res. 2024, 21, 481–494. [Google Scholar] [CrossRef]
  3. Li, J.; Peng, H.; Li, L. Sublinear smart semantic search based on knowledge graph over encrypted database. Comput. Secur. 2025, 151, 104319. [Google Scholar] [CrossRef]
  4. Liang, K.; Meng, L.; Liu, M.; Liu, Y.; Tu, W.; Wang, S.; Zhou, S.; Liu, X.; Sun, F.; He, K. A survey of knowledge graph reasoning on graph types: Static, dynamic, and multi-modal. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9456–9478. [Google Scholar] [CrossRef] [PubMed]
  5. Chen, Z.; Chen, J.; Zhang, W.; Guo, L.; Fang, Y.; Huang, Y.; Zhang, Y.; Geng, Y.; Pan, J.Z.; Song, W.; et al. Meaformer: Multi-modal entity alignment transformer for meta modality hybrid. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3317–3327. [Google Scholar]
  6. Liang, S.; Zhu, A.; Zhang, J.; Shao, J. Hyper-node relational graph attention network for multi-modal knowledge graph completion. ACM Trans. Multim. Comput. Commun. Appl. 2023, 19, 1–21. [Google Scholar] [CrossRef]
  7. Wang, S.; Wei, X.; Nogueira dos Santos, C.N.; Wang, Z.; Nallapati, R.; Arnold, A.; Xiang, B.; Yu, P.S.; Cruz, I.F. Mixed-curvature multi-relational graph neural network for knowledge graph completion. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 1761–1771. [Google Scholar]
  8. Li, X.; Zhao, X.; Xu, J.; Zhang, Y.; Xing, C. IMF: Interactive multimodal fusion model for link prediction. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 2572–2580. [Google Scholar]
  9. Mousselly-Sergieh, H.; Botschen, T.; Gurevych, I.; Roth, S. A multimodal translation-based approach for knowledge graph representation learning. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, New Orleans, LA, USA, 5–6 June 2018; pp. 225–234. [Google Scholar]
  10. Wang, M.; Wang, S.; Yang, H.; Zhang, Z.; Chen, X.; Qi, G. Is visual context really helpful for knowledge graph? A representation learning perspective. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 2735–2743. [Google Scholar]
  11. Xie, R.; Liu, Z.; Luan, H.; Sun, M. Image-embodied knowledge representation learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI 2017), Melbourne, Australia, 19–25 August 2017; pp. 3140–3146. [Google Scholar]
  12. Wang, Z.; Li, L.; Li, Q.; Zeng, D. Multimodal data enhanced representation learning for knowledge graphs. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN 2019), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar]
  13. Cao, Z.; Xu, Q.; Yang, Z.; He, Y.; Cao, X.; Huang, Q. Otkge: Multi-modal knowledge graph embeddings via optimal transport. Adv. Neural Inf. Process. Syst. 2022, 35, 39090–39102. [Google Scholar]
  14. Zhang, Y.; Zhang, W. Knowledge graph completion with pre-trained multimodal transformer and twins negative sampling. arXiv 2022, arXiv:2209.07084. [Google Scholar] [CrossRef]
  15. Xu, D.; Xu, T.; Wu, S.; Zhou, J.; Chen, E. Relation-enhanced negative sampling for multimodal knowledge graph completion. In Proceedings of the 30th ACM International Conference on Multimedia (ACM MM 2022), Lisbon, Portugal, 10–14 October 2022; pp. 3857–3866. [Google Scholar]
  16. Zhang, Y.; Chen, M.; Zhang, W. Modality-aware negative sampling for multi-modal knowledge graph embedding. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN 2023), Gold Coast, Australia, 18–23 June 2023; pp. 1–8. [Google Scholar]
  17. Zhang, H.; Han, Q.; Sun, H.; Liu, C. Multi-modal knowledge graph representation based on counterfactual data enhanced learning link prediction. In Proceedings of the 2024 11th International Conference on Behavioural and Social Computing (BESC 2024), Okayama, Japan, 30 October–1 November 2024; pp. 1–7. [Google Scholar]
  18. Zhang, Y.; Chen, Z.; Liang, L.; Chen, H.; Zhang, W. Unleashing the power of imbalanced modality information for multi-modal knowledge graph completion. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Turin, Italy, 20–25 May 2024; pp. 17120–17130. [Google Scholar]
  19. Zhang, Y.; Chen, Z.; Guo, L.; Xu, Y.; Hu, B.; Liu, Z.; Zhang, W.; Chen, H. Native: Multi-modal knowledge graph completion in the wild. In Proceedings of the 47th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024), Washington, DC, USA, 14–18 July 2024; pp. 91–101. [Google Scholar]
  20. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  21. Sun, Z.; Deng, Z.-H.; Nie, J.-Y.; Tang, J. RotatE: Knowledge graph embedding by relational rotation in complex space. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  22. Liu, Y.; Li, H.; Garcia-Duran, A.; Niepert, M.; Onoro-Rubio, D.; Rosenblum, D.S. MMKG: Multi-modal knowledge graphs. In Proceedings of the 16th Extended Semantic Web Conference (ESWC 2019), Portorož, Slovenia, 2–6 June 2019; pp. 459–474. [Google Scholar]
  23. Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. Dbpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC 2007), Busan, Republic of Korea, 11–15 November 2007; pp. 722–735. [Google Scholar]
  24. Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014, 57, 78–85. [Google Scholar] [CrossRef]
  25. Suchanek, F.M.; Kasneci, G.; Weikum, G. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web (WWW 2007), Banff, AB, Canada, 8–12 May 2007; pp. 697–706. [Google Scholar]
  26. Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. Adv. Neural Inf. Process. Syst. 2013, 26, 2787–2795. [Google Scholar]
  27. Ji, G.; He, S.; Xu, L.; Liu, K.; Zhao, J. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2015), Beijing, China, 26–31 July 2015; pp. 687–696. [Google Scholar]
  28. Yang, B.; Yih, S.W.-T.; He, X.; Gao, J.; Deng, L. Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  29. Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; Bouchard, G. Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 19–24 June 2016; pp. 2071–2080. [Google Scholar]
  30. Lu, X.; Wang, L.; Jiang, Z.; He, S.; Liu, S. MMKRL: A robust embedding approach for multi-modal knowledge graph representation learning. Appl. Intell. 2022, 52, 7480–7497. [Google Scholar] [CrossRef]
  31. Lee, J.; Chung, C.; Lee, H.; Jo, S.; Whang, J. VISTA: Visual-textual knowledge graph representation learning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 7314–7328. [Google Scholar]
  32. Cai, L.; Wang, W.Y. KBGAN: Adversarial learning for knowledge graph embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), New Orleans, LA, USA, 1–6 June 2018; pp. 1470–1480. [Google Scholar]
Figure 1. Overview of the MIAGAT-AT structure.
Figure 2. Generalization experiment of the adversarial training module. (a) MRR, (b) Hits@1, (c) Hits@3, and (d) Hits@10.
Table 1. Statistical information of the datasets.

Dataset   Entities   Relations   Train    Valid   Test
MKG-W     15,000     169         34,196   4276    4274
MKG-Y     15,000     28          21,310   2665    2663
DB15K     12,842     279         79,222   9902    9904
Table 2. Comparison of link prediction performance, where MIAGAT-AT* leverages structure, image, and text modalities, while MIAGAT-AT utilizes structure, image, numerical, and text modalities.

Baselines         DB15K                       MKG-W                       MKG-Y
                  MRR   H@1   H@3   H@10      MRR   H@1   H@3   H@10      MRR   H@1   H@3   H@10
TransE [26]       24.9  12.8  31.5  47.1      29.2  21.1  33.2  44.2      30.7  23.5  35.2  43.4
TransD [27]       21.5   8.3  29.9  44.2      25.6  15.9  33.0  40.2      26.4  17.0  33.6  40.3
DistMult [28]     23.0  14.8  26.3  39.6      21.0  15.9  22.3  30.9      25.0  19.3  27.8  36.0
ComplEx [29]      27.5  18.4  31.6  45.4      24.9  19.1  26.7  36.7      28.7  22.3  32.1  40.9
RotatE [21]       29.3  17.9  36.1  49.7      33.7  26.8  36.7  46.7      35.0  29.1  38.4  45.3
IKRL [11]         26.8  14.1  34.9  49.1      32.4  26.1  34.8  44.1      33.2  30.4  34.3  38.3
TBKGC [9]         28.4  15.6  37.0  49.9      31.5  25.3  34.0  43.2      34.0  30.5  35.3  40.1
TransAE [12]      28.1  21.3  31.2  41.2      30.0  21.2  34.9  44.7      28.1  25.3  29.1  33.0
MMKRL [30]        26.8  13.9  35.1  49.4      30.1  22.2  34.1  44.7      36.8  31.7  39.8  45.3
RSME [10]         29.8  24.2  32.1  40.3      29.2  23.4  32.0  40.4      34.4  31.8  36.1  39.1
VBKGC [14]        30.6  19.8  37.2  49.4      30.6  24.9  33.0  40.9      37.0  33.8  38.8  42.3
OTKGE [13]        23.9  18.5  25.9  34.2      34.4  28.9  36.3  44.9      35.5  32.0  37.2  41.4
IMF [8]           32.3  24.2  36.0  48.2      34.5  28.8  36.6  45.4      35.8  33.0  37.1  40.6
VISTA [31]        30.4  22.5  33.6  45.9      32.9  26.1  35.4  45.6      30.5  24.9  32.4  41.5
AdaMF [18]        35.1  25.3  41.1  52.9      35.9  29.0  39.0  48.4      38.6  34.3  40.6  45.8
KBGAN [32]        25.7   9.9  37.0  51.9      29.5  22.2  34.9  40.6      29.7  22.8  34.9  40.2
MANS [16]         28.8  16.9  36.6  49.3      30.9  24.9  33.6  41.8      29.0  25.3  31.4  34.5
MMRNS [15]        29.7  17.9  36.7  51.0      34.1  27.4  37.5  46.8      35.9  30.5  39.1  45.5
MIAGAT-AT*        35.7  26.0  41.4  53.2      36.7  30.0  39.8  48.9      38.8  34.7  40.7  45.9
MIAGAT-AT         37.0  28.5  41.9  53.0      36.7  30.0  39.8  48.9      38.8  34.7  40.7  45.9
Table 3. Missing Triple Case 1.

Missing Triple: (Matt Damon, actedIn, ?)
Rank   MIAGAT-AT                 Score
1      The Good Shepherd         2.21
2      Rounders                  2.03
3      Behind the Candelabra     1.90
Rank   Mean                      Score
1      The Good Shepherd         1.99
2      Rounders                  1.74
3      J.K. Simmons              1.73
Table 4. Missing Triple Case 2.

Missing Triple: (German Iran, isLocatedIn, ?)
Rank   MIAGAT-AT                  Score
1      Kharqan Rural District     3.14
2      Bastam District            3.06
3      Kalat-e Hay-ye Gharbi      1.44
Rank   Mean                       Score
1      Kharqan Rural District     2.94
2      Bastam District            2.84
3      Kalat-e Hay-ye Gharbi      1.08
Table 5. Ablation study results of the MIAGAT-AT model.

Module    Settings              MRR    Hits@1   Hits@3   Hits@10
MIAGAT    w/o Visual            31.3   18.7     39.2     53.1
          w/o Textual           31.4   19.3     39.1     52.9
          w/o Numerical         31.1   18.3     39.2     53.0
          w/o MS                31.5   19.0     39.3     53.1
          w/o AT                32.5   20.1     40.3     53.1
AT        Residual-CNN          36.2   27.1     41.4     53.1
          DenseResidual-CNN     36.0   26.9     41.3     53.0
          DenseResidual-MLP     36.9   28.4     41.9     52.8
          Residual-MLP          37.0   28.5     41.9     53.0
