Article

GC4MRec: Generative-Contrastive for Multimodal Recommendation

by Lei Wang, Yingjie Li *,†, Heran Wang and Jun Li
School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun 130117, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2025, 15(7), 3666; https://doi.org/10.3390/app15073666
Submission received: 19 January 2025 / Revised: 22 March 2025 / Accepted: 25 March 2025 / Published: 27 March 2025
(This article belongs to the Section Applied Industrial Technologies)

Abstract

The rapid growth of information technology has led to an explosion of data, posing a significant challenge for data processing. Recommendation systems aim to address this by providing personalized content recommendations to users from vast datasets. Recently, multimodal recommendation systems have gained considerable attention due to their ability to leverage diverse data modalities (e.g., images and text) for more accurate recommendations. However, effectively fusing these modalities to accurately represent user preferences remains a challenging task, despite progress made by existing multimodal recommendation approaches. To address this challenge, we propose a novel method which we call GC4MRec (Generative-Contrastive for Multimodal Recommendation). On the one hand, we design a bilateral information flow module using two graph convolutional networks (GCNs). This module captures modal features from two distinct perspectives—standard and generatively augmented—to extract latent preferences. On the other hand, we introduce a novel modality fusion module that dynamically represents user multimodal fusion preferences, enabling the construction of accurate user preference profiles. Finally, we evaluate our proposed method, GC4MRec, on three public real-world datasets and demonstrate its effectiveness compared to the state-of-the-art methods.

1. Introduction

Recommender systems are widely applied in various domains of the internet, such as e-commerce, social media, and entertainment, aiming to personalize recommendations by modeling collected data and establishing latent mappings between users and items. Content-based recommendation models extract features representing item characteristics and construct user profiles reflecting past preferences [1,2]. Collaborative filtering (CF) identifies users with similar preferences based on their ratings or behaviors and then recommends items liked by similar users but not yet experienced by the target user [3,4]. CF techniques may utilize a variety of characteristics as the basis for establishing similarity [5,6]. Traditional recommendation algorithms build relatively simple and shallow mappings between users and recommendation targets; they struggle to discern user preferences when interaction data are sparse and are susceptible to data biases.
In recent years, deep learning-based recommendation algorithms have made significant progress. Graph convolutional networks (GCNs) are increasingly employed in recommender systems to model user–item relationships [7,8]. GCNs operate on a user–item interaction graph, propagating information through graph convolutional layers [9]. They capture higher-order connectivity between nodes by aggregating and propagating information across nodes with interactions, enabling the recommendation of items of interest to users [10,11]. However, conventional deep learning-based recommendation systems fail to leverage the modality diversity of user interaction data. In real-world scenarios, user preferences can be reflected by data from different modalities, which exhibit user data complementarity. The proper fusion of these modalities can construct a comprehensive user profile, enabling the more accurate capture of user needs compared to single-modal recommendation systems.
Existing works have combined multimodal data with GCNs, constructing independent channels to convolve and propagate user preference information for different modalities [12,13]. DGVAE [14] enhances the interpretability of multimodal recommendations by mapping both image and text modalities into a unified textual representation and performing graph encoding and reconstruction separately. Deconstructing graph nodes to incorporate auxiliary information can enhance the performance of recommender systems [15,16]. GCNs are particularly well suited for processing multimodal data due to their inherent ability to model complex relationships [17,18]. For instance, MMGCN [12] offers a novel approach to personalized micro-video recommendation. Its strength lies in explicitly modeling user preferences across modalities, leveraging graph convolutional networks for representation refinement through information exchange on modality-aware user–item graphs. FREEDOM [19] addresses multimodal recommendation by freezing the item–item graph and denoising the user–item graph via degree-sensitive pruning, achieving state-of-the-art performance with significantly reduced computational cost. LATTICE [13] addresses multimedia recommendation by learning latent item–item structures from multimodal features using a novel modality-aware structure learning layer and graph convolutions. However, some methods employ early-stage modality fusion techniques, such as simple weighted fusion or concatenation, lacking the ability to deeply explore the underlying preferences of user modalities. Moreover, when the number of convolutional layers in GCNs increases, over-smoothing issues arise. Some methods lack optimization for this drawback, while others mitigate it in different ways; for instance, LightGCN [8] alleviates over-smoothing by linearly reducing the contribution factor of higher-order graph convolution propagation.
Recently, contrastive learning (a form of self-supervised learning) has been widely applied in recommender systems [20,21]. It enhances model representation capabilities by maximizing the similarity between positive samples and minimizing the similarity between negative samples [22,23]. However, the sparsity of samples and the design of pretext tasks significantly impact the training effectiveness of contrastive learning.
To address the aforementioned challenges, we propose a novel multimodal recommendation learning framework. Regarding data sparsity and pretext task setting in contrastive learning, we first project the graph features using a linear layer, followed by graph reconstruction using a Variational Graph Autoencoder (VGAE) [24], resulting in original graph G and augmented graph G’. The original graph and the augmented graph are treated as a pair of positive samples, while others are treated as negative samples. The InfoNCE loss function is employed to optimize the representation learning ability of the linear layer. The introduction of VGAE enhances the model’s ability to explore latent user features and narrows the distance in feature space between nodes with similar preferences. To mitigate over-smoothing, we employ cross-residual connections between the input and output layers of the two GCNs. This counteracts the tendency of individual GCNs to produce homogenized representations. Due to the generative capability of VGAE, it can mitigate the over-smoothing effect of GCN.
Regarding multimodal fusion, we hypothesize that user preferences exhibit a long-tail effect, meaning that users pay significant attention to a small portion of item features. Therefore, we propose a metadata slicing generation method: we segment the features of users and items into chunks of varying lengths, where each chunk is mapped to a specific attribute entity (color, size, etc.), referred to as “metadata”. The length of the metadata reflects the intensity of preference for a particular attribute. We hypothesize that users are typically interested in only a subset of attributes; therefore, we determine the length of metadata by sampling from a Gaussian distribution. This ensures that a small portion of metadata captures the majority of the feature length, thereby effectively identifying the users’ primary points of interest. The channel attention mechanism [25], originally used to weight convolutional feature channels in image processing, is introduced into GC4MRec for fusing different modal data. In this paper, each metadata point is treated as a channel. First, we pool the metadata, cross-add the modalities, and perform channel-wise convolution to obtain channel weights. The channel weights are multiplied by the original metadata channels to obtain the final metadata values. Finally, the data of the different modalities are summed and averaged to obtain the final modality fusion result. In summary, our contributions are as follows:
  • A novel multimodal learning paradigm employing enhanced self-supervised learning is introduced to uncover latent user preferences and enable effective multimodal fusion.
  • Latent user preferences are captured more effectively via a dual spatio-temporal cross-residual module, which integrates information from both the original and enhanced graph at various temporal stages.
  • A metadata-aware fusion method provides a flexible mechanism for fusing multimodal data.
  • Experimental validation is performed on several public datasets.
Following this introductory section, Section 2 reviews the relevant literature on multimodal recommendation, contrastive learning, and generative models, identifying limitations in existing approaches. Section 3 details the GC4MRec model, including graph information enhancement, subgraph generation, reconstruction, contrastive learning, and multimodal fusion. Section 4 presents a comprehensive experimental evaluation on three datasets, comparing GC4MRec against baselines and assessing component contributions. Finally, Section 5 concludes by summarizing findings, discussing implications, and outlining future research directions.

2. Related Work

2.1. Multimodal Recommendation

The recent surge in computational power and the explosive growth of data have led to a growing number of studies exploring the integration of multimodal techniques into recommender systems. Multimodal recommender systems leverage data from various modalities, such as texts and images, to provide a more comprehensive and complementary representation of rich semantic information. This approach facilitates a more accurate and nuanced understanding of user preferences, ultimately leading to improved recommendation performance.
VBPR [26] pioneered the incorporation of image modality semantic information with ID-based user interaction data to generate user preference rankings for recommended items. Based on this foundation, researchers have conducted further exploratory work with the aim of capturing fine-grained user–item interaction preferences. To this end, researchers have incorporated attention mechanisms into multimodal recommendation systems [27,28,29], effectively fusing information across different modalities, leveraging their complementary aspects, and enhancing the accuracy of content recommendations. High-order connectivity information is crucial for the performance of recommendation systems. To this end, numerous studies have leveraged graph neural networks to propagate latent interaction information among users or items, enabling more comprehensive and accurate preference recommendations for users [12,13,29].
The key distinction between multimodal recommendation and unimodal recommendation lies in effectively handling data from diverse modalities and seamlessly fusing information represented across these modalities. MMSSL [30] leverages adversarial self-supervised learning to enhance modality-aware graphs for extracting modal data and employs cross-modal attention mechanisms for modality fusion. MMGCN [12] establishes interaction graphs for different modalities, extracting unimodal information individually before fusing them through a stacking operation.
In contrast to conventional approaches, our fusion method acknowledges the specific characteristics of recommendation tasks. We posit that user attention exhibits a long-tail distribution, implying that users prioritize specific features. Therefore, our method focuses on capturing and assigning higher weights to these user-emphasized features.

2.2. Contrastive Learning

Contrastive learning, a self-supervised learning approach, has gained significant traction in recent years. It excels at learning effective embedding representations by maximizing the similarity between positive sample pairs and minimizing the similarity between negative pairs within an embedding space. This method has proven particularly valuable in scenarios with limited labeled data. CLIP [31] constructs a multimodal embedding space and leverages contrastive learning to explore the semantic relationships between unlabeled text–image pairs. This approach results in a pre-trained model with a strong understanding of both textual and visual semantics, inspiring us to explore the integration of contrastive learning within multimodal recommender systems.
SGL [32] employs node dropout, edge dropout, and random walk operations to generate contrastive views, maximizing the agreement between representations of the same node across different views. This self-discrimination approach effectively learns the consistency of individual nodes under diverse perspectives, ultimately leading to enhanced node representation capabilities. CL4SRec [33] introduces a contrastive learning approach for sequence recommendation that leverages self-supervision signals captured through data augmentation. Specifically, it employs three augmentation strategies: item cropping, item masking, and item reordering, to generate diverse views of the input data. DHCN [34] proposes a contrastive learning framework based on hypergraph session views for sequence recommendation. This method constructs hypergraphs to represent user sessions and applies contrastive loss to encourage similar sessions to have closer representations in the embedding space. EGLN [35] focuses on graph-based recommendation and performs graph augmentation by maximizing the mutual information between local and global features of the graph. This approach aims to enhance the representation learning by capturing both fine-grained and high-level structural information. NCL [36] enhances the agreement between nodes with connectivity within the neighborhood while reducing the agreement between nodes from different neighborhoods. This method emphasizes the local structure of the graph and improves the node representation learning.

2.3. Generative Models in Recommendation

Generative models have emerged as a transformative force across various domains in recent years. By learning the underlying distribution of data, these models can generate novel data samples. This remarkable generative capability has injected fresh vitality into recommendation systems. DreamRec [37] generates an ideal item by learning user interaction sequences and recommends items within the sample space that exhibit the highest similarity to the ideal item. ACVAE [38] enhances the accuracy and robustness of Variational Autoencoders (VAEs) by incorporating an adversarial component. CLLM4Rec [39] adapts large language models (LLMs) to collaborative recommendation tasks through prompting techniques. BERT4Rec [40] leverages the powerful sequence modeling capabilities of transformers to model user interaction sequences and predict subsequent user behavior. Meta-SGCL [41] optimizes the sequence representation ability of recommendation models by combining generative and contrastive learning approaches.

3. Methodology

In this section, we detail the proposed GC4MRec. We begin by introducing the fundamental notations and definitions used throughout this paper. Figure 1 shows the overall architecture of GC4MRec. We model different modalities of data separately, resulting in modality-aware graphs. During GCN propagation, we design two distinct GCNs to capture high-order connectivity. Through cross-residual connections, we integrate cross-temporal and cross-modal information, mitigating over-smoothing. The GCN outputs are then fed into a Variational Graph Autoencoder (VGAE), where subgraphs are sampled to obtain augmented graph data. These augmented data are utilized in a contrastive learning framework to train the model and capture latent user preferences. In the multimodal fusion stage, we introduce the concept of metadata to model user preferences at different granularities. Channel attention is employed to compute attention weights for metadata from different modalities, thereby dynamically fusing multimodal information based on specific user preferences.

3.1. Preliminaries

Let us define $U = \{u_1, u_2, \ldots, u_{|U|}\}$ as the user set and $I = \{i_1, i_2, \ldots, i_{|I|}\}$ as the item set. $A \in \mathbb{R}^{|U| \times |I|}$ denotes the user–item adjacency matrix, capturing the interactions between users and items. $G = (V, E)$ denotes the user–item interaction graph, where $V = U \cup I$ is the set of all nodes (users and items) and $E \subseteq V \times V$ is the set of edges. $M = \{t, i\}$ denotes the set of modalities, where $t$ represents the text modality and $i$ represents the image modality. We denote by $f_i^t \in \mathbb{R}^{d_t}$ the textual feature embedding of item $i \in I$, where $d_t$ is the dimensionality of the textual feature space. $e_i \in \mathbb{R}^d$ and $e_u \in \mathbb{R}^d$ denote the overall embedding representations of item $i \in I$ and user $u \in U$, respectively, where $d$ is the embedding dimension.

3.2. Graph Information Enhancement

To obtain contrastive positive samples with both rich semantic representations and smooth contextual information, we employ a graph information enhancement approach.

3.2.1. Modality-Aware Subgraph Generation

Specifically, we utilize a Variational Graph Autoencoder (VGAE) to re-encode subgraphs, mapping them into latent representations. For each modality $m \in M$, we construct a modality-specific subgraph $G_c^m = (V_c, E_c, X_c^m)$ centered at node $c$. Here, $X_c^m \in \mathbb{R}^{N \times d_m}$ is the feature matrix of the subgraph nodes for modality $m$, where the $i$-th row corresponds to the feature vector $f_{v_i}^m$ of node $v_i \in V_c$. The VGAE model learns to encode the structural information and node features of $G_c^m$ into a lower-dimensional latent representation $z_c^m \in \mathbb{R}^d$, where $d$ is the dimension of the latent space. This encoding process aims to capture the essential characteristics of the subgraph from the perspective of modality $m$.
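As a rough illustration (not the authors' implementation), the following sketch shows one way such a node-centered, modality-specific subgraph could be extracted from a dense adjacency matrix; the function and variable names are hypothetical.

```python
import torch

def extract_subgraph(adj: torch.Tensor, feats: torch.Tensor, center: int, hops: int = 1):
    """Collect the nodes within `hops` steps of `center` and return the induced
    subgraph (adjacency block and modality feature rows). Illustrative only."""
    nodes = {center}
    frontier = {center}
    for _ in range(hops):
        nbrs = set()
        for n in frontier:
            nbrs.update(torch.nonzero(adj[n], as_tuple=False).flatten().tolist())
        frontier = nbrs - nodes
        nodes |= nbrs
    idx = torch.tensor(sorted(nodes), dtype=torch.long)
    sub_adj = adj[idx][:, idx]     # induced adjacency of the subgraph
    sub_feats = feats[idx]         # modality feature rows f_{v_i}^m
    return idx, sub_adj, sub_feats

# toy example: 5 nodes, 8-dimensional text features
A = torch.tensor([[0, 1, 0, 0, 1],
                  [1, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 0],
                  [1, 0, 0, 0, 0]], dtype=torch.float32)
X_text = torch.randn(5, 8)
idx, A_c, X_c = extract_subgraph(A, X_text, center=0)
print(idx, A_c.shape, X_c.shape)
```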

3.2.2. Graph Reconstruction with Resample

To obtain enhanced samples, we assume that each sample $x$ has a latent representation $\hat{x}$. This latent distribution is derived as follows:
$$q(\hat{X}^m \mid X^m, A) = \prod_{i=1}^{N} q(\hat{x}_i^m \mid \hat{X}^m, A),$$
where $\hat{X}^m$ and $X^m$ represent the latent features and raw features of modality $m$, respectively, and $A$ is the adjacency matrix. $\hat{x}_i^m$ denotes the latent representation distribution of a user or item. We define this latent distribution as:
$$q(\hat{x}_i^m \mid \hat{X}^m, A) = \mathcal{N}\!\big(\hat{x}_i^m \mid \mu_i^m, \mathrm{diag}\big((\sigma_i^m)^2\big)\big),$$
where $\mathrm{diag}\big((\sigma_i^m)^2\big)$ denotes the diagonal covariance matrix of the latent distribution for generating $\hat{X}^m$, and $\sigma_i^m$ represents the standard deviations for the $i$-th node in modality $m$. This diagonal structure assumes independence among latent dimensions, balancing computational efficiency and uncertainty modeling. It enables the VGAE to generate graph nodes that capture users' latent preferences, improving recommendation performance by accurately representing user–item interactions. Here, $\mu_i^m$ and $\sigma_i^m$ are defined as follows:
$$\mu_i^m = \mathrm{GCN}_{\mu}\big(X_i^m W_{\mu} + b_{\mu}\big) + \epsilon,$$
$$\sigma_i^m = \mathrm{GCN}_{\sigma}\big(X_i^m W_{\sigma} + b_{\sigma}\big) + \epsilon,$$
where $W_{\mu}$, $b_{\mu}$, $W_{\sigma}$, and $b_{\sigma}$ are learnable parameters, and $\epsilon$ is a noise signal sampled from a standard Gaussian distribution.
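The encoder described above can be sketched in PyTorch as follows. The equations add the noise term $\epsilon$ to $\mu$ and $\sigma$ directly, whereas this sketch uses the more common reparameterization $z = \mu + \epsilon \cdot \sigma$ with a learned log-standard deviation; treat the exact noise placement, the simple dense GCN propagation, and all names as our own assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

def gcn_prop(adj: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """One symmetric-normalized propagation step: D^{-1/2} (A + I) D^{-1/2} H."""
    a_hat = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-12).pow(-0.5)
    norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
    return norm @ h

class VGAEEncoder(nn.Module):
    """Sketch of the mu/sigma GCN branches plus a reparameterized latent sample."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.w_mu = nn.Linear(in_dim, latent_dim)      # X W_mu + b_mu
        self.w_sigma = nn.Linear(in_dim, latent_dim)   # X W_sigma + b_sigma

    def forward(self, x: torch.Tensor, adj: torch.Tensor):
        mu = gcn_prop(adj, self.w_mu(x))
        log_sigma = gcn_prop(adj, self.w_sigma(x))     # predict log-std for stability
        eps = torch.randn_like(mu)                     # standard Gaussian noise
        z = mu + eps * log_sigma.exp()                 # reparameterized latent X_hat
        return z, mu, log_sigma

# usage: 5 nodes, 8-d input features, 4-d latent space
enc = VGAEEncoder(8, 4)
z, mu, log_sigma = enc(torch.randn(5, 8), torch.eye(5))
print(z.shape)
```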

3.2.3. Optimization

During latent variable distribution reconstruction, the adjacency matrix $A$ plays a crucial role as a supervisory signal. We construct the modality-aware posterior probability distribution of the adjacency matrix as follows:
$$p(A \mid \hat{X}^m) = \prod_{i=1}^{N} \prod_{j=1}^{N} p\big(A_{ij} \mid \hat{x}_i^m, \hat{x}_j^m\big).$$
The non-probabilistic interaction edge between nodes $i$ and $j$ in the modality-aware adjacency matrix $\hat{A}^m$ is calculated as follows:
$$\hat{A}_{ij}^m = \begin{cases} 1, & \text{if } \dfrac{(\hat{x}_i^m)^{\top} \hat{x}_j^m}{\|\hat{x}_i^m\| \, \|\hat{x}_j^m\|} > \theta, \\ 0, & \text{otherwise}. \end{cases}$$
Here, $\theta$ is a learnable threshold parameter. Leveraging the foundational architecture of our approach, the contrastive learning module seeks to utilize high-quality samples. We employ VGAE because of its stronger generalization and faster generation relative to alternatives such as diffusion models. This strategy seeks to generate samples that accurately reflect underlying user preferences; only then can contrastive learning yield precise and generalizable representations, bolstering subsequent prediction accuracy. The variational lower bound, a common optimization objective for generative models, is instrumental in achieving high performance. The underlying principle is to learn a latent representation of the input graph while ensuring that the generated graph structure closely resembles the original. More specifically, our training target is to optimize the parameters of this variational model. We maximize the variational lower bound, which encourages the distribution of $\hat{X}$ to match the distribution of $X$ while penalizing its divergence from a standard Gaussian prior. The corresponding loss is defined as:
$$\mathcal{L}_{vgae} = \mathrm{KL}\big[\, q(\hat{X} \mid X, A) \,\|\, p(Z) \,\big] - \mathbb{E}_{q(\hat{X} \mid X, A)}\big[\log p(A \mid \hat{X})\big],$$
where $\mathrm{KL}$ denotes the Kullback–Leibler divergence and $p(Z) = \prod_i \mathcal{N}(z_i \mid 0, I)$ is the Gaussian prior, with $z_i$ being the latent representation of node $i$.
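A minimal sketch of this objective is shown below: the KL term pulls the latent distribution toward the standard Gaussian prior, and a binary cross-entropy term scores the reconstruction of the observed adjacency under an inner-product decoder. A helper also illustrates the cosine-similarity thresholding used to form the discrete modality-aware adjacency. The function names and the log-standard-deviation parameterization are our assumptions.

```python
import torch
import torch.nn.functional as F

def vgae_loss(mu, log_sigma, z, adj):
    """Negative-ELBO sketch: KL[q(X_hat | X, A) || N(0, I)] plus the binary
    cross-entropy of the observed adjacency under inner-product decoding."""
    kl = -0.5 * torch.mean(
        torch.sum(1 + 2 * log_sigma - mu.pow(2) - (2 * log_sigma).exp(), dim=1))
    logits = z @ z.t()                                   # decoder scores for p(A_ij | x_i, x_j)
    recon = F.binary_cross_entropy_with_logits(logits, adj)
    return kl + recon

def thresholded_adjacency(z, theta: float = 0.5):
    """Discrete modality-aware edges: cosine similarity above theta -> edge."""
    z_norm = F.normalize(z, dim=1)
    sim = z_norm @ z_norm.t()
    return (sim > theta).float()

# usage with dummy tensors shaped like the encoder sketch's outputs
z = torch.randn(5, 4); mu = torch.zeros(5, 4); log_sigma = torch.zeros(5, 4)
adj = (torch.rand(5, 5) > 0.7).float()
print(vgae_loss(mu, log_sigma, z, adj).item(), thresholded_adjacency(z).shape)
```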

3.3. Graph Contrastive Learning

To further excavate user preferences and capture intricate high-order connectivity patterns between nodes, we leverage the self-supervised nature of contrastive learning. By encouraging similar nodes to be closer in the embedding space and dissimilar nodes to be further apart, the model aims to enhance the discriminative power of node representations. This design facilitates more accurate node classification and, crucially, improves recommendation performance, as the final prediction stage of the model relies on computing the similarity between users and items to determine whether an item should be recommended to a user. Therefore, the model's ability to effectively discriminate between nodes is essential for ensuring high-quality recommendations.
Designing a reasonable contrastive pretext task can fully leverage the self-supervised signals in the corpus. We first sample the original graph $G$ to obtain multiple subgraphs $G^{sub} = \{g_1^{sub}, g_2^{sub}, \ldots, g_N^{sub}\}$. Then, we generate an augmented copy of each subgraph using VGAE, derived as follows:
$$\tilde{G}^{sub} = \mathrm{VGAE}\big(\mathrm{MLP}(G^{sub}), A^{sub}\big),$$
where $A^{sub}$ is the adjacency matrix corresponding to $G^{sub}$. We construct $N$ pairs of positive samples, where each pair consists of $(g^{sub}, \tilde{g}^{sub})$. All other pairs are considered negative samples. We use InfoNCE to optimize the parameters, and the contrastive loss is defined as follows:
$$\mathcal{L}_{cl} = -\sum_{m \in M} \sum_{i=1}^{N} \log \frac{\exp\big((x_i^m)^{\top} \tilde{x}_i^m / \tau\big)}{\sum_{k=1}^{N} \exp\big((x_i^m)^{\top} \tilde{x}_k^m / \tau\big)},$$
where $x_i^m$ and $\tilde{x}_i^m$ represent the embedding vectors of the corresponding nodes in the original subgraph and the augmented subgraph, respectively, for the $i$-th positive sample pair, and $\tau$ is a temperature parameter that controls the sharpness of the contrastive loss.
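A compact sketch of this contrastive objective is given below, treating row $i$ of the two embedding matrices as a positive pair and all other rows as negatives. The L2 normalization of embeddings and the cross-entropy formulation are common implementation choices and are assumptions here; in the full model the loss would be computed per modality and summed.

```python
import torch
import torch.nn.functional as F

def infonce_loss(x: torch.Tensor, x_aug: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """InfoNCE over N (original, augmented) subgraph embeddings: the i-th rows of
    x and x_aug form a positive pair, all other rows serve as negatives."""
    x = F.normalize(x, dim=1)          # cosine-style similarities (an assumption)
    x_aug = F.normalize(x_aug, dim=1)
    logits = x @ x_aug.t() / tau       # N x N similarity matrix scaled by temperature
    labels = torch.arange(x.size(0))   # positive pairs sit on the diagonal
    return F.cross_entropy(logits, labels)

# usage: 16 subgraph embeddings of dimension 64, one modality
loss = infonce_loss(torch.randn(16, 64), torch.randn(16, 64), tau=0.2)
print(loss.item())
```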

3.4. Dual Spatio-Temporal Cross-Residual Module

To effectively capture users’ latent preferences, we developed a spatio-temporal cross-residual module. This module employs a residual connection between an original and an enhanced GCN to integrate graph propagation information from different temporal steps. The derivation of the architecture is presented below:
$$e^m = \sum_{i=1}^{N} \mathrm{GCN}_{\mathrm{normal}}\big(e_i^m\big) + \sum_{j=1}^{N} \mathrm{GCN}_{\mathrm{enhanced}}\big(e_{N-j+1}^m\big),$$
where $e^m$ represents the embedding for modality $m$, $N$ denotes the total number of GCN propagation layers, $i$ is the layer index of $\mathrm{GCN}_{\mathrm{normal}}$, and $j$ is the layer index of $\mathrm{GCN}_{\mathrm{enhanced}}$.
$$x = \mathrm{GCN}_{\mathrm{enhanced}}(e^m), \qquad x \leftarrow x + \mathrm{VGAE}\big(G^{sub}(x), A^{sub}(x)\big).$$
For simplicity, we let $x = \mathrm{GCN}_{\mathrm{enhanced}}(e^m)$ and then generate a set of subgraphs and adjacency matrices for $x$. We use VGAE to generate enhanced subgraphs, which are then residual-connected with $\mathrm{GCN}_{\mathrm{enhanced}}(e^m)$.
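The sketch below illustrates one reading of this cross-residual design under the assumption of simple, parameter-free neighbor averaging for propagation: the layer-$i$ output of the normal branch is paired with the layer-$(N-i+1)$ output of the enhanced branch, and an optional hook stands in for the VGAE-based residual enhancement. The class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class CrossResidualGCN(nn.Module):
    """Sketch of the dual spatio-temporal cross-residual idea: early layers of one
    GCN branch are summed with late layers of the other, counteracting over-smoothing."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.num_layers = num_layers

    @staticmethod
    def propagate(adj, h):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return (adj @ h) / deg            # parameter-free neighbor averaging (a stand-in)

    def forward(self, e0: torch.Tensor, adj: torch.Tensor, enhance_fn=None):
        normal, enhanced = [e0], [e0]
        for _ in range(self.num_layers):
            normal.append(self.propagate(adj, normal[-1]))
            h = self.propagate(adj, enhanced[-1])
            if enhance_fn is not None:     # e.g. a VGAE-based augmentation hook
                h = h + enhance_fn(h)      # residual connection to the augmented signal
            enhanced.append(h)
        # cross-temporal sum: layer i of one branch with layer N-i+1 of the other
        out = sum(normal[i] + enhanced[self.num_layers - i + 1]
                  for i in range(1, self.num_layers + 1))
        return out

# usage: 6 propagation layers on a toy graph
module = CrossResidualGCN(num_layers=6)
emb = module(torch.randn(5, 64), torch.eye(5))
print(emb.shape)
```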

3.5. Multimodal Fusion

Fusing embeddings from different modalities is a crucial step in multimodal content processing. As illustrated in Figure 2, modeling user preferences based on single-modal information is often insufficient. When users exhibit heightened sensitivity to a particular modality, incorporating multiple modalities enables a more comprehensive modeling of user preferences and significantly enhances the accuracy of recommendations; even a simple model can intuitively demonstrate how different modalities influence user decisions. Therefore, we aim to capture the complex complementary dependencies inherent in multimodal data through a more fine-grained attention mechanism. This approach enables the establishment of clearer and more explicit boundaries for preference semantics. To this end, we propose a modality-channel multi-head attention mechanism to flexibly scale the influence of different modality features on the final representation.

3.5.1. Metadata and Meta-Preference Modeling

We focus on the importance of metadata within user preferences or item attributes. As shown in Figure 3, we define h u and h i as the metadata chunk features for users and items, respectively, aiming to model atomic user preferences through these metadata blocks, such as material attributes, color characteristics, price features, and more. For instance, this might include a user’s preferred color, their favorite product category, or an item’s type and size. Naturally, each metadata element warrants varying degrees of attention. User preferences exhibit diverse focuses, while item attributes, representing their selling points, are similarly subject to different levels of consideration. Therefore, we employ a dynamic weighted summation of metadata to capture the latent representations embedded within these interactions. This process is defined as follows:
$$H = \{h_1, h_2, \ldots, h_{|H|}\}, \qquad h_i = \frac{s_i}{\sum_{j=1}^{|H|} s_j} \cdot d, \qquad s_i \sim \mathcal{N}(0, 1), \quad i = 1, 2, \ldots, |H|,$$
where $s_i$ is a value randomly sampled from a standard Gaussian distribution with a mean of 0 and a variance of 1, $|H|$ is the number of metadata units (i.e., the number of samples), and $h_i$ is the length of each metadata unit. To isolate these distinct attribute features within each sample and map them to separate vector spaces for processing, we employ multi-head attention. For each metadata index $i \le |H|$, the corresponding feature $e_i^m$ of length $h_i$ is linearly projected into $|H|$ different subspaces, resulting in $Q_i^{|H|}$, $K_i^{|H|}$, $V_i^{|H|}$ as follows:
$$Q_i^{|H|} = W_Q^{|H|} e_i^m, \qquad K_i^{|H|} = W_K^{|H|} e_i^m, \qquad V_i^{|H|} = W_V^{|H|} e_i^m,$$
where $W_Q^{|H|}$, $W_K^{|H|}$, and $W_V^{|H|}$ are learnable weight matrices, and $|H|$ is the number of attention heads. We then calculate the attention weights between metadata units: for each head $h$, the attention weight between metadata unit $i$ and every other metadata unit $j$ is
$$\mathrm{Attention}(Q_i, K_j, V_j)^h = \mathrm{softmax}\!\left(\frac{Q_i^h (K_j^h)^{\top}}{\sqrt{d_k}}\right) V_j^h,$$
where $d_k$ is the dimension of $K_j^h$, and the softmax function normalizes the attention weights. We then perform a weighted summation of the outputs for each metadata unit: for each head $h$, the outputs $\mathrm{Attention}(Q_i, K_j, V_j)^h$ over all metadata units $j$ are summed, yielding the final output for metadata unit $i$:
$$\mathrm{Attention}(Q_i, K, V)^h = \sum_{j=1}^{|H|} \mathrm{Attention}(Q_i, K_j, V_j)^h.$$
We integrate the outputs of all metadata units by concatenating the outputs of all heads $h$ and applying a linear projection with a learnable weight matrix $W_O$, resulting in the final output:
$$e_i^m = W_O \big[\mathrm{Attention}(Q_i, K, V)^1, \ldots, \mathrm{Attention}(Q_i, K, V)^h\big],$$
$$e^m = \big[e_1^m, e_2^m, \ldots, e_{|H|}^m\big],$$
where $e_i^m$ represents the output of the $i$-th metadata unit calculated across the $|H|$ attention heads, and $e^m$ denotes the final output after processing all metadata units.
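As a small illustration of the metadata slicing step (the attribute-projection and attention parts are omitted), the sketch below samples $|H|$ Gaussian values, converts them into chunk lengths that sum to the embedding dimension $d$, and splits a feature vector accordingly; taking absolute values to keep the lengths positive is our own assumption. Each chunk would then be linearly projected into the per-head subspaces before the multi-head attention described above.

```python
import torch

def metadata_lengths(dim: int, num_chunks: int) -> list[int]:
    """Sample |H| Gaussian values and convert them into chunk lengths summing to `dim`.
    Absolute values keep lengths positive -- an implementation assumption."""
    s = torch.randn(num_chunks).abs()
    lengths = (s / s.sum() * dim).floor().long()
    lengths[-1] += dim - lengths.sum()        # give the rounding remainder to the last chunk
    return lengths.tolist()

def split_metadata(e: torch.Tensor, lengths: list[int]) -> list[torch.Tensor]:
    """Slice a feature vector e (length d) into variable-length metadata chunks h_i."""
    return list(torch.split(e, lengths))

# usage: a 64-dimensional modality embedding split into 8 metadata units
d, num_chunks = 64, 8
lengths = metadata_lengths(d, num_chunks)
chunks = split_metadata(torch.randn(d), lengths)
print(lengths, [c.numel() for c in chunks])
```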

3.5.2. Channel Cross-Fusion Convolution (CCFC)

Channel attention mechanisms have achieved remarkable success in convolutional operations for image processing [25,42]. Inspired by this, we introduce a novel multimodal dynamic fusion approach. We treat different modalities as distinct channels and initially cross-integrate the data from these channels, treating the result as a single modality. Subsequently, we employ a 1D deformable convolutional kernel to perform a convolution aggregation operation on the k nearest metadata entries to the target metadata. The aggregated result then serves as the new representation of the target metadata, defined as:
$$p_i^m = \mathrm{mean}(e_i^m), \qquad P = p_i^{text} \,\Vert\, p_i^{image}, \qquad W_{fusion} = \sigma(W_c P), \qquad e \leftarrow W_{fusion} \odot e.$$
We begin by applying average pooling to the embedding $e_i^m$ of the specified modality and metadata unit. The pooled results of the metadata across different modalities are then concatenated element-wise. Subsequently, we map the concatenated vector $P$ to attention scores using a learnable matrix $W_c$. Finally, we multiply these attention scores with the original features to update the features with the integrated weights. More specifically, the features of both text and image samples are divided into multiple metadata units, forming the set $H = \{h_1, h_2, \ldots, h_{|H|}\}$. The size of each metadata unit is determined by sampling from a Gaussian distribution, as described in Section 3.5.1. Subsequently, pooling operations are applied separately to the metadata of the image and text modalities. The pooled text metadata and image metadata are then concatenated in a staggered manner, as illustrated in the modality fusion module of Figure 1. Following this, channel attention is employed to compute the attention scores for each metadata unit. These scores are used as weights to scale the original metadata, thereby adjusting the corresponding modality features. The scaled features of both modalities are then restored to their original forms and combined through a weighted averaging process to produce the final fused result.
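The following sketch captures the spirit of CCFC under several simplifying assumptions: equal-length metadata chunks, a plain (non-deformable) 1D convolution over $k$ neighbouring channels, and sigmoid-normalized channel weights. All names are hypothetical, and the block is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelCrossFusion(nn.Module):
    """CCFC sketch: each metadata chunk of each modality is mean-pooled to one scalar,
    the pooled text/image scalars are interleaved, a small 1D convolution over
    neighbouring channels produces attention scores, and those scores rescale the
    original chunks before the two modalities are averaged."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, text_chunks, image_chunks):
        # 1) pool every metadata chunk to a single "channel" value
        p_text = torch.stack([c.mean() for c in text_chunks])
        p_image = torch.stack([c.mean() for c in image_chunks])
        # 2) interleave the two modalities into one channel sequence [t0, i0, t1, i1, ...]
        pooled = torch.stack([p_text, p_image], dim=1).flatten()
        # 3) convolve neighbouring channels and squash to (0, 1) weights
        w = torch.sigmoid(self.conv(pooled.view(1, 1, -1))).flatten()
        w_text, w_image = w[0::2], w[1::2]
        # 4) rescale each chunk by its channel weight, then average the modalities
        fused = [(wt * ct + wi * ci) / 2
                 for wt, ct, wi, ci in zip(w_text, text_chunks, w_image, image_chunks)]
        return torch.cat(fused)

# usage with equal-length chunks of 64-d text/image embeddings split into 8 units
chunks_t = list(torch.split(torch.randn(64), 8))
chunks_i = list(torch.split(torch.randn(64), 8))
fusion = ChannelCrossFusion(k=3)
print(fusion(chunks_t, chunks_i).shape)
```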

3.6. Prediction and Optimization

The final GCN outputs user and item representations, denoted as $h_u$ and $h_i$, respectively. The user–item interaction score, $y_{u,i}$, is computed as follows:
$$y_{u,i} = \sigma\big(h_u^{\top} h_i\big).$$
To optimize the ranking quality, we utilize the Bayesian Personalized Ranking (BPR) loss function. This approach involves constructing triplets $(u, i, j)$, where $u$ is a user, $i$ is an item that the user has interacted with (positive example), and $j$ is an item that the user has not interacted with (negative example). Based on these triplets, we calculate the corresponding positive example score, $y_{u,i}$, and negative example score, $y_{u,j}$. Our objective is to minimize the following BPR loss function:
$$\mathcal{L}_{bpr} = -\sum_{(u,i,j) \in D} \log \sigma\big(y_{u,i} - y_{u,j}\big) + \lambda \|\theta\|^2.$$
To prevent model overfitting caused by excessive model parameters, we introduce a regularization term. Specifically, we employ L2 regularization, defined as $\lambda \|\theta\|^2$, where $\lambda$ is the regularization strength, a positive hyperparameter that controls the amount of regularization applied, and $\theta$ represents the set of model parameters. Figure 4 illustrates the sensitivity to the regularization hyperparameter $\lambda$. The overall loss function of our model is defined as follows:
$$\mathcal{L}_{rec} = \mathcal{L}_{vgae} + \mathcal{L}_{cl} + \mathcal{L}_{bpr}.$$
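A minimal sketch of the BPR term is shown below, assuming inner-product scores and a mean over the sampled triplets; the placeholder values for the other two loss terms merely illustrate how the overall objective would be assembled. All names are hypothetical.

```python
import torch
import torch.nn.functional as F

def bpr_loss(h_u, h_pos, h_neg, params, lam: float = 1e-4) -> torch.Tensor:
    """BPR over (user, positive item, negative item) triplets with L2 regularization.
    Scores are inner products of the final user/item representations."""
    y_pos = (h_u * h_pos).sum(dim=1)
    y_neg = (h_u * h_neg).sum(dim=1)
    loss = -F.logsigmoid(y_pos - y_neg).mean()
    reg = lam * sum(p.pow(2).sum() for p in params)
    return loss + reg

# overall objective, assuming the VGAE and contrastive terms were computed elsewhere
h_u, h_pos, h_neg = torch.randn(32, 64), torch.randn(32, 64), torch.randn(32, 64)
l_bpr = bpr_loss(h_u, h_pos, h_neg, params=[h_u])
l_rec = torch.tensor(0.1) + torch.tensor(0.2) + l_bpr   # L_vgae + L_cl + L_bpr (placeholders)
print(l_rec.item())
```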

4. Experiment

In this section, we conduct comprehensive experiments to evaluate the performance of our proposed GC4MRec on three widely used real-world datasets.

4.1. Experimental Settings

4.1.1. Datasets

To evaluate our proposed GC4MRec in the top-N item recommendation task, we conduct extensive experiments on three categories of the widely used Amazon dataset: Sports, Baby, and Clothing. These datasets offer items from distinct categories and vary in the number of users, items, and interactions. The statistics of these datasets are presented in Table 1. Each dataset comprises both textual and visual modalities. Textual features are extracted as 384-dimensional embeddings using pre-trained sentence transformers. For the visual modality, we utilize publicly available 4,096-dimensional visual features.

4.1.2. Baselines

To demonstrate the effectiveness of our proposed GC4MRec, we compare it with the following state-of-the-art recommendation methods, spanning both general collaborative filtering and multimedia recommendation approaches.
General Collaborative Filtering:
* MF-BPR [43]: This method employs matrix factorization (MF) as its core model and optimizes the user and item latent representations using Bayesian Personalized Ranking (BPR) loss.
* LightGCN [8]: This method streamlines the conventional graph convolutional network (GCN) by eliminating non-linear activation and feature transformation layers, focusing on efficient neighborhood aggregation for recommendation.
Multimedia Recommendation:
* VBPR [26]: This method incorporates visual features as side information to enhance MF-BPR, improving recommendation performance by leveraging item visual characteristics. For a fair comparison, we extend VBPR to include textual modality in our experiments.
* MMGCN [12]: This method constructs modality-specific graphs and employs a GCN for each modality. The final prediction is obtained by fusing the representations learned from each modality’s GCN.
* LATTICE [13]: This method leverages multimodal features to construct an item semantic graph, capturing latent semantic correlations between items and improving recommendation quality. As our proposed method builds upon and improves LATTICE, we highlight these improvements in the Results section.
* DGVAE [14]: This method maps multimodal data into textual representations and employs VGAE for modeling and reconstructing the item–item graph, thereby enhancing the accuracy of the model’s recommendations.
* FREEDOM [19]: This method refines the user–item graph used in LATTICE by removing potential false-positive edges and freezes the item–item graph to boost recommendation performance.
* SLMRec [44]: This method integrates self-supervised learning into multimedia recommendation, utilizing data augmentations to uncover multimodal patterns and improve contrastive learning.
* MMSSL [30]: This method introduces a modality-aware interactive structure learning paradigm using adversarial perturbations and employs cross-modal contrastive learning to disentangle shared and modality-specific features.

4.1.3. Evaluation Protocols

To evaluate the performance fairly, we adopt Recall@k and NDCG@k as our evaluation metrics, where Recall@k measures the proportion of relevant items (e.g., user-interacted items) retrieved within the top k recommendations, and NDCG@k evaluates both the relevance and positional importance of ranked lists. Specifically, we employ Recall@10 (R@10), Recall@20 (R@20), NDCG@10 (N@10), and NDCG@20 (N@20). We report the metrics of all users in the test dataset. We follow the popular evaluation setting, splitting the data randomly into 8:1:1 for training, validation, and testing.
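For reference, the two metrics can be computed per user as sketched below with binary relevance; exact tie-breaking and the averaging over users may differ from the evaluation code actually used.

```python
import math

def recall_at_k(recommended: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the user's relevant (held-out) items appearing in the top-k list."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / max(len(relevant), 1)

def ndcg_at_k(recommended: list[int], relevant: set[int], k: int) -> float:
    """DCG of the top-k list with binary relevance, normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# usage: a top-5 recommendation list scored against a user's held-out items
recs, truth = [5, 9, 2, 7, 1], {2, 7, 42}
print(recall_at_k(recs, truth, 5), ndcg_at_k(recs, truth, 5))
```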

4.1.4. Implementation Details

Our proposed model is implemented using PyTorch [45]. We use an embedding dimension of 64 for both users and items across all models for consistency. The embedding parameters are initialized using Xavier initialization and optimized using the Adam optimizer [46] with a mini-batch size of 1024. The number of cross-residual layers is set to 6 and the first layer for all modalities is dropped. Other hyperparameters specific to our method, such as weighting factors for different modalities or graph construction parameters, are tuned empirically on the validation set. We train for a maximum of 500 epochs and employ early stopping based on the performance (e.g., Recall@20) on the validation set, stopping if performance does not improve for 20 consecutive epochs.
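A hypothetical training skeleton consistent with these settings (Xavier-initialized 64-dimensional embeddings, Adam, and early stopping on validation Recall@20 with a patience of 20 epochs) might look as follows; the sizes, learning rate, and placeholder evaluation are illustrative only and not the authors' code.

```python
import torch
import torch.nn as nn

# Xavier-initialized user/item embedding tables (toy sizes), optimized with Adam.
emb_users = nn.Parameter(torch.empty(1000, 64))
emb_items = nn.Parameter(torch.empty(5000, 64))
nn.init.xavier_uniform_(emb_users)
nn.init.xavier_uniform_(emb_items)
optimizer = torch.optim.Adam([emb_users, emb_items], lr=1e-3)  # lr tuned on validation

best_recall, patience, bad_epochs = 0.0, 20, 0
for epoch in range(500):
    # ... iterate over mini-batches of 1024 triplets, compute L_rec, and step the optimizer ...
    val_recall = 0.0  # placeholder: evaluate Recall@20 on the validation split here
    if val_recall > best_recall:
        best_recall, bad_epochs = val_recall, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # early stopping after 20 epochs without improvement
            break
```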

4.2. Performance Comparison

Table 2 presents the performance comparison of our proposed GC4MRec method against several state-of-the-art baselines on three datasets. We observe the following key trends:
1. Effectiveness of Graph-Based and Multimodal Methods: GCN-based methods consistently outperform MF-based models (MF-BPR and VBPR) across all datasets and metrics. This highlights the effectiveness of modeling user–item interactions as a graph and leveraging graph convolutional operations. Furthermore, incorporating multimodal information generally leads to improved performance. Multimodal methods like VBPR, LATTICE, FREEDOM, SLMRec, and MMSSL outperform MF-BPR in most cases. This demonstrates the value of leveraging visual and textual features for enriching item representations and improving recommendation quality.
2. Superiority of GC4MRec: Our proposed GC4MRec method achieves the best performance across all datasets and metrics. It surpasses the strongest baseline, DGVAE, achieving relative improvements of 4.12%, 6.45%, and 4.20% in terms of R@10 on Sports, Baby, and Clothing, respectively.
These gains demonstrate the effectiveness of GC4MRec.
3. Impact of Multimodal Features: While multimodal methods generally perform better, the effectiveness of multimodal features can vary across datasets. For example, DualGNN significantly improves upon LightGCN on Clothing, where visual and textual information likely plays a more crucial role in characterizing items, but shows less pronounced gains on Sports and Baby. This suggests that the informativeness of multimodal features can be dataset-dependent.
4. Benefits of Graph Structure Refinement: Methods that refine or augment the graph structure, such as FREEDOM (with its denoising and freezing components) and LATTICE (with its latent item–item graph), tend to perform well. FREEDOM consistently achieves strong results, highlighting the importance of mitigating noise in user–item interactions and leveraging robust item relationships.
5. Comparison with Self-Supervised Methods: Self-supervised methods like SLMRec, BM3, and MMSSL demonstrate competitive performance, showcasing the potential of leveraging self-supervision for learning better representations. However, GC4MRec still outperforms these methods. Through extensive experiments on three publicly available datasets, the proposed GC4MRec model demonstrably surpasses state-of-the-art models, highlighting its effectiveness compared to existing self-supervised methods.
Our work introduces a novel approach for enhancing subgraph representation through generative contrastive learning. By generating positive sample pairs from augmented subgraphs, the model effectively amplifies the representation of latent features within the data. Furthermore, a novel dual spatio-temporal cross-residual module is proposed, effectively fusing information from the original and augmented subgraphs, enabling the capture of dynamic user preference shifts across different temporal contexts. A key innovation lies in the channel cross fusion convolution strategy, dynamically learning weights for metadata across different modalities. This enables effective multimodal information fusion, leading to more accurate preference recommendations for users.
The superiority of GC4MRec is further corroborated by the performance curves depicted in Figure 5. GC4MRec continues to improve with increasing epochs, eventually converging to significantly higher values for both metrics, indicating that its advantage becomes more pronounced with extended training. This suggests that GC4MRec is better able to leverage the available data and learn more nuanced representations over time. This visual analysis further underscores the effectiveness of our proposed bilateral information flow module, generative augmentation via VGAE, and metadata preference modeling in capturing latent user preferences and improving recommendation accuracy. The consistent improvement across datasets and metrics, as observed in Table 2, provides compelling evidence for the robustness and generalizability of GC4MRec.

4.3. Ablation Study

To thoroughly examine the impact of individual components within our proposed GC4MRec model, as shown in Figure 6, a comprehensive ablation study was conducted across three distinct real-world datasets. This analysis aims to isolate and quantify the contributions of each component to the overall model performance. The following GC4MRec variants were systematically evaluated to achieve this objective:
* GC4MRec-GCN-ONLY: This variant removes all modules from GC4MRec, using only GCN. This helps analyze the impact of the layer-wise convolution approach. We can see that GC4MRec always outperforms GC4MRec-GCN-ONLY.
* GC4MRec-NoModalityFusion: This variant removes the multimodal fusion strategy from GC4MRec and retains the other modules, assessing the impact of multimodal fusion on performance. The ablation shows that performance drops by 3.5%, 3.5%, and 5.5% on the three datasets, respectively, compared with the full GC4MRec, indicating that capturing and fusing information from multiple modalities is essential for accurate recommendations.
* GC4MRec-NoGraphRefinement: This variant ablates the graph refinement process. We observe a performance decrease of 3%, 3%, and 5.4% when graph refinement is removed, showing that refining the graph structure is crucial for improving recommendation accuracy.
These ablation results demonstrate that each component of GC4MRec contributes to its overall performance. Removing any of these components leads to a noticeable decrease in recommendation accuracy, validating the design choices and effectiveness of our proposed method.

4.4. Hyperparameter Combination Effectiveness

We evaluate the performance of GC4MRec under different combinations of learning rate and regularization weight. The learning rate is varied from 1 × 10−1 to 1 × 10−6, and the regularization weight from 1 × 10−5 to 1 × 10−1. Figure 4 shows the R@20 and N@20 scores achieved by GC4MRec on the Clothing, Baby, and Sports datasets, respectively. The results reveal that the performance of GC4MRec varies across datasets and is sensitive to the choice of both learning rate and regularization weight. For instance, on the Clothing dataset, a higher learning rate generally leads to better R@20 performance, while for the Baby dataset, a moderate learning rate coupled with a smaller regularization weight yields the best results. The optimal hyperparameter configuration thus appears to be dataset-specific, and further investigation is needed to understand the interplay between these hyperparameters and the characteristics of each dataset.

5. Conclusions

In this work, we proposed GC4MRec, a novel self-supervised learning framework for multimodal recommendation. GC4MRec leverages contrastive learning with a Variational Graph Autoencoder (VGAE) for generating augmented subgraphs, enabling the accurate learning of latent features and high-order user–item connectivity. This approach addresses a key limitation of existing contrastive learning methods in recommendation systems: the difficulty in defining effective pretext tasks and the sparsity of interaction data. The augmented subgraphs provide rich positive sample pairs, enhancing the model’s ability to capture latent user preferences. Furthermore, a novel bidirectional cross-temporal residual module effectively fuses information from both original and augmented subgraphs, capturing dynamic shifts in user preferences across different temporal contexts, a critical aspect often overlooked by static fusion strategies. The introduction of a cross-channel fusion convolution module (CCFC) allows for the dynamic learning of metadata weights from different modalities, enabling more effective and personalized multimodal information fusion compared to simpler methods like concatenation or static weighted averaging.
Extensive experiments on three public datasets demonstrate that GC4MRec outperforms state-of-the-art models. These improvements are not just marginal but reflect GC4MRec’s ability to effectively model complex multimodal interactions and dynamic user preferences. For instance, the significant improvement on the Sports dataset in Recall@10 illustrates GC4MRec’s effectiveness in capturing long-tail user preferences and recommending relevant niche items, a direct result of the metadata-aware fusion mechanism and graph refinement. Ablation studies further validate the contributions of each component (the VGAE-based augmentation, the bidirectional residual module, and the CCFC fusion strategy), showcasing their synergistic effect in achieving superior recommendation accuracy.
The implications of GC4MRec extend beyond the datasets explored here. Its potential is applicable in various recommendation scenarios, such as social media feeds, news aggregation platforms, and e-commerce product suggestions, where diverse modalities and dynamic user behaviors are prevalent. Future work will focus on enriching data diversity by incorporating additional modalities, such as user reviews or product descriptions. This will provide a more holistic view of user preferences and item characteristics. Moreover, we will investigate the integration of large language models (LLMs). Specifically, we envision employing LLMs for generating synthetic user profiles or augmenting item descriptions, addressing the cold-start problem and enhancing the semantic representation of items. We also plan to explore personalized prompt engineering techniques for LLMs tailored to individual user preferences and item characteristics. This will enable more fine-grained control over the generation process and lead to even more accurate and personalized recommendations. By addressing these challenges and exploring these promising directions, GC4MRec provides a robust framework and paves the way for the next generation of multimodal recommendation systems.

Author Contributions

Conceptualization: Y.L. and L.W.; Methodology: Y.L.; Software: L.W.; Validation: L.W., J.L. and H.W.; Formal Analysis: H.W.; Investigation: L.W.; Resources: L.W.; Data Curation: Y.L.; Writing—Original Draft Preparation: Y.L.; Writing—Review and Editing: Y.L.; Visualization: H.W.; Supervision: J.L.; Project Administration: J.L.; Funding Acquisition: L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No.12271201).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article, and further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lops, P.; de Gemmis, M.; Semeraro, G. Content-based Recommender Systems: State of the Art and Trends. In Recommender Systems Handbook; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar] [CrossRef]
  2. Mooney, R.J.; Roy, L. Content-based book recommending using learning for text categorization. In Proceedings of the Fifth ACM Conference on Digital Libraries, ACM, San Antonio, TX, USA, 2–7 June 2000; DL00. pp. 195–204. [Google Scholar] [CrossRef]
  3. Koren, Y.; Rendle, S.; Bell, R. Advances in collaborative filtering. In Recommender Systems Handbook; Springer: Berlin/Heidelberg, Germany, 2021; pp. 91–142. [Google Scholar] [CrossRef]
  4. Goldberg, D.; Nichols, D.; Oki, B.M.; Terry, D. Using collaborative filtering to weave an information tapestry. Commun. ACM 1992, 35, 61–70. [Google Scholar] [CrossRef]
  5. Breese, J.S.; Heckerman, D.; Kadie, C. Empirical analysis of predictive algorithms for collaborative filtering. arXiv 2013, arXiv:1301.7363. [Google Scholar]
  6. Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, Hong Kong, China, 1–5 May 2001; pp. 285–295. [Google Scholar] [CrossRef]
  7. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  8. He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Xi’an, China, 25–30 July 2020; pp. 639–648. [Google Scholar] [CrossRef]
  9. Liu, F.; Cheng, Z.; Zhu, L.; Gao, Z.; Nie, L. Interest-aware message-passing GCN for recommendation. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 12–16 April 2021; pp. 1296–1305. [Google Scholar] [CrossRef]
  10. Peng, S.; Sugiyama, K.; Mine, T. SVD-GCN: A simplified graph convolution paradigm for recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 1625–1634. [Google Scholar] [CrossRef]
  11. Wu, L.; Sun, P.; Hong, R.; Fu, Y.; Wang, X.; Wang, M. Socialgcn: An efficient graph convolutional network based model for social recommendation. arXiv 2018, arXiv:1811.02815. [Google Scholar]
  12. Wei, Y.; Wang, X.; Nie, L.; He, X.; Hong, R.; Chua, T.S. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1437–1445. [Google Scholar] [CrossRef]
  13. Zhang, J.; Zhu, Y.; Liu, Q.; Wu, S.; Wang, S.; Wang, L. Mining latent structures for multimedia recommendation. In Proceedings of the 29th ACM International Conference on Multimedia, Online, 20–24 October 2021; pp. 3872–3880. [Google Scholar] [CrossRef]
  14. Zhou, X.; Miao, C. Disentangled graph variational auto-encoder for multimodal recommendation with interpretability. IEEE Trans. Multimed. 2024, 26, 7543–7554. [Google Scholar] [CrossRef]
  15. Zhou, H.; Zhou, X.; Zhang, L.; Shen, Z. Enhancing dyadic relations with homogeneous graphs for multimodal recommendation. In ECAI 2023; IOS Press: Amsterdam, The Netherlands, 2023; pp. 3123–3130. [Google Scholar] [CrossRef]
  16. Liu, K.; Xue, F.; Guo, D.; Sun, P.; Qian, S.; Hong, R. Multimodal graph contrastive learning for multimedia-based recommendation. IEEE Trans. Multimed. 2023, 25, 9343–9355. [Google Scholar] [CrossRef]
  17. Zhou, B.; Liang, Y. UPGCN: User Perception-Guided Graph Convolutional Network for Multimodal Recommendation. Appl. Sci. 2024, 14, 10187. [Google Scholar] [CrossRef]
  18. Cui, X.; Qu, X.; Li, D.; Yang, Y.; Li, Y.; Zhang, X. Mkgcn: Multi-modal knowledge graph convolutional network for music recommender systems. Electronics 2023, 12, 2688. [Google Scholar] [CrossRef]
  19. Zhou, X.; Shen, Z. A tale of two graphs: Freezing and denoising graph structures for multimodal recommendation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October– 3 November 2023; pp. 935–943. [Google Scholar] [CrossRef]
  20. Li, X.; Wang, N.; Zeng, J.; Li, J. Time-frequency sensitive prompt tuning framework for session-based recommendation. Expert Syst. Appl. 2025, 270, 126501. [Google Scholar] [CrossRef]
  21. Wang, X.; Yue, H.; Guo, L.; Guo, F.; He, C.; Han, X. User identification network with contrastive clustering for shared-account recommendation. Inf. Process. Manag. 2025, 62, 104055. [Google Scholar] [CrossRef]
  22. Zhou, C.; Zhou, S.; Huang, J.; Wang, D. Hierarchical Self-Supervised Learning for Knowledge-Aware Recommendation. Appl. Sci. 2024, 14, 9394. [Google Scholar] [CrossRef]
  23. Ma, J.; Wan, Y.; Ma, Z. Memory-Based Learning and Fusion Attention for Few-Shot Food Image Generation Method. Appl. Sci. 2024, 14, 8347. [Google Scholar] [CrossRef]
  24. Kipf, T.N.; Welling, M. Variational graph auto-encoders. arXiv 2016, arXiv:1611.07308. [Google Scholar]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  26. He, R.; McAuley, J. VBPR: Visual bayesian personalized ranking from implicit feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar] [CrossRef]
  27. Liu, F.; Cheng, Z.; Sun, C.; Wang, Y.; Nie, L.; Kankanhalli, M. User diverse preference modeling by multimodal attentive metric learning. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1526–1534. [Google Scholar] [CrossRef]
  28. Chen, J.; Zhang, H.; He, X.; Nie, L.; Liu, W.; Chua, T.S. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, 7–11 August 2017; pp. 335–344. [Google Scholar]
  29. Tao, Z.; Wei, Y.; Wang, X.; He, X.; Huang, X.; Chua, T.S. Mgat: Multimodal graph attention network for recommendation. Inf. Process. Manag. 2020, 57, 102277. [Google Scholar] [CrossRef]
  30. Wei, W.; Huang, C.; Xia, L.; Zhang, C. Multi-modal self-supervised learning for recommendation. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April– 4 May 2023; pp. 790–800. [Google Scholar] [CrossRef]
  31. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. Available online: https://proceedings.mlr.press/v139/radford21a (accessed on 19 January 2025).
  32. Wu, J.; Wang, X.; Feng, F.; He, X.; Chen, L.; Lian, J.; Xie, X. Self-supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 11–15 July 2021; pp. 726–735. [Google Scholar] [CrossRef]
  33. Xie, X.; Sun, F.; Liu, Z.; Wu, S.; Gao, J.; Zhang, J.; Ding, B.; Cui, B. Contrastive learning for sequential recommendation. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), IEEE, Virtual Event, 9–12 May 2022; pp. 1259–1273. [Google Scholar] [CrossRef]
  34. Xia, X.; Yin, H.; Yu, J.; Wang, Q.; Cui, L.; Zhang, X. Self-supervised hypergraph convolutional networks for session-based recommendation. In Proceedings of the AAAI conference on artificial intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 4503–4511. [Google Scholar] [CrossRef]
  35. Yang, Y.; Wu, L.; Hong, R.; Zhang, K.; Wang, M. Enhanced graph learning for collaborative filtering via mutual information maximization. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 11–15 July 2021; pp. 71–80. [Google Scholar] [CrossRef]
  36. Lin, Z.; Tian, C.; Hou, Y.; Zhao, W.X. Improving graph collaborative filtering with neighborhood-enriched contrastive learning. In Proceedings of the ACM Web Conference 2022, Virtual Event, 25–29 April 2022; pp. 2320–2329. [Google Scholar] [CrossRef]
  37. Yang, Z.; Wu, J.; Wang, Z.; Wang, X.; Yuan, Y.; He, X. Generate What You Prefer: Reshaping Sequential Recommendation via Guided Diffusion. Adv. Neural Inf. Process. Syst. 2024, 36, 24247–24261. [Google Scholar]
  38. Xie, Z.; Liu, C.; Zhang, Y.; Lu, H.; Wang, D.; Ding, Y. Adversarial and contrastive variational autoencoder for sequential recommendation. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 449–459. [Google Scholar] [CrossRef]
  39. Zhu, Y.; Wu, L.; Guo, Q.; Hong, L.; Li, J. Collaborative large language model for recommender systems. In Proceedings of the ACM on Web Conference 2024, Singapore, 13–17 May 2024; pp. 3162–3172. [Google Scholar] [CrossRef]
  40. Sun, F.; Liu, J.; Wu, J.; Pei, C.; Lin, X.; Ou, W.; Jiang, P. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 1441–1450. [Google Scholar] [CrossRef]
  41. Hao, Y.; Zhao, P.; Fang, J.; Qu, J.; Liu, G.; Zhuang, F.; Sheng, V.S.; Zhou, X. Meta-optimized joint generative and contrastive learning for sequential recommendation. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), IEEE, Utrecht, The Netherlands, 13–17 May 2024; pp. 705–718. [Google Scholar] [CrossRef]
  42. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11531–11539. [Google Scholar] [CrossRef]
  43. Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. arXiv 2012, arXiv:1205.2618. [Google Scholar]
  44. Tao, Z.; Liu, X.; Xia, Y.; Wang, X.; Yang, L.; Huang, X.; Chua, T.S. Self-supervised learning for multimedia recommendation. IEEE Trans. Multimed. 2022, 25, 5107–5116. [Google Scholar] [CrossRef]
45. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32. Available online: https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html (accessed on 19 January 2025).
46. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. The overview of the proposed method.
Figure 2. Multimodal data-driven user preferences.
Figure 3. Metadata-based user preference modeling.
Figure 4. An evaluation of GC4MRec performance with varying hyperparameter configurations across multiple datasets.
Figure 5. Performance comparison curves against a state-of-the-art model on the Baby dataset.
Figure 6. Results of the ablation experiments.
Table 1. Statistics of the datasets.
Dataset 1      # Users    # Items    # Interactions    Density
Clothing       39,387     23,033     237,488           0.00026
Sports         35,598     18,357     256,308           0.00039
Baby           19,445     7,050      139,110           0.00101
1 Datasets can be accessed at https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html (accessed on 18 January 2025).
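As a sanity check on the Density column, the following minimal Python sketch recomputes it under the common assumption that density = #interactions / (#users × #items); the figures are copied from Table 1, and this is an illustrative reconstruction rather than the authors' preprocessing code.

# Recompute the Density column of Table 1, assuming
# density = #interactions / (#users * #items).
datasets = {
    "Clothing": (39_387, 23_033, 237_488),
    "Sports": (35_598, 18_357, 256_308),
    "Baby": (19_445, 7_050, 139_110),
}
for name, (n_users, n_items, n_interactions) in datasets.items():
    density = n_interactions / (n_users * n_items)
    print(f"{name}: {density:.5f}")
# Expected output: Clothing 0.00026, Sports 0.00039, Baby 0.00101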
Table 2. Performance comparison: models ranked by metric (with improvement).

R@10 Ranking
Model             Sports             Baby               Clothing
GC4MRec (Ours)    0.0784             0.0677             0.0645
DGVAE             0.0753 (4.12%)     0.0636 (6.45%)     0.0619 (4.20%)
FREEDOM           0.0717 (9.34%)     0.0627 (7.97%)     0.0629 (2.54%)
SLMRec            0.0671 (16.84%)    0.0504 (34.33%)    0.0447 (44.30%)
MMSSL             0.0652 (20.25%)    0.0620 (9.19%)     0.0492 (31.10%)
LATTICE           0.0588 (33.33%)    0.0528 (28.22%)    0.0483 (33.54%)
VBPR              0.0542 (44.65%)    0.0450 (50.44%)    0.0277 (132.85%)
LightGCN          0.0527 (48.77%)    0.0503 (34.59%)    0.0371 (73.85%)
MMGCN             0.0401 (95.51%)    0.0390 (73.59%)    0.0220 (193.18%)
MF-BPR            0.0451 (73.84%)    0.0346 (95.66%)    0.0177 (264.41%)

R@20 Ranking
Model             Sports             Baby               Clothing
GC4MRec (Ours)    0.1178             0.1074             0.0959
DGVAE             0.1127 (4.53%)     0.1009 (6.44%)     0.0917 (4.58%)
FREEDOM           0.1089 (8.17%)     0.0992 (8.27%)     0.0941 (1.91%)
SLMRec            0.0998 (18.04%)    0.0768 (39.84%)    0.0662 (44.86%)
MMSSL             0.0981 (20.08%)    0.0969 (10.84%)    0.0783 (22.48%)
LATTICE           0.0926 (27.21%)    0.0842 (27.55%)    0.0726 (32.09%)
VBPR              0.0849 (38.75%)    0.0668 (60.78%)    0.0405 (136.79%)
LightGCN          0.0832 (41.59%)    0.0766 (40.21%)    0.0547 (75.32%)
MMGCN             0.0632 (86.39%)    0.0624 (72.12%)    0.0347 (176.37%)
MF-BPR            0.0662 (77.95%)    0.0566 (89.75%)    0.0252 (280.56%)

N@10 Ranking
Model             Sports             Baby               Clothing
GC4MRec (Ours)    0.0421             0.0360             0.0349
DGVAE             0.0410 (2.68%)     0.0340 (5.88%)     0.0336 (3.87%)
FREEDOM           0.0385 (9.35%)     0.0330 (9.09%)     0.0341 (2.35%)
MMSSL             0.0369 (14.09%)    0.0333 (8.11%)     0.0271 (28.78%)
SLMRec            0.0354 (18.93%)    0.0299 (20.40%)    0.0231 (51.08%)
LATTICE           0.0311 (35.37%)    0.0281 (28.11%)    0.0251 (39.04%)
VBPR              0.0310 (35.81%)    0.0236 (52.54%)    0.0155 (125.16%)
LightGCN          0.0291 (44.67%)    0.0250 (44.00%)    0.0192 (81.77%)
MMGCN             0.0189 (122.75%)   0.0207 (73.91%)    0.0115 (203.48%)
MF-BPR            0.0244 (72.54%)    0.0205 (75.61%)    0.0100 (249.00%)

N@20 Ranking
Model             Sports             Baby               Clothing
GC4MRec (Ours)    0.0523             0.0462             0.0429
DGVAE             0.0506 (3.36%)     0.0436 (5.96%)     0.0412 (4.13%)
FREEDOM           0.0481 (8.73%)     0.0424 (8.96%)     0.0420 (2.14%)
MMSSL             0.0462 (13.20%)    0.0427 (8.20%)     0.0345 (24.35%)
SLMRec            0.0445 (17.53%)    0.0367 (25.89%)    0.0299 (43.48%)
LATTICE           0.0399 (31.08%)    0.0352 (31.25%)    0.0311 (37.94%)
VBPR              0.0401 (30.42%)    0.0299 (54.52%)    0.0189 (126.98%)
LightGCN          0.0354 (47.74%)    0.0335 (37.91%)    0.0248 (72.98%)
MMGCN             0.0244 (114.34%)   0.0284 (62.68%)    0.0153 (180.39%)
MF-BPR            0.0301 (73.75%)    0.0250 (84.80%)    0.0119 (260.50%)
Note: Improvement is calculated as (GC4MRec Value − Model Value)/Model Value × 100.
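For concreteness, the following minimal Python sketch applies the improvement formula from the note above to one cell of Table 2; the function name relative_improvement is illustrative rather than taken from the paper.

def relative_improvement(ours: float, baseline: float) -> float:
    # Improvement (%) = (GC4MRec value - baseline value) / baseline value * 100,
    # as defined in the note to Table 2.
    return (ours - baseline) / baseline * 100

# Example: R@10 on Sports, GC4MRec (0.0784) vs. DGVAE (0.0753).
print(f"{relative_improvement(0.0784, 0.0753):.2f}%")  # prints 4.12%, matching Table 2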
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
