Article

Integrating Temporal Interest Dynamics and Virality Factors for High-Precision Ranking in Big Data Recommendation

1 Television School, Communication University of China, Beijing 100024, China
2 China Agricultural University, Beijing 100083, China
3 National School of Development, Peking University, Beijing 100871, China
4 Artificial Intelligence Research Institute, Tsinghua University, Beijing 100084, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(18), 3687; https://doi.org/10.3390/electronics14183687
Submission received: 29 August 2025 / Revised: 16 September 2025 / Accepted: 17 September 2025 / Published: 18 September 2025
(This article belongs to the Special Issue Advances in Data-Driven Artificial Intelligence)

Abstract

In large-scale recommendation scenarios, achieving high-precision ranking requires simultaneously modeling user interest dynamics and content propagation potential. In this work, we propose a unified framework that integrates a temporal interest modeling stream with a multimodal virality encoder. The temporal stream captures sequential user behavior through the self-attention-based modeling of long-term and short-term interests, while the virality encoder learns latent virality factors from heterogeneous modalities, including text, images, audio, and user comments. The two streams are fused in the ranking layer to form a joint representation that balances personalized preference with content dissemination potential. To further enhance efficiency, we design hierarchical cascade heads with gating recursion for progressive refinement, along with a multi-level pruning and cache management strategy that reduces redundancy during inference. Experiments on three real-world datasets (Douyin, Bilibili, and MIND) demonstrate that our method achieves significant improvements over state-of-the-art baselines across multiple metrics. Additional analyses confirm the interpretability of the virality factors and highlight their positive correlation with real-world popularity indicators. These results validate the effectiveness and practicality of our approach for high-precision recommendation in big data environments.

1. Introduction

With the rapid expansion of new media content and the increasing complexity of user behavior patterns, short-video platforms and news portals have imposed higher requirements on personalized recommendation systems. Click-through rate (CTR) prediction, as a core indicator of recommendation ranking, has traditionally relied on static features such as user profiles and content tags, which are insufficient to capture the dynamic evolution of user behavior. Early CTR prediction models were mainly based on conventional machine learning methods such as logistic regression (LR), which depend on handcrafted static features including user demographics (e.g., age, gender, and location) and content categories [1]. While these static features can partially reflect user interests and content attributes, they are limited in their ability to represent the dynamic changes in user preferences and the temporal propagation characteristics of content. The limitations of traditional CTR methods are rooted not only in static feature modeling but also in the lack of sequential behavioral modeling. For instance, linear models such as LR assume feature independence, thereby failing to capture nonlinear evolutions in user interests. Studies have shown that user click behaviors exhibit strong temporal dependencies, such as purchase decisions being influenced by recent browsing history, which cannot be effectively modeled by LR [2]. Although factorization machines (FMs) are capable of modeling second-order feature interactions, their ability to capture higher-order dynamic interactions remains insufficient [3]. These challenges are particularly evident in modern recommendation scenarios such as short-video and news platforms, where user preferences evolve rapidly and content carries rich multimodal signals. Addressing them requires more powerful models capable of dynamic and high-order representation learning.
In recent years, deep learning has achieved remarkable progress in recommendation systems. Models such as multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and Transformers have been widely adopted in CTR prediction tasks [4]. These models are capable of automatically learning nonlinear feature interactions and generating high-order abstract feature representations [5]. However, despite these advances, existing deep models still suffer from several critical limitations.
First, most approaches fail to adequately capture the dynamic evolution of user behavior. For example, the IARM model applies multi-head attention to represent temporal changes in preferences [6], but its modeling is still restricted to short-term dependencies and cannot fully reflect long-term interest shifts. Similarly, frameworks like HyperCTR [7] introduce hypergraph structures to encode sequential dependencies, yet they primarily focus on static group-level relations while neglecting the fine-grained temporal variability of individual users. In real-world short-video or news scenarios, user interests fluctuate rapidly and can change dramatically within minutes; models lacking robust temporal modeling risk producing outdated or irrelevant recommendations. Second, the integration of multimodal content remains insufficient. Although ContentCTR [8] leverages multimodal Transformer architectures for frame-level CTR prediction, existing methods often emphasize visual and textual signals while underutilizing other modalities such as audio, live comments, or interaction context. This partial fusion limits the system’s ability to capture the true attractiveness and virality of content. In practice, short-video popularity often arises from subtle cross-modal synergies (e.g., a trending audio clip combined with a catchy caption), which cannot be fully modeled by unimodal or weakly fused architectures. Third, real-time recommendation imposes efficiency constraints that many current models cannot meet. For instance, MTLA [9] introduces key-value cache compression to accelerate Transformer inference, but most temporal-attention-based models remain computationally expensive and struggle to balance accuracy with latency. In high-traffic platforms, even slight inefficiencies can result in significant delays, negatively impacting user experience and platform revenue.
In summary, while existing deep CTR models have improved over traditional methods, they still fall short in three crucial aspects: (1) the robust long-term temporal modeling of user interests, (2) comprehensive multimodal content integration, and (3) efficient real-time inference. These limitations highlight the need for a unified framework that simultaneously addresses temporal evolution, multimodal fusion, and scalability for large-scale recommendation scenarios. A unified recommendation framework integrating temporal-attention mechanisms and multimodal Transformer structures is proposed in this work. By modeling both the evolution of user behaviors and content virality, the framework jointly optimizes CTR prediction and recommendation ranking. The primary contributions are as follows:
  • A multimodal virality-aware module is designed by fusing text, image, audio, and user comment streams to enhance content-level virality assessment.
  • Propagation potential factors are jointly introduced with click-sequence features to optimize the ranking mechanism, enabling accurate recommendations driven by both “interest” and “popularity”.
  • Hierarchy classification heads are constructed to generate the final ranking outputs.

2. Related Work

2.1. CTR Prediction and Sequential Modeling Methods

The breakthroughs brought by deep learning to CTR prediction can be summarized in two major aspects: dynamic interest modeling and long-sequence dependency modeling. In terms of dynamic interest modeling, the deep interest network (DIN) leverages local attention to focus on historical behaviors related to candidate items, thus enhancing the specificity of interest representation [10,11]. The deep interest evolution network (DIEN) further incorporates an interest evolution LSTM to capture the “generation–evolution–decay” process of interests, leading to significantly higher prediction accuracy compared to DIN in e-commerce scenarios [12]. For long-sequence dependency modeling, recent studies have demonstrated the effectiveness of self-attention mechanisms in complex sequential domains, such as IoT security [13], financial transaction modeling [14], and cybersecurity anomaly detection [15]. These applications highlight the flexibility and generalizability of self-attention for diverse types of sequential data. Transformer4Rec, in particular, introduces self-attention mechanisms into recommendation systems, efficiently modeling long-term dependencies in user behavior sequences. Compared to RNN-based models, Transformer4Rec achieves faster inference on ultra-long sequences while preserving accuracy, thereby improving both the precision and timeliness of recommendations [16]. Moreover, the TPP-CTR model incorporates temporal point processes (TPPs) to capture the “burstiness” and “periodicity” of user interests in continuous time, reducing CTR prediction error in cold-start scenarios [17,18].
Nevertheless, existing approaches still face challenges. For instance, traditional methods often struggle to capture abrupt changes in user interests (e.g., sudden surges in news clicks), and the integration of sequential features with static ones (such as user profiles) remains an open problem [19,20]. Furthermore, the O(n^2) complexity of Transformer-based models in handling extremely long sequences results in inefficiency [21,22].

2.2. Multimodal Content Understanding and Propagation Modeling

The development of multimodal fusion has progressed through multiple phases. Early approaches, such as MMGCN and GNN-CON, concatenated feature vectors from different modalities but ignored semantic heterogeneity, resulting in limited cross-modal retrieval performance [23,24,25]. In the deep fusion stage, ViLT models image patches as tokens that interact with text tokens, enabling end-to-end vision–language modeling and improving multimodal classification accuracy [26,27]. FLAVA employs pretraining to learn unified cross-modal representations, achieving an average score of 69.92 across multimodal tasks, outperforming competing methods [28]. In domain adaptation, VideoBERT treats video frames as “visual tokens” to capture temporal semantics, enhancing CTR performance in short-video recommendation [29], while CLIP leverages image–text contrastive learning to learn joint multimodal representations, significantly boosting multimodal understanding and generation capabilities [30].
In terms of propagation modeling, traditional SIR models fail to account for user heterogeneity. Although effective for describing topic diffusion in online forums, their homogeneity assumptions prevent applicability to real-world populations with complex spatial and group structures [31]. By contrast, diffusion models used for predicting content virality have achieved significantly higher accuracy [32]. Additionally, popularity prediction methods based on diffusion dynamics analyze propagation paths and velocities to estimate potential influence. For example, certain studies have employed diffusion models to simulate content propagation, thereby substantially improving the accuracy of virality estimation for short videos [33].

2.3. Content Popularity and Virality Modeling in Recommendation Systems

The modeling of content popularity and virality is critical for enhancing the adaptivity and precision of recommendation systems, as it captures temporal evolution patterns and user-driven interaction mechanisms [34]. Recent advances have integrated deep learning, temporal analysis, and network modeling to establish multidimensional frameworks for dynamic content evaluation, significantly improving system responsiveness to fluctuations in user interests and content dissemination trends.
Two major challenges in popularity modeling involve time-decay effects and burst prediction. Traditional methods rely on sliding-window statistics of click counts but cannot effectively capture abrupt popularity surges, such as traffic spikes triggered by breaking news [35]. Deep learning-based approaches provide potential solutions. For instance, generative adversarial networks (GANs) employ adversarial training between generators and discriminators to produce realistic synthetic data and have been widely applied in content generation. In popularity modeling, self-attention mechanisms capture long-range dependencies in user behavior sequences, enabling more accurate prediction of content popularity trends [36,37]. Similarly, self-supervised learning and knowledge distillation offer further improvements. Self-supervised learning automatically generates training labels from unlabeled data, thereby reducing annotation costs and enhancing generalization, while knowledge distillation transfers knowledge from large teacher models to compact student models, improving efficiency without sacrificing accuracy [38]. In popularity and virality modeling, self-supervised learning can leverage large-scale unlabeled user interaction data to learn intrinsic representations, and knowledge distillation can simplify complex predictive models to enhance computational efficiency [39].

3. Materials and Methods

3.1. Data Collection

In this study, the construction of datasets covered two representative scenarios, namely short-video platforms and news platforms, to ensure the adaptability and generalization of the proposed recommendation framework with respect to multimodal and temporal characteristics, as shown in Table 1. The short-video data were primarily obtained from publicly available datasets on Douyin and Bilibili, collected through a combination of open platform interfaces and web-crawling tools, with the collection period spanning from January 2023 to December 2024. During the collection process, particular emphasis was placed on gathering multimodal information closely related to user behaviors and content virality. These included video title text, cover images, background audio features, comments and bullet-screen texts, upload times, video durations, and system-generated popularity indicators such as view counts, likes, shares, and comments. Such features not only reflected the static semantic attributes of the content but also revealed its temporal propagation potential and user interaction feedback. Textual information was processed through Chinese word segmentation and subsequently encoded using BERT to obtain semantic representations. Image features were extracted using ResNet and ViT, while audio data were modeled via Mel-spectrograms and convolutional networks. Meanwhile, bullet-screen comments and user replies were regarded as real-time interactive signals that revealed users’ instantaneous responses.
The news data consisted of the MIND news recommendation dataset combined with a self-constructed news click-log dataset, covering the period from January 2022 to December 2023. The MIND dataset, provided by Microsoft News, included news text content, categorical labels, publication times, and user click logs, thereby reflecting the dynamic nature of user interests in real-world news recommendation scenarios. The self-constructed dataset was collected through collaborations with media platforms and contained news titles, full texts, categorical tags, publication timestamps, user CTR, browsing durations, and sequences of user clicks. During the collection and preprocessing of all the datasets, we performed rigorous deduplication by removing duplicate items based on text hashing and visual similarity checks, and by filtering out consecutive identical actions from the same user. This procedure ensures objective evaluation and guarantees that the model is tested on truly unseen content and interactions. In addition, the raw news texts were standardized through segmentation, denoising, tokenization, and vectorization to enhance semantic consistency. User behavior logs were aligned using timestamps to form continuous click sequences throughout the news publishing cycle, enabling the modeling of dynamic interest evolution. Furthermore, diversity and balance were emphasized in the collection process to guarantee coverage of multiple domains such as society, finance, technology, and entertainment, thereby enhancing the applicability of the dataset to cross-domain recommendation scenarios.

3.2. Data Augmentation

During the training of text-GNN models, the quality and consistency of both image and text data directly influence the performance of the model. Consequently, preprocessing and augmentation represent critical steps in the training pipeline. The proper structuring of multimodal data reduces noise and redundancy while improving robustness and generalization. Common preprocessing techniques include text cleaning, spelling correction, and image size normalization, which provide cleaner and more structured inputs. In addition, multimodal consistency checking and image–text matching augmentation enhance cross-modal comprehension, further improving adaptability to diverse data variations, increasing training diversity, and reducing overfitting.
Text cleaning constitutes a fundamental step in text preprocessing. This process removes invalid digits, punctuation, emoticons, and irrelevant tags to minimize noise [40]. Since the dataset was collected from platforms such as Weibo and Xiaohongshu, a large number of noisy labels and emoticons were present. Such noise adversely affects model training. Furthermore, user-generated content often contains spelling errors, which compromise semantic understanding, hinder the correct mapping of tokens to vector space, and reduce model accuracy [41]. Thus, spelling correction is essential. Correction methods may rely on dictionary lookup, statistical language models, or deep learning approaches. Deep learning-based methods leverage contextual information to infer the most likely spelling. The maximum likelihood principle is commonly adopted, in which a potentially erroneous word s is identified, and the best candidate correction w_b is selected from a candidate set C(s), as expressed in the following formula [41]:
w_b = \arg\max_{w_i \in C(s)} P(s \mid w_i)\, P(w_i).
Here, P(s|w_i) denotes the probability of generating the erroneous string s instead of w_i, P(w_i) represents the prior probability of generating the candidate word, and C(s) returns valid candidate words from the dictionary W. In the error model, the edit distance D(s, w) between the erroneous word s and the candidate w is a positive real number. To fit a probabilistic framework, it is transformed by taking the negative logarithm [41]:
P(s \mid w_i) = -\log D(s, w).
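For illustration, a minimal Python sketch of this noisy-channel correction follows. It assumes a small hypothetical dictionary with unigram counts as the prior P(w_i) and approximates the error model with exp(−D(s, w)); both choices are illustrative stand-ins, not the exact components of the pipeline.

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance D(s, w) via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
            prev, dp[j] = dp[j], cur
    return dp[-1]

# hypothetical dictionary W with unigram counts acting as the prior P(w_i)
DICT_COUNTS = {"recommend": 300, "recommendation": 120, "ranking": 80}
TOTAL = sum(DICT_COUNTS.values())

def correct(s: str, max_dist: int = 2) -> str:
    """w_b = argmax_{w in C(s)} P(s|w) P(w), with log P(s|w) approximated by -D(s, w)."""
    candidates = [w for w in DICT_COUNTS if edit_distance(s, w) <= max_dist]
    if not candidates:
        return s  # no plausible correction found
    return max(candidates,
               key=lambda w: -edit_distance(s, w) + math.log(DICT_COUNTS[w] / TOTAL))

print(correct("recomend"))  # -> "recommend"
```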
In multimodal learning, image data must often be processed jointly with textual data to enable efficient batch processing across heterogeneous hardware platforms. Standardizing input image sizes mitigates complexity variations and provides uniform input for subsequent networks. An image is represented as x ∈ R^{H×W×C}, where C denotes the number of channels and (H, W) indicates the resolution [42]. Normalization typically involves scaling images to a fixed dimension. Smaller images are padded to the target size, while oversized images are cropped either centrally or randomly to preserve the aspect ratio. A commonly employed technique is min–max scaling:
I_{\mathrm{norm}} = \frac{I - \min(I)}{\max(I) - \min(I)},
where I represents the raw image, min(I) and max(I) are its minimum and maximum pixel values, and I_norm is the normalized image.
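A brief sketch of the resize-and-normalize step using Pillow and NumPy is given below; the 224 × 224 target size is an assumption chosen for illustration.

```python
import numpy as np
from PIL import Image

def load_and_normalize(path: str, size: int = 224) -> np.ndarray:
    """Resize an image to a fixed resolution, then apply min-max scaling to [0, 1]."""
    img = Image.open(path).convert("RGB").resize((size, size))
    arr = np.asarray(img, dtype=np.float32)      # shape (H, W, C)
    lo, hi = arr.min(), arr.max()
    return (arr - lo) / (hi - lo + 1e-8)         # I_norm = (I - min(I)) / (max(I) - min(I))
```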
Ensuring semantic consistency between textual and visual modalities is essential. Multimodal fusion has been regarded as the integration of visual and textual features [43]. Multimodal consistency checking involves validating the reliability and accuracy of fused information. Variations in modality-specific noise, data quality, and formats may lead to inconsistency in fusion outcomes. Semantic similarity between image and text representations can be measured to evaluate consistency [44]. Text vectors are extracted using BERT as follows:
h_i^t = \mathrm{BERT}(t_i),
where t_i denotes the i-th input sentence and h_i^t is the corresponding embedded feature. For visual data, semantic features are typically processed through an attention mechanism:
u_v = U^T \tanh(W_v h_v + b_v),
where W_v is a weight matrix, b_v is a bias term, U^T is a transposed weight vector, and u_v is a score assessing the importance of each feature vector. A softmax function is then applied to compute attention weights. Once textual and visual semantic representations are obtained, cosine similarity measures modality alignment:
s = \frac{s_t \cdot s_v}{\|s_t\|\,\|s_v\|},
where s_t and s_v denote the text and visual feature sequences, respectively. The similarity s ∈ [−1, 1], and higher values indicate stronger alignment. The similarity can be mapped to [0, 1] through a sigmoid function:
p_s = \mathrm{sigmoid}(s).
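A compact sketch of this consistency check, assuming the text and visual embeddings have already been pooled to vectors of equal dimension (the 768-dimensional stand-ins below are illustrative):

```python
import torch
import torch.nn.functional as F

def consistency_score(text_emb: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between modalities, mapped to [0, 1] with a sigmoid."""
    s = F.cosine_similarity(text_emb, vis_emb, dim=-1)  # s in [-1, 1]
    return torch.sigmoid(s)                             # p_s in (0, 1)

# usage with random stand-ins for BERT text features and attention-pooled visual features
p_s = consistency_score(torch.randn(4, 768), torch.randn(4, 768))
```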
To enhance semantic alignment between text and images, image–text matching augmentation serves as an effective strategy. Image–text matching refers to retrieving relevant text given an image description or identifying semantically similar images [45]. Weakly perturbed samples are generated through slight alterations in images and text, increasing training diversity and robustness. Perturbations may involve image rotations, scaling, or cropping, as well as textual synonym replacement or word-order shuffling. This augmentation allows models to learn more robust matching rules and improve accuracy. Attention mechanisms are further applied to capture intrinsic feature correlations, guided by global representations. The corresponding formula is [45]:
\tilde{V} = \mathrm{Att}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V,
where Q is the query matrix, K is the key, and V is the value. This formulation facilitates the capture of essential semantic cues within images. By introducing controlled perturbations in both modalities, weakly perturbed samples are created, effectively improving model robustness and enhancing its capacity for semantic matching.
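The weak-perturbation augmentation described above can be sketched as follows, using torchvision transforms on the image side and adjacent word swaps on the text side; the perturbation strengths are assumptions, and synonym replacement is omitted for brevity.

```python
import random
from torchvision import transforms

# mild image perturbations: small rotation, slight scale jitter with a random crop
image_aug = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),
])

def text_aug(sentence: str, p_swap: float = 0.1) -> str:
    """Swap a small fraction of adjacent word pairs to create a weakly perturbed caption."""
    words = sentence.split()
    for i in range(len(words) - 1):
        if random.random() < p_swap:
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)
```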

3.3. Proposed Method

3.3.1. Overall

In the overall framework design, the processed multimodal input data are directly fed into the proposed Dual-Stream recommendation architecture, forming an end-to-end modeling and inference pipeline, as shown in Figure 1. Specifically, the user behavioral sequence features and the multimodal features of candidate content are separately passed into two independent modeling pathways. The temporal interest modeling stream receives the user click sequence, where position embeddings and timestamp encodings are employed to transform behavioral logs into temporally sensitive representations. Through multiple layers of self-attention, the long-term dependencies and short-term preferences of user interests are captured, resulting in a dynamic interest vector. Simultaneously, the content virality modeling stream integrates textual, visual, acoustic, and comment-based modalities into a multimodal transformer, in which cross-modal attention mechanisms enable semantic alignment across modalities. The resulting fused representation is subsequently mapped into latent virality factors within the virality estimation submodule, providing additional signals for ranking.
The outputs of these two streams are fused in the ranking layer by concatenating their hidden representations, followed by a feed-forward transformation that projects them into a shared latent space. This joint representation captures both the temporal dynamics of user interests and the multimodal cues of content virality, enabling the ranking module to learn interactions between personalized preferences and propagation potential. This unified representation is delivered into the hierarchy classification heads, where prediction proceeds across multiple levels. At the first level, a binary classification is performed to determine the likelihood of a user click. At the second level, a pairwise mechanism is adopted to learn relative preferences among candidate items. At the third level, a listwise mechanism is applied to optimize the global ranking of the entire candidate set, ensuring both the rationality and distinctiveness of the recommendation outputs.
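A minimal PyTorch sketch of this fusion step is shown below, assuming the temporal interest vector and the virality representation are produced upstream; the dimensions and the two-layer projection are illustrative choices rather than the exact configuration.

```python
import torch
import torch.nn as nn

class RankingFusion(nn.Module):
    """Concatenate the interest vector and the virality representation, then
    project the result into a shared latent space for the ranking heads."""
    def __init__(self, d_interest: int = 512, d_virality: int = 512, d_joint: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_interest + d_virality, d_joint),
            nn.ReLU(),
            nn.Linear(d_joint, d_joint),
        )

    def forward(self, interest: torch.Tensor, virality: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([interest, virality], dim=-1))  # joint representation

z = RankingFusion()(torch.randn(8, 512), torch.randn(8, 512))
```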
To further enhance inference efficiency and to mitigate redundant computation across large-scale candidate pools, a joint pruning optimization module is introduced after the ranking stage. This module leverages attention weights and gradient sensitivity analysis to retain only the most informative feature dimensions, while dynamically pruning a subset of attention heads and feed-forward units in the transformer. Such pruning substantially reduces computational overhead without undermining performance. Moreover, pruning policies can be adapted online according to real-time feedback, enabling the model to maintain a balanced trade-off between accuracy and efficiency under varying deployment environments.

3.3.2. Multimodal Virality Encoder

In the design of the multimodal virality modeling module, the inputs include video texts, cover images, audio spectrograms, and comment streams, with the objective of generating high-dimensional representations that reflect the virality potential of content, as shown in Figure 2. The module is constructed on a multimodal transformer, where independent feature extraction and unified embedding are first performed for each modality, followed by cross-modal attention for semantic alignment, and finally fused into a comprehensive virality representation. Specifically, the textual modality is encoded via BERT, producing sentence-level features of dimension d_t. The visual modality is represented by key frames or cover images processed through ResNet and vision transformers, yielding visual features of dimension d_v. The acoustic modality is transformed into Mel-spectrograms and further modeled by convolutional networks, producing time–frequency features of dimension d_a. Comments and bullet-screen data are encoded with a lightweight transformer, producing interaction features of dimension d_c. All modality-specific features are then projected into a shared d_m-dimensional space via linear layers, with position encodings and modality-specific identifiers added to distinguish modalities during subsequent self-attention computation.
Within the transformer, each modality-specific representation is updated through multi-head self-attention, formulated as
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V,
where Q, K, and V are derived from input features through learnable transformations. On this basis, cross-modal attention integrates inter-modality correlations, aligning semantics between text and images, while allowing audio and comments to complement emotional cues, thereby enhancing the overall discriminative power. In terms of network configuration, a four-layer transformer encoder is adopted, each with eight attention heads, a hidden dimension d_m = 512, and a feed-forward expansion dimension of 2048, combined with residual connections and layer normalization for training stability. Mathematically, the fused multimodal representation can be expressed as
H = [H_t; H_v; H_a; H_c]\, W_f,
where [·;·] denotes the concatenation operation and W_f represents the fusion weight matrix. To highlight virality-related characteristics, a virality factor estimation function is further introduced as
p = f(H),
where a fully connected layer followed by a sigmoid activation generates a normalized virality score. The underlying virality factors can be conceptualized as latent variables that encapsulate the propagation potential of content across multiple modalities. More formally, they constitute a nonlinear projection of the joint feature space—encompassing text, image, audio, and user comments—into a low-dimensional subspace optimized for dissemination-related objectives. In this formulation, the virality factors function as proxy variables for real-world popularity indicators, such as share counts or like-growth rates, while preserving the flexibility to capture subtle multimodal cues that may not be directly observable. The resulting score is incorporated into the ranking stage as an auxiliary feature and simultaneously leveraged as an additional supervision signal during training, thereby encouraging the model to learn representations that more faithfully reflect real-world virality dynamics.
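The sketch below illustrates the encoder structure under the stated configuration (d_m = 512, four layers, eight heads). The per-modality backbones are replaced by precomputed feature tensors with placeholder input dimensions, and a learned modality-type embedding stands in for the modality identifiers; these substitutions are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn

class ViralityEncoder(nn.Module):
    """Sketch of the multimodal virality encoder: project each modality into a shared
    d_m-dimensional space, add modality identifiers, run a 4-layer / 8-head transformer,
    concatenate, and map the fused representation H to a virality score p."""
    def __init__(self, dims=(768, 2048, 128, 256), d_m=512, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_m) for d in dims])   # text/image/audio/comments
        self.mod_embed = nn.Parameter(torch.zeros(len(dims), d_m))     # modality identifiers
        layer = nn.TransformerEncoderLayer(d_model=d_m, nhead=n_heads,
                                           dim_feedforward=2048, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.virality_head = nn.Linear(d_m * len(dims), 1)

    def forward(self, feats):  # feats: list of (B, d_i) modality features
        tokens = torch.stack([proj(feat) + self.mod_embed[i]
                              for i, (proj, feat) in enumerate(zip(self.proj, feats))], dim=1)
        h = self.encoder(tokens)                         # attention across modality tokens
        fused = h.flatten(1)                             # H = [H_t; H_v; H_a; H_c]
        p = torch.sigmoid(self.virality_head(fused))     # normalized virality score
        return fused, p

enc = ViralityEncoder()
fused, p = enc([torch.randn(2, 768), torch.randn(2, 2048),
                torch.randn(2, 128), torch.randn(2, 256)])
```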

3.3.3. Hierarchy Classification Heads

To achieve unified outputs for CTR prediction and ranking optimization, hierarchical classification heads are designed as cascaded multi-task discriminators, with inputs consisting of the fused dual-stream representation z ∈ R^{d_m} and the virality factor p ∈ (0, 1) from the multimodal virality encoder, as shown in Figure 3.
Structurally, a four-stage cascade is employed, where each stage comprises convolutional feature aggregation combined with gated MLPs, integrated with corresponding prior constraints Prior_k. In the cascade heads, we adopt a gating recursion mechanism to progressively refine the prediction across layers. Concretely, the output of the l-th head is passed through a sigmoid gate that controls how much information is propagated to the (l+1)-th head. This design allows each stage to selectively emphasize salient signals while suppressing noise, and the recursive structure ensures that residual information is preserved. Formally, the recursion can be written as
h_{l+1} = \sigma(W_g h_l) \odot f(h_l) + h_l,
where f(·) denotes the transformation in the current head, and ⊙ is element-wise multiplication. This establishes a progressive relationship denoted as “Cascade head k–Prior k–level k”, as illustrated in the module schematic. In the first stage, the visual branch processes grid features of resolution 14 × 14 × 256, followed by a 3 × 3 convolution (256 → 128, stride 1) and another 3 × 3 convolution (128 → 64), with global average pooling resulting in v_1 ∈ R^{64}. The textual–acoustic–comment branch undergoes self-attention pooling and linear mapping to c_1 ∈ R^{64}. The concatenation of the two branches is then processed by an MLP (128 → 64 → 32) to yield s_1 ∈ R^{32}, producing a logit y_1 = w_1 s_1. The second stage reduces the resolution to 7 × 7 × 256, with convolutions 3 × 3 (256 → 128) and 1 × 1 (128 → 64), while the MLP remains (128 → 64 → 32). The third stage employs 4 × 4 × 128 features with 3 × 3 convolution and global pooling, while the fourth stage directly applies an MLP (d_m + 1 → 128 → 64 → 1) to the concatenated vector of z and p, yielding y_4. The gating recursion in the cascade is defined as
z_{k+1} = z_k + \sigma(\beta_k y_k)\, U_k s_k, \quad z_1 = z,
where σ denotes the sigmoid function and U_k is a linear transformation. This formulation amplifies effective residual information in low-uncertainty samples, stabilizing subsequent discrimination. The joint scoring with priors is defined as
r_k = \alpha_k y_k + (1 - \alpha_k) \log \pi_k, \quad \pi_k = \mathrm{Prior}_k,
with learnable α_k ∈ (0, 1). Denoting the binary cross-entropy loss of stage k as L_{ce}(y_k) and the KL regularization with the prior as L_{kl}(\pi_k \,\|\, \sigma(y_k)), the following inequality holds:
-\log \sigma(r_k) \le \alpha_k L_{ce}(y_k) + (1 - \alpha_k) L_{kl}(\pi_k \,\|\, \sigma(y_k)),
which follows from Jensen’s inequality. This demonstrates that the cascade heads preserve discriminative capacity while leveraging priors to reduce estimation variance, thereby enhancing robustness against cold-start and bursty samples. The final ranking score is computed via dual-objective aggregation:
s = \sum_{k=1}^{4} \lambda_k r_k + \gamma\, g(p), \quad g(p) = \log\frac{p}{1 - p},
where λ_k, γ > 0. During training, pairwise and listwise losses are combined to align relative preferences with global ranking:
L_{\mathrm{pair}} = -\log \sigma(s_i - s_j), \qquad L_{\mathrm{list}} = -\sum_i \frac{e^{s_i}}{\sum_j e^{s_j}} \log \frac{e^{s_i}}{\sum_j e^{s_j}}.
Since z_{k+1} is constructed via gated residuals of y_k, variance decomposition based on the law of total variance yields \mathrm{Var}(z_{k+1}) = \mathbb{E}[\mathrm{Var}(z_{k+1} \mid y_k)] + \mathrm{Var}(\mathbb{E}[z_{k+1} \mid y_k]), where the gating term \sigma(\beta_k y_k) suppresses the conditional variance in noisy intervals, reducing generalization error. Integration with Prior_k across multi-resolution levels (14 × 14, 7 × 7, 4 × 4, 1 × 1) enables the progressive filtering of candidate sets. Priors provide coarse-grained virality and domain constraints, while cascade heads deliver fine-grained discrimination. Their multiplicative or weighted fusion is theoretically equivalent to linear separability in logarithmic space, thereby improving consistency and stability in click prediction and ranking while maintaining computational efficiency, as channel widths remain bounded (256 → 64).
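A simplified sketch of the gating recursion and the two ranking losses follows. Each cascade head is reduced to a small MLP, the convolutional branches, priors, and virality term are omitted for brevity, and all dimensions are illustrative; the listwise function simply mirrors the L_list expression given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCascade(nn.Module):
    def __init__(self, d: int = 512, n_stages: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
                                    for _ in range(n_stages)])
        self.gates = nn.ModuleList([nn.Linear(d, d) for _ in range(n_stages)])
        self.logits = nn.ModuleList([nn.Linear(d, 1) for _ in range(n_stages)])

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h, logits = z, []
        for head, gate, out in zip(self.heads, self.gates, self.logits):
            h = torch.sigmoid(gate(h)) * head(h) + h     # gated residual refinement
            logits.append(out(h).squeeze(-1))            # per-stage logit y_k
        return torch.stack(logits, dim=-1).sum(-1)       # aggregated score (priors omitted)

def pairwise_loss(s_pos: torch.Tensor, s_neg: torch.Tensor) -> torch.Tensor:
    return -torch.log(torch.sigmoid(s_pos - s_neg)).mean()   # L_pair

def listwise_loss(scores: torch.Tensor) -> torch.Tensor:
    p = F.softmax(scores, dim=-1)                             # softmax over the candidate list
    return -(p * torch.log(p + 1e-9)).sum(-1).mean()          # mirrors the L_list form above

s = GatedCascade()(torch.randn(8, 512))
```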

3.3.4. Joint Pruning Optimization Module

The joint pruning optimization module is designed to reduce computational and memory overhead while maintaining recommendation performance, thereby enabling real-time scalability in online recommendation scenarios. This module integrates pruning strategies with attention cache optimization, progressively eliminating redundant feature channels and low-contribution attention heads, while leveraging cross-GPU/CPU cache scheduling to minimize inference latency, as shown in Figure 4.
Architecturally, it comprises three pruning sub-networks and one cache management unit, where each sub-network targets feature channels, attention heads, or residual pathways. For the input fused tensor x ∈ R^{B×H×W×C}, the first sub-network applies channel pruning after convolutional layers, with the initial C = 256 reduced to C′ = 128 via a 1 × 1 projection and pruning mask M_c ∈ {0, 1}^{C′}:
x' = (x W_c) \odot M_c, \quad W_c \in \mathbb{R}^{C \times C'},
where ⊙ denotes element-wise multiplication. The second sub-network prunes multi-head attention, reducing the heads from h = 8 to h′ = 4 via a learnable gating vector g ∈ [0, 1]^h:
\mathrm{Att}(Q, K, V) = \sum_{i=1}^{h} g_i\, \mathrm{softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i, \quad \|g\|_0 = h',
ensuring that only high-weight subspaces contribute to fusion. The third sub-network applies stochastic pruning on residual branches, discarding low-sensitivity paths with probability ρ_l:
z_{l+1} = f(z_l) + \delta_l z_l, \quad \delta_l \sim \mathrm{Bernoulli}(1 - \rho_l),
where f(·) is a nonlinear mapping. This reduces the computational load while retaining the major gradient flow. The cache unit implements dynamic GPU/CPU migration for key-value caches, downsampling or swapping low-importance token blocks to reduce memory usage:
C_{t+1} = \Phi(C_t, M_p), \quad M_p \in \{0, 1\}^L,
where C_t denotes the cache state at step t, M_p the pruning mask, and Φ the migration operator. Mathematically, pruning can be formalized as a constrained optimization problem, minimizing the complexity Ω(M) under a bounded error ϵ:
\min_{M} \Omega(M), \quad \text{s.t.}\ \ L(M) - L(M^*) \le \epsilon,
where M is the pruned model, M^* the full model, and L the training loss. As Ω is monotonic with the channel count, attention heads, and residual depth, pruning yields significant efficiency gains without exceeding ϵ. Integration with the multimodal virality encoder and hierarchical heads is realized via shared optimization objectives, where the pruned features x′ and virality encoding H are jointly input to the classification heads:
s = \psi(x', H, p),
with ψ denoting the classification mapping and p the virality score.
The joint pruning optimization module generates binary masks based on learned importance scores at different granularity levels (channels, attention heads, and residual paths). The importance scores are computed from weight magnitudes and activation statistics, and a differentiable thresholding function is applied to obtain the pruning mask. During training, the mask is updated jointly with the model parameters, which allows the pruning strategy to gradually converge toward retaining only the most informative components. This joint optimization ensures that pruning does not degrade ranking accuracy while significantly improving inference efficiency. This joint pruning design dynamically reallocates computation, sustaining low latency and high accuracy in high-concurrency scenarios such as short video and news recommendations, while preserving CTR prediction precision at reduced inference cost.
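The sketch below condenses the three pruning mechanisms, assuming the importance scores and gates are learned elsewhere in training; hard top-k masking is shown for clarity, whereas the text above describes a differentiable thresholding variant.

```python
import torch
import torch.nn as nn

def channel_prune(x: torch.Tensor, w_c: torch.Tensor, importance: torch.Tensor,
                  keep_ratio: float = 0.5) -> torch.Tensor:
    """x' = (x W_c) ⊙ M_c: project with a 1x1 map, then keep the top-scoring channels."""
    y = x @ w_c                                    # (B, H, W, C')
    k = max(1, int(importance.numel() * keep_ratio))
    mask = torch.zeros_like(importance)
    mask[importance.topk(k).indices] = 1.0         # binary mask M_c
    return y * mask

def prune_heads(head_outputs: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    """Gated sum over per-head attention outputs (head_outputs: B, h, T, d), g in [0, 1]^h."""
    g = torch.sigmoid(gate)                        # soft gate; can be thresholded to 0/1
    return (g.view(1, -1, 1, 1) * head_outputs).sum(dim=1)

class StochasticResidual(nn.Module):
    """z_{l+1} = f(z_l) + delta_l * z_l with delta_l ~ Bernoulli(1 - rho_l) at training time."""
    def __init__(self, f: nn.Module, rho: float = 0.2):
        super().__init__()
        self.f, self.rho = f, rho

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        delta = 1.0 if not self.training else float(torch.rand(()) > self.rho)
        return self.f(z) + delta * z
```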

4. Results and Discussion

4.1. Hyperparameter Settings

To ensure reproducibility and provide clarity on the training process, we summarize the key hyperparameters used in our experiments. The model is optimized using the Adam optimizer with an initial learning rate of 1 × 10^{-4}, which is linearly warmed up for the first 5% of training steps and then decayed with a cosine schedule. We set the batch size to 128 for all experiments, and the number of training epochs is fixed at 30 unless otherwise specified. A dropout rate of 0.2 and an L2 regularization coefficient of 1 × 10^{-5} are applied to prevent overfitting. For hyperparameter tuning, we adopt a two-stage strategy. First, a coarse grid search is performed over a predefined range (e.g., learning rate ∈ {1 × 10^{-3}, 1 × 10^{-4}, 5 × 10^{-5}}; batch size ∈ {64, 128, 256}), using a held-out validation set to select promising configurations. Second, a finer search around the best-performing settings is carried out with early stopping based on validation loss. All the reported results are obtained with the best configuration found under this procedure. In addition, each dataset is partitioned into 70% training, 10% validation, and 20% testing sets. For sequential logs, we adopt a temporal split to ensure that earlier interactions are used for training while later interactions are reserved for validation and testing, thereby preventing data leakage and simulating realistic recommendation scenarios.
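The stated optimization setup can be reproduced with a sketch like the following; the linear-warm-up-plus-cosine schedule is written manually via LambdaLR since the exact scheduler implementation is not specified, and the L2 regularization coefficient is expressed through Adam's weight_decay argument as a common proxy.

```python
import math
import torch

def build_optimizer(model: torch.nn.Module, total_steps: int,
                    lr: float = 1e-4, warmup_frac: float = 0.05,
                    weight_decay: float = 1e-5):
    """Adam with linear warm-up over the first 5% of steps, then cosine decay."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    warmup_steps = max(1, int(total_steps * warmup_frac))

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / warmup_steps                              # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))           # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```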

4.2. Evaluation Metrics

In the experimental evaluation, posts and comments related to “sugar-free”, “diet”, and “light food” were crawled from platforms such as Weibo and Xiaohongshu. Paired text–image data were filtered and combined with expert annotations to construct a pseudo-health misinformation identification dataset, enabling a comprehensive assessment of the proposed model’s accuracy in real-world health information recognition tasks. The primary evaluation metrics included the classification accuracy (Accuracy), precision (Precision), recall (Recall), F1-score (F1-Score), and area under the receiver operating characteristic curve (AUROC). These metrics were employed to quantify the predictive accuracy, discriminative ability, and overall performance. The mathematical definitions of the evaluation metrics are given as follows:
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
\mathrm{Precision} = \frac{TP}{TP + FP},
\mathrm{Recall} = \frac{TP}{TP + FN},
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},
\mathrm{AUROC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR})\, d(\mathrm{FPR}),
where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. TPR denotes the true positive rate, representing the proportion of correctly classified positive instances, while FPR denotes the false positive rate, indicating the proportion of negative instances incorrectly classified as positive. AUROC measures the discriminative capability of the model under different classification thresholds by calculating the area under the TPR–FPR curve.
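For reference, these metrics can be computed with scikit-learn as sketched below; the 0.5 decision threshold used to binarize predicted scores is an assumption.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_score, threshold: float = 0.5) -> dict:
    """Compute Accuracy, Precision, Recall, F1, and AUROC from labels and predicted scores."""
    y_pred = [int(s >= threshold) for s in y_score]
    return {
        "Accuracy":  accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall":    recall_score(y_true, y_pred),
        "F1":        f1_score(y_true, y_pred),
        "AUROC":     roc_auc_score(y_true, y_score),   # area under the TPR-FPR curve
    }

print(evaluate([1, 0, 1, 1, 0], [0.9, 0.2, 0.7, 0.4, 0.1]))
```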

4.3. Baseline

In the experiments, comparisons were conducted with three categories of models: text-based models (TextCNN [46], BERT [47]), multimodal models (ViLBERT [48], VisualBERT [49], CLIP [50]), and GNN models (TextGCN [51], HeteroGNN [52]).
TextCNN, built upon convolutional neural networks, extracts local textual features through multi-channel convolution operations, offering high computational efficiency and ease of implementation, which makes it particularly suitable for short-text classification tasks. BERT employs bidirectional Transformer encoders to capture contextual information during pretraining, thereby enabling deeper semantic understanding and achieving outstanding performance across a wide range of natural language processing tasks. ViLBERT extends the BERT architecture by introducing a dual-stream structure that processes visual and textual inputs separately while employing co-attention mechanisms to facilitate cross-modal interactions, making it effective for tasks such as visual question answering. VisualBERT provides a more streamlined framework for vision–language modeling by directly integrating visual and textual information, demonstrating strong performance in tasks like image captioning. CLIP leverages large-scale image–text pairs for contrastive pretraining, aligning visual and textual representations in a shared embedding space, which enables zero-shot classification and versatile cross-modal applications. TextGCN formulates text classification as a graph learning problem, constructing a text graph to capture semantic relations, thereby showing particular strength in long-text classification. HeteroGNN, designed for heterogeneous graphs with multiple types of nodes and edges, is well-suited for multimodal information fusion, offering strong representational capacity in complex multi-source scenarios.

4.4. Overall Performance Comparison with Baseline Models

The purpose of this experiment is to validate the overall effectiveness and superiority of the proposed method in health information recognition tasks through a comprehensive comparison with multiple representative baseline models.
As shown in Table 2, and Figure 5 and Figure 6, the traditional text convolutional model TextCNN demonstrates the ability to capture local features in short-text scenarios, maintaining accuracy and recall around 0.82. However, its overall performance is constrained by the inability to model long-range dependencies. In contrast, BERT, leveraging a bidirectional Transformer structure, enables deeper contextual understanding, thereby outperforming TextCNN across all metrics, with the AUROC improved to 0.902, indicating stronger discriminative capability in semantic modeling. Multimodal models such as ViLBERT and VisualBERT establish interactions between visual and textual modalities, resulting in higher Precision and Recall compared with TextCNN and BERT, reaching 0.847 and 0.855, respectively, which highlights the importance of cross-modal modeling for complex health information recognition. CLIP further relies on large-scale image–text contrastive pretraining, offering powerful cross-modal alignment capabilities. Consequently, it surpasses other multimodal models across all metrics, achieving an AUROC of 0.918, reflecting its advantage in open-domain cross-modal tasks. Graph neural network models, including TextGCN and HeteroGNN, demonstrate the potential of graph structures in modeling semantic relations and heterogeneous information sources. Among them, HeteroGNN, by better exploiting different types of nodes and edges, achieves superior overall results compared to TextGCN, with both the Precision and Recall exceeding 0.85. Ultimately, the proposed model achieves the best results across all metrics, with the accuracy reaching 0.926 and AUROC 0.953, significantly surpassing all baselines and fully validating its comprehensive advantages in complex multimodal health information recognition.
From a theoretical perspective, these results can be explained by the mathematical characteristics of the respective models. TextCNN extracts local features through convolutional kernel sliding, which is suitable for short segments but lacks global dependency modeling, thereby limiting performance. BERT employs a self-attention mechanism to realize global interactions between arbitrary sequence positions, enabling the better approximation of complex contextual distributions in the probability space and improving overall performance. ViLBERT and VisualBERT introduce cross-modal attention weight allocation, allowing interactions between visual and textual features in a joint embedding space. However, their performance improvement remains limited due to the scale of pretraining data and parameter-sharing strategies. CLIP, by adopting a contrastive learning objective, maximizes the similarity between matched image–text pairs while minimizing that of mismatched pairs in high-dimensional embedding space. This geometric optimization mechanism enhances modality alignment, leading to superior results. Graph neural networks apply graph convolution or message-passing mechanisms to map node features into neighborhood spaces, thereby learning higher-order dependencies. HeteroGNN, in particular, distinguishes different types of edges and nodes in its adjacency matrix operations, yielding better performance in complex scenarios than TextGCN. Building upon these mechanisms, the proposed method integrates dynamic interest evolution and multimodal propagation modeling. Through joint optimization of the loss function and hierarchical feature representation, the model maintains more stable decision boundaries in embedding space, theoretically ensuring stronger generalization ability and robustness.

4.5. Ablation Study on Different Modules of the Proposed Model

The purpose of this experiment is to validate, through ablation, the independent role and combined contribution of the three key modules in the proposed model, clarifying the importance of each module to the overall performance improvement.
As shown in Table 3 and Figure 7, the full model achieves the best performance across all metrics, with an accuracy of 0.926, precision of 0.909, recall of 0.903, F1-score of 0.906, and AUROC of 0.953, demonstrating that the overall framework has significant advantages in multimodal health information recognition tasks. When the Multimodal Virality Encoder is removed, the accuracy drops to 0.891 and the F1-score falls to 0.869, indicating its crucial role in capturing information propagation potential and cross-modal interactions. Without the Hierarchy Classification Heads, the performance drops even more noticeably, with the accuracy decreasing to 0.887 and AUROC to 0.917, showing that the hierarchical classification structure refines semantic boundaries layer by layer, ensuring consistency between global ranking and local discrimination. Removing the Joint Pruning Optimization Module results in an accuracy of 0.894 and an F1-score of 0.873. Although the decline is smaller, it still highlights this module’s value in improving efficiency and maintaining robustness. Overall, all three modules are indispensable for performance, though their contributions vary in degree.
From a theoretical perspective, the performance differences among the modules can be explained by their mathematical properties. The Multimodal Virality Encoder, through cross-modal attention mechanisms and the modeling of latent propagation factors, enhances the discriminability of inputs in high-dimensional embedding space. Its absence leads to insufficient cross-modal feature alignment, weakening overall performance. The Hierarchy Classification Heads mathematically correspond to constructing layered loss optimization and multi-level supervision, ensuring consistency in nested probability distributions and sharpening semantic boundaries. Without them, the model cannot effectively capture global and local hierarchical dependencies. The Joint Pruning Optimization Module applies joint sparsity constraints and structural optimization to prune redundant paths and stabilize gradient propagation in parameter space. Its main contribution lies in improving computational efficiency and enhancing generalization. Although its direct impact on accuracy is limited, it ensures robustness in large-scale training and real-world deployment. The final results show that multi-module joint optimization not only improves discriminative performance but also theoretically ensures robustness and scalability in complex tasks.

4.6. Robustness Analysis Under Noisy and Imbalanced Data Conditions

The objective of this experiment is to evaluate the robustness of the proposed model under conditions of noisy samples and imbalanced data distributions, thereby examining its applicability and stability in complex real-world environments.
As shown in Table 4, when exposed to 20% noisy labels, the model achieves an accuracy of 0.884 and a recall of 0.861, indicating that it maintains strong discriminative power in the feature space even with partially incorrect annotations. Under the condition of 30% data imbalance, the model attains an accuracy of 0.893, with Precision and Recall both remaining around 0.87, suggesting that it retains a balanced predictive capability despite uneven class distributions. However, when noise and imbalance increase simultaneously to 50%, performance declines most significantly, with the accuracy decreasing to 0.861 and AUROC reduced to 0.903, demonstrating that extreme conditions weaken generalization ability, though the performance remains within an acceptable range. In comparison, the complete model with clean data achieves the best results, with an accuracy of 0.926 and AUROC of 0.953, confirming that the designed architecture can maximize its potential under ideal conditions.
From a theoretical perspective, these results can be attributed to the mathematical characteristics of the model. In the presence of noisy labels, the multimodal feature fusion and graph-structured propagation mechanisms reduce the impact of erroneous annotations on local weight allocation within the embedding space, thereby maintaining a stable decision boundary. Under imbalanced data distributions, symbol drift modeling and attention mechanisms assign adaptive importance to different modalities, mitigating the dominance of majority class samples on gradient updates, which preserves balanced Precision and Recall. When noise and imbalance are combined, error propagation and gradient shifts interact, intensifying distributional deviations in the embedding space and reducing discriminative capacity. Nevertheless, through the joint optimization of the loss function and hierarchical feature representations, the model sustains global geometric stability, ensuring that relatively high performance is maintained even in extreme conditions.

4.7. Discussion

In the discussion, emphasis is placed on the practical value and potential impact of the proposed dual-stream recommendation framework in real-world application scenarios. Taking short video platforms as an example, users are exposed daily to thousands of newly uploaded pieces of content, ranging from highly entertaining lightweight clips to knowledge-oriented and news-related materials. Traditional recommendation approaches typically rely only on historical user click behaviors, which often leads to homogeneous recommendations and filter bubble effects. By contrast, the proposed method incorporates both the temporal interest modeling stream and the multimodal virality encoder, allowing the system not only to capture both short-term and long-term preference dynamics but also to dynamically assess the potential virality of candidate items. For instance, when a newly uploaded video rapidly accumulates likes and shares, its virality potential is sensitively detected by the model and integrated into the ranking stage along with user-specific preferences, thereby enabling recommendations that balance individual interest with global content popularity. This mechanism holds direct implications for hot topic tracking and user experience enhancement.
In the context of news recommendation, the framework demonstrates equally significant advantages. The value of news content is strongly dependent on timeliness and virality, and relying solely on historical user interests tends to overlook breaking events or trending topics. Through the multimodal virality encoder, textual content, accompanying images, publication times, and user comments are jointly modeled, enabling the system to promptly identify which news items are more likely to spread and attract greater user attention. This ensures timeliness and relevance in recommendations during emergent events. Additionally, the hierarchical classification heads refine the process from click prediction to global ranking, ensuring consistency between individual preferences and overall recommendation lists, while mitigating the excessive promotion of trending items and preventing the complete neglect of niche content. The introduction of the joint pruning optimization module allows this complex model to achieve efficient inference under large-scale online settings, fulfilling real-time requirements, especially in high-concurrency environments such as news applications and short video platforms. Thus, the framework demonstrates not only advantages in experimental metrics but also comprehensive capacity in accuracy, timeliness, and efficiency in practical deployment.

4.8. Limitations and Future Work

Although the proposed dual-stream recommendation framework achieves promising results in multimodal feature modeling and temporal interest capture, certain limitations remain. First, the model design relies on the completeness and quality of multimodal features, yet in real-world applications, user-generated content often suffers from missing modalities or noise. Under such circumstances, the multimodal virality encoder may fail to fully realize its potential, thereby affecting the stability of recommendation outcomes. Second, while the joint pruning optimization module significantly improves inference efficiency and computational cost, its dynamic adjustment depends on the stability of system feedback. In high-concurrency scenarios, delayed or fluctuating feedback may undermine the optimality of pruning strategies, ultimately leading to decreased recommendation accuracy.
Future research may be carried out in several directions. On the data side, more robust multimodal representation learning methods should be explored to ensure that the model maintains high predictive accuracy even when modalities are missing or noisy. On the modeling side, closer coupling of content generation and recommendation tasks could be considered, allowing the system to proactively predict and generate potential trending content, thereby enhancing user experience and platform engagement. Moreover, cross-regional and cross-cultural recommendation transfer deserves further attention, particularly in multilingual environments where significant differences exist in user interest patterns and content propagation dynamics. Ensuring the model’s generalization across such diverse contexts remains an open challenge to be addressed.

5. Conclusions

In this work, we introduced a high-precision recommendation framework that simultaneously models temporal interest dynamics and content virality. The framework is designed with two complementary streams: a temporal stream that captures the evolution of user preferences through self-attention-based sequential modeling, and a multimodal virality encoder that learns latent virality factors from text, image, audio, and user comment modalities. The outputs of the two streams are fused in the ranking layer to construct a joint representation that balances personalized interests with dissemination potential. To further enhance effectiveness and efficiency, we developed hierarchical cascade heads with gating recursion, which progressively refine prediction signals, as well as a multi-level pruning and cache management strategy that reduces redundancy during inference. Together, these components enable the model to achieve both strong predictive performance and practical scalability. Comprehensive experiments on three large-scale real-world datasets—Douyin, Bilibili, and MIND—show that our method consistently surpasses competitive baselines across multiple evaluation metrics. Moreover, interpretability analyses of the virality factors confirm their correlation with real-world popularity indicators such as share counts and like-growth rates, highlighting their explanatory value for understanding recommendation outcomes. These results validate the proposed framework as an accurate, efficient, and interpretable solution for recommendation in big data environments.

Author Contributions

Conceptualization, Z.Y., J.Y., M.L., and Y.Z.; Data Curation, F.M.; Funding Acquisition, Y.Z.; Investigation, M.L.; Methodology, Z.Y. and J.Y.; Project Administration, Y.Z.; Resources, F.M.; Software, Z.Y. and J.Y.; Supervision, Y.Z.; Validation, M.L.; Visualization, F.M.; Writing—Original Draft, Z.Y., J.Y., F.M., M.L., and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61202479.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, Q.; Liu, F.; Zhao, X.; Tan, Q. A CTR prediction model based on session interest. PLoS ONE 2022, 17, e0273048.
2. Zhou, C.; Bai, J.; Song, J.; Liu, X.; Zhao, Z.; Chen, X.; Gao, J. ATRank: An attention-based user behavior modeling framework for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
3. Wu, S.; Li, Z.; Su, Y.; Cui, Z.; Zhang, X.; Wang, L. GraphFM: Graph factorization machines for feature interaction modeling. arXiv 2021, arXiv:2105.11866.
4. Bai, J.; Geng, X.; Deng, J.; Xia, Z.; Jiang, H.; Yan, G.; Liang, J. A comprehensive survey on advertising click-through rate prediction algorithm. Knowl. Eng. Rev. 2025, 40, e3.
5. Zhou, X.; Chen, S.; Ren, Y.; Zhang, Y.; Fu, J.; Fan, D.; Lin, J.; Wang, Q. Atrous Pyramid GAN Segmentation Network for Fish Images with High Performance. Electronics 2022, 11, 911.
6. Zhang, W.; Han, Y.; Yi, B.; Zhang, Z. Click-through rate prediction model integrating user interest and multi-head attention mechanism. J. Big Data 2023, 10, 11.
7. He, L.; Chen, H.; Wang, D.; Jameel, S.; Yu, P.; Xu, G. Click-through rate prediction with multi-modal hypergraphs. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Queensland, Australia, 1–5 November 2021; pp. 690–699.
8. Deng, J.; Shen, D.; Wang, S.; Wu, X.; Yang, F.; Zhou, G.; Meng, G. ContentCTR: Frame-level live streaming click-through rate prediction with multimodal transformer. arXiv 2023, arXiv:2306.14392.
9. Deng, K.; Woodland, P.C. Multi-head Temporal Latent Attention. arXiv 2025, arXiv:2505.13544.
10. Zhou, G.; Zhu, X.; Song, C.; Fan, Y.; Zhu, H.; Ma, X.; Yan, Y.; Jin, J.; Li, H.; Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1059–1068.
11. Zhang, Y.; Xiao, Y.; Zhang, Y.; Zhang, T. Video saliency prediction via single feature enhancement and temporal recurrence. Eng. Appl. Artif. Intell. 2025, 160, 111840.
12. Zhou, G.; Mou, N.; Fan, Y.; Pi, Q.; Bian, W.; Zhou, C.; Zhu, X.; Gai, K. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5941–5948.
13. Dontu, S.; Addula, S.R.; Pareek, P.K.; Vallabhaneni, R.; Adnan, M.M. Attack detection from Internet of Things using TPE based self-attention based bidirectional long-short term memory. In Proceedings of the 2024 International Conference on Intelligent Algorithms for Computational Intelligence Systems (IACIS), Hassan, India, 23–24 August 2024; pp. 1–6.
14. Xiu, Z. Financial Transaction Anomaly Detection Based on Transformer Model. Procedia Comput. Sci. 2025, 262, 1209–1216.
15. Meenakshi, B.; Karunkuzhali, D. Enhancing cyber security in WSN using optimized self-attention-based provisional variational auto-encoder generative adversarial network. Comput. Stand. Interfaces 2024, 88, 103802.
16. de Souza Pereira Moreira, G.; Rabhi, S.; Lee, J.M.; Ak, R.; Oldridge, E. Transformers4Rec: Bridging the gap between NLP and sequential/session-based recommendation. In Proceedings of the 15th ACM Conference on Recommender Systems, Amsterdam, The Netherlands, 27 September–1 October 2021; pp. 143–153.
17. Wang, Q.; Liu, F.; Xing, S.; Zhao, X.; Li, T. Research on CTR prediction based on deep learning. IEEE Access 2018, 7, 12779–12789.
18. Zhou, F.; Kong, Q.; Zhang, Y. Advances in Temporal Point Processes: Bayesian, Deep, and LLM Approaches. arXiv 2025, arXiv:2501.14291.
19. Huang, C.; Wang, S.; Wang, X.; Yao, L. Modeling temporal positive and negative excitation for sequential recommendation. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 1252–1263.
20. Yang, Y.; Zhang, L.; Liu, J. Temporal user interest modeling for online advertising using Bi-LSTM network improved by an updated version of Parrot Optimizer. Sci. Rep. 2025, 15, 18858.
21. Fournier, Q.; Caron, G.M.; Aloise, D. A practical survey on faster and lighter transformers. ACM Comput. Surv. 2023, 55, 304.
22. Taha, M.A. Logarithmic memory networks (LMNs): Efficient long-range sequence modeling for resource-constrained environments. arXiv 2025, arXiv:2501.07905.
23. Wei, Y.; Wang, X.; Nie, L.; He, X.; Hong, R.; Chua, T.S. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1437–1445.
24. Li, T.; Yang, X.; Ke, Y.; Wang, B.; Liu, Y.; Xu, J. Alleviating the inconsistency of multimodal data in cross-modal retrieval. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; pp. 4643–4656.
25. Zhang, L.; Zhang, Y.; Ma, X. A new strategy for tuning ReLUs: Self-adaptive linear units (SALUs). In Proceedings of the ICMLCA 2021, 2nd International Conference on Machine Learning and Computer Application, Shenyang, China, 17–19 December 2021; pp. 1–8.
26. Kim, W.; Son, B.; Kim, I. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 5583–5594.
27. Gan, Z.; Li, L.; Li, C.; Wang, L.; Liu, Z.; Gao, J. Vision-language pre-training: Basics, recent advances, and future trends. Found. Trends Comput. Graph. Vis. 2022, 14, 163–352.
28. Singh, A.; Hu, R.; Goswami, V.; Couairon, G.; Galuba, W.; Rohrbach, M.; Kiela, D. FLAVA: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15638–15650.
29. Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; Schmid, C. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7464–7473.
30. Li, M.; Xu, R.; Wang, S.; Zhou, L.; Lin, X.; Zhu, C.; Zeng, M.; Ji, H.; Chang, S.F. CLIP-Event: Connecting text and images with event structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16420–16429.
31. Woo, J.; Chen, H. Epidemic model for information diffusion in web forums: Experiments in marketing exchange and political dialog. SpringerPlus 2016, 5, 66.
32. Xu, Z.; Qian, M. Predicting popularity of viral content in social media through a temporal-spatial cascade convolutional learning framework. Mathematics 2023, 11, 3059.
33. Nguyen, P.T.; Huynh, V.D.B.; Vo, K.D.; Phan, P.T.; Le, D.N. Deep learning based optimal multimodal fusion framework for intrusion detection systems for healthcare data. Comput. Mater. Contin. 2021, 66, 2555–2571.
34. Dean, S.; Dong, E.; Jagadeesan, M.; Leqi, L. Recommender systems as dynamical systems: Interactions with viewers and creators. In Proceedings of the Workshop on Recommendation Ecosystems: Modeling, Optimization and Incentive Design, Vancouver, BC, Canada, 26–27 February 2024.
35. Hu, Y.; Hu, C.; Fu, S.; Fang, M.; Xu, W. Predicting key events in the popularity evolution of online information. PLoS ONE 2017, 12, e0168749.
36. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
37. Deng, Z.; He, C.; Liu, Y.; Kim, K.C. Super-resolution reconstruction of turbulent velocity fields using a generative adversarial network-based artificial intelligence framework. Phys. Fluids 2019, 31, 125111.
38. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
39. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1597–1607.
40. Cheng, M.Y.; Kusoemo, D.; Gosno, R.A. Text mining-based construction site accident classification using hybrid supervised machine learning. Autom. Constr. 2020, 118, 103265.
41. Hládek, D.; Staš, J.; Pleva, M. Survey of automatic spelling correction. Electronics 2020, 9, 1670.
42. Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT pre-training of image transformers. arXiv 2021, arXiv:2106.08254.
43. Zhang, X.; Li, S.; Shi, N.; Hauer, B.; Wu, Z.; Kondrak, G.; Abdul-Mageed, M.; Lakshmanan, L.V. Cross-modal consistency in multimodal large language models. arXiv 2024, arXiv:2411.09273.
44. Xue, J.; Wang, Y.; Tian, Y.; Li, Y.; Shi, L.; Wei, L. Detecting fake news by exploring the consistency of multimodal data. Inf. Process. Manag. 2021, 58, 102610.
45. Li, M.; Gao, Y.; Zhao, H.; Li, R.; Chen, J. Progressive semantic aggregation and structured cognitive enhancement for image–text matching. Expert Syst. Appl. 2025, 274, 126943.
46. Alshubaily, I. TextCNN with attention for text classification. arXiv 2021, arXiv:2108.01921.
47. Koroteev, M.V. BERT: A review of applications in natural language processing and understanding. arXiv 2021, arXiv:2103.11943.
48. Voloshina, E.; Ilinykh, N.; Dobnik, S. Are language-and-vision transformers sensitive to discourse? A case study of ViLBERT. In Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023), Prague, Czech Republic, 12 September 2023; pp. 28–38.
49. Bandyopadhyay, D.; Hasanuzzaman, M.; Ekbal, A. Seeing through VisualBERT: A causal adventure on memetic landscapes. arXiv 2024, arXiv:2410.13488.
50. Hafner, M.; Katsantoni, M.; Köster, T.; Marks, J.; Mukherjee, J.; Staiger, D.; Ule, J.; Zavolan, M. CLIP and complementary methods. Nat. Rev. Methods Prim. 2021, 1, 20.
51. Visweswaran, M.; Mohan, J.; Kumar, S.S.; Soman, K. Synergistic detection of multimodal fake news leveraging TextGCN and Vision Transformer. Procedia Comput. Sci. 2024, 235, 142–151.
52. Khamis, A.K.; Agamy, M. Homogeneous versus heterogeneous graph representation for graph neural network tasks on electric circuits. IEEE Trans. Circuits Syst. Regul. Pap. 2025, 1–12.
Figure 1. Overall architecture of the proposed framework.
Figure 2. Flowchart of the multimodal virality encoder. Text, image, audio, and comment features are first embedded separately, while video inputs are processed through patch embedding and Swin Transformer blocks. The outputs are then progressively merged and fused in decoder layers through up-sampling and concatenation, forming a unified multimodal representation.
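A minimal sketch of this progressive merge-and-fuse pattern is given below, assuming precomputed per-modality embeddings of equal width; the pairwise concatenation-plus-projection stages stand in for the decoder layers in Figure 2, and the Swin Transformer video branch is omitted.

```python
import torch
import torch.nn as nn

class ProgressiveFusion(nn.Module):
    """Merge modality embeddings stage by stage: each stage concatenates the
    running representation with one more modality and projects it back."""
    def __init__(self, d=256):
        super().__init__()
        self.merge = nn.ModuleList([nn.Linear(2 * d, d) for _ in range(3)])

    def forward(self, text, image, audio, comment):
        h = torch.relu(self.merge[0](torch.cat([text, image], dim=-1)))
        h = torch.relu(self.merge[1](torch.cat([h, audio], dim=-1)))
        h = torch.relu(self.merge[2](torch.cat([h, comment], dim=-1)))
        return h  # unified multimodal representation

fusion = ProgressiveFusion()
emb = lambda: torch.randn(4, 256)
print(fusion(emb(), emb(), emb(), emb()).shape)  # torch.Size([4, 256])
```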
Figure 3. Overall structure of the hierarchical classification heads. The module is composed of four cascaded heads, each conditioned on a prior generated by a convolutional layer. At each level, the prior provides auxiliary guidance, and the cascade head refines the prediction progressively from level 1 to level 4.
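The following sketch shows one way such a cascade could be realized: each level scores a shared feature together with a prior carried over from the previous level, and a learned gate controls how strongly that prior is updated. Linear layers replace the convolutional prior generator for brevity; all shapes and the gating form are assumptions rather than the exact design.

```python
import torch
import torch.nn as nn

class CascadeHeads(nn.Module):
    """Four cascaded heads with a gated, progressively refined prior."""
    def __init__(self, d=128, levels=4):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d + 1, 1) for _ in range(levels)])
        self.gates = nn.ModuleList([nn.Linear(d + 1, 1) for _ in range(levels)])

    def forward(self, feat):                        # feat: (B, d) shared feature
        prior = feat.new_zeros(feat.size(0), 1)     # level-0 prior
        per_level = []
        for head, gate in zip(self.heads, self.gates):
            x = torch.cat([feat, prior], dim=-1)
            g = torch.sigmoid(gate(x))              # gating recursion
            out = head(x)
            prior = g * out + (1.0 - g) * prior     # refined prior for the next level
            per_level.append(out)
        return per_level[-1], per_level             # final logit plus intermediate levels

heads = CascadeHeads()
final, levels = heads(torch.randn(8, 128))
print(final.shape, len(levels))  # torch.Size([8, 1]) 4
```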
Figure 4. Efficient optimization pipeline during inference. The left part illustrates prompt token pruning, where only selected active blocks are forwarded to the next layers. The right part shows overlap-aware KV swapping, in which inactive blocks are offloaded from the GPU cache to the CPU and reactivated when needed. Different colors indicate token states, including active, pruned, decoded, and cached blocks.
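The two mechanisms in Figure 4 can be pictured with the short sketch below: a score-based pruning step keeps only the most-attended prompt tokens, and a toy offloader moves inactive key-value blocks between GPU and CPU memory. Both routines are simplified assumptions; a production system would overlap the transfers with computation and manage blocks at cache-page granularity.

```python
import torch

def prune_tokens(hidden, attn_scores, keep_ratio=0.5):
    """Keep only the most-attended tokens (Figure 4, left).
    hidden: (B, L, d); attn_scores: (B, L) attention mass received per token."""
    k = max(1, int(hidden.size(1) * keep_ratio))
    idx = attn_scores.topk(k, dim=1).indices.sort(dim=1).values    # keep original order
    return torch.gather(hidden, 1, idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))

class KVOffloader:
    """Toy KV swapping (Figure 4, right): inactive cache blocks go to CPU
    memory and are copied back when reactivated."""
    def __init__(self, device="cuda"):
        self.device = device
        self.gpu, self.cpu = {}, {}

    def deactivate(self, block_id):
        if block_id in self.gpu:
            self.cpu[block_id] = self.gpu.pop(block_id).to("cpu")

    def activate(self, block_id):
        if block_id in self.cpu:
            self.gpu[block_id] = self.cpu.pop(block_id).to(self.device)
        return self.gpu.get(block_id)

pruned = prune_tokens(torch.randn(2, 16, 64), torch.rand(2, 16), keep_ratio=0.25)
print(pruned.shape)  # torch.Size([2, 4, 64])
```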
Figure 5. Bar chart comparing models in terms of Accuracy, Precision, Recall, F1-score, and AUROC.
Figure 6. Radar chart comparing the performance of the proposed model with that of baseline methods across five evaluation metrics (Accuracy, Precision, Recall, F1-score, and AUROC). The proposed model consistently outperforms all baselines, as indicated by its larger coverage area across all dimensions.
Figure 7. Line chart of Accuracy, Precision, Recall, F1-score, and AUROC under ablation variants.
Table 1. Data statistics for short video and news platforms.
Dataset | Collection Period | Number of Items | User Behavior Logs
Short Video (Douyin/Bilibili) | 2023.01–2024.12 | 120,000 videos | 3,500,000 logs
News (MIND) | 2022.01–2023.12 | 65,000 articles | 1,200,000 logs
News (Self-built) | 2022.01–2023.12 | 40,000 articles | 850,000 logs
Table 2. Overall performance comparison with baseline models.
Model | Accuracy | Precision | Recall | F1-Score | AUROC
TextCNN | 0.842 | 0.826 | 0.818 | 0.822 | 0.876
BERT | 0.871 | 0.857 | 0.849 | 0.853 | 0.902
ViLBERT | 0.867 | 0.851 | 0.844 | 0.847 | 0.897
VisualBERT | 0.874 | 0.858 | 0.852 | 0.855 | 0.904
CLIP | 0.889 | 0.872 | 0.866 | 0.869 | 0.918
TextGCN | 0.853 | 0.837 | 0.832 | 0.834 | 0.884
HeteroGNN | 0.878 | 0.862 | 0.857 | 0.859 | 0.909
Proposed | 0.926 | 0.909 | 0.903 | 0.906 | 0.953
Note: The Proposed model achieves the best value in every column (bold in the original typeset table).
Table 3. Ablation study on different modules of the proposed model.
Model Variant | Accuracy | Precision | Recall | F1-Score | AUROC
w/o Multimodal Virality Encoder | 0.891 | 0.872 | 0.866 | 0.869 | 0.921
w/o Hierarchy Classification Heads | 0.887 | 0.868 | 0.862 | 0.865 | 0.917
w/o Joint Pruning Optimization Module | 0.894 | 0.876 | 0.870 | 0.873 | 0.924
Full Model | 0.926 | 0.909 | 0.903 | 0.906 | 0.953
Note: The Full Model achieves the best value in every column (bold in the original typeset table).
Table 4. Robustness analysis under noisy and imbalanced data conditions.
Condition | Accuracy | Precision | Recall | F1-Score | AUROC
20% Noisy Labels | 0.884 | 0.866 | 0.861 | 0.863 | 0.918
30% Data Imbalance | 0.893 | 0.875 | 0.870 | 0.872 | 0.927
50% Noisy + Imbalanced | 0.861 | 0.842 | 0.838 | 0.840 | 0.903
Clean Data (Full Model) | 0.926 | 0.909 | 0.903 | 0.906 | 0.953
Note: The Full Model on clean data achieves the best value in every column (bold in the original typeset table).
