MVCA: Multi-View Cross-Attention Framework for Robust Shilling Attack Detection

Zhai, Zhengli; Xu, Cheng; Li, Yang; Su, Shunqi

doi:10.3390/sym18030497


by Zhengli Zhai *, Cheng Xu *, Yang Li and Shunqi Su

School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266520, China

* Authors to whom correspondence should be addressed.

Symmetry 2026, 18(3), 497; https://doi.org/10.3390/sym18030497

Submission received: 26 January 2026 / Revised: 25 February 2026 / Accepted: 10 March 2026 / Published: 14 March 2026

(This article belongs to the Special Issue Symmetry and Asymmetry in Information Security and Network Security)


Abstract

Recommender systems are now integral to many online platforms, including e-commerce, social media, and content streaming services. However, their widespread use also exposes them to significant security threats. One of the most critical is the shilling attack, in which fake user profiles are injected to manipulate recommendation results. Such attacks undermine system fairness and erode user trust. Traditional detection methods mostly rely on a single perspective, such as rating profiles, temporal behavior, or graph structure, and they struggle with complex and changing attack strategies. We therefore propose a multi-view cross-attention (MVCA) attack detection framework. The framework integrates three complementary views: the user–item interaction graph structure, the temporal behavior sequence, and the local rating pattern. A bidirectional cross-attention mechanism enables deep information interaction between views, dynamically mines their latent correlations, allows all modules to be optimized jointly, and improves the accuracy of identifying fake users. Extensive experiments on the MovieLens and Netflix datasets show that MVCA generally outperforms several established baseline methods. Its strong performance across different attack types and scales demonstrates the method's adaptability and robustness for detecting shilling attacks.

Keywords:

recommendation systems; shilling attack; fake profile; cross-attention mechanism

1. Introduction

Recommender systems have become an indispensable component of modern online platforms, facilitating personalized content delivery and enhancing user engagement [1]. By analyzing historical interactions and user preferences, these systems effectively alleviate information overload and improve user satisfaction [2]. However, their widespread adoption has also made them a prime target for malicious activities, particularly shilling attacks, wherein attackers inject biased profiles into the system to manipulate recommendation outcomes [3,4]. Such attacks not only degrade the quality of recommendations but also undermine user trust and platform credibility [5].

The core of recommender systems lies in their ability to model user preferences and item characteristics, typically through collaborative filtering, content-based methods, or hybrid approaches [6,7]. While these systems excel at predicting user interests under normal conditions, they are inherently vulnerable to manipulation due to their open nature and reliance on user-contributed data [8]. Attackers exploit this vulnerability by injecting carefully crafted fake profiles, often designed to mimic genuine user behavior, in order to bias the system toward promoting or demoting specific items. This form of manipulation, known as a shilling attack or profile injection attack, poses a serious challenge to the integrity and reliability of recommender systems [9,10], especially in high-stakes domains such as e-commerce, social media, and content streaming platforms.

The core challenge stems from the dynamic and complementary nature of signals from different behavioral views. For example, a local rating pattern may indicate anomalies in score distribution, and a temporal sequence might reveal suspicious bursts of activity. However, strong evidence of an attack often comes from the correlation between a user’s structural position in the interaction graph and their sequential behavior [11]. Shallow fusion strategies fall short of modeling these complex, cross-view dependencies. They treat each view as a separate, static information stream and lack a mechanism to allow features from one view to dynamically guide the selection and weighting of relevant cues in another [12]. This results in a representation that is less discriminative and robust against strategically crafted attacks [13].

To mitigate the threat of shilling attacks, researchers have developed a variety of detection methods [14]. Early approaches primarily relied on statistical analysis and handcrafted behavioral features, such as rating deviation or entropy, to identify anomalies [15]. More recently, graph-based methods have been adopted to capture structural anomalies in user–item interaction networks, while sequence-based models leverage temporal patterns to detect suspicious behavior [16]. Despite these advances, most existing methods are limited by their reliance on a single view of user behavior—whether structural, temporal, or local—which restricts their ability to detect sophisticated and adaptive attacks [17,18]. Moreover, even multi-view approaches often integrate heterogeneous features in a shallow manner, such as simple concatenation or linear fusion, failing to capture the deep, nonlinear interactions between different behavioral perspectives [19].

To address the issue of attack detection in recommendation systems, we propose a framework based on a multi-view cross-attention mechanism, which is used to effectively extract and fuse features from different data sources to identify potential attack behaviors. The core idea of this framework is to enhance the model’s ability to recognize attack behaviors by constructing a user–item interaction graph and utilizing multiple feature extraction methods and cross-attention mechanisms. Our main contributions are summarized as follows:

We propose a multi-view feature fusion method. Unlike traditional methods that rely on a single data view, we systematically integrate three different feature views to comprehensively describe user behavior. By combining the user–item interaction graph, sequence information, and high-order connectivity, rich features are extracted to enhance the expressive ability of the model.
We design a dynamic and bidirectional cross-attention mechanism. Traditional multi-feature fusion methods often provide insufficient information interaction, whereas this mechanism allows for in-depth interaction between two information sources, thereby uncovering the intrinsic and complex nonlinear relationships between them.

Subsequently, we conducted extensive comparative experiments against representative shilling attack detection methods and evaluated the proposed approach on two benchmark datasets to validate its effectiveness.

2. Related Work

2.1. Shilling Attack Models

Shilling attacks, often referred to as profile injection attacks, involve the intentional creation of counterfeit user profiles that closely simulate authentic user behavior with the aim of manipulating recommendation algorithms. The primary objective of these attacks is to either enhance (push attacks) or diminish (nuke attacks) the visibility of specific target items by introducing biased ratings [20,21]. Attackers typically construct these profiles systematically, and each profile can be divided into four principal components:

Target Item Set (I_{T}): This set comprises the specific items that the attacker intends to either promote or demote. In the context of a push attack, the target items are assigned the maximum possible ratings to enhance their visibility and appeal. Conversely, in a nuke attack, these items receive the minimum ratings to diminish their perceived value and popularity.

Selected Item Set (I_{S}): This category includes items that are strategically chosen to augment the efficacy of the attack. For instance, in a bandwagon attack scenario, the selected items are typically those with the highest popularity, which are then rated favorably. This tactic is employed to bolster the credibility of the fake profile by aligning it with mainstream preferences.

Filler Item Set (I_{F}): The filler items are randomly selected and rated to simulate the behavior of a legitimate user, thereby enhancing the authenticity of the fake profile. The rating strategy for these items is contingent upon the specific attack methodology. For example, in a random attack, filler items may be assigned arbitrary ratings, whereas in an average attack, the ratings are calibrated to approximate the mean rating of the item, thereby maintaining a semblance of normal user behavior.

Unrated Item Set (I_{N}): This set consists of items that the attacker deliberately chooses not to rate. The rationale behind this decision is to avoid raising suspicions that could arise from an excessive number of ratings, which might otherwise signal the presence of a fabricated profile. By selectively omitting ratings, the attacker aims to maintain a low profile and reduce the likelihood of detection.

Several shilling attack models have been identified in the literature, each with its own strategy for selecting and rating items:

Random Attack: In this model, the filler items are randomly selected and assigned random ratings. The target items are given the highest or lowest ratings depending on whether the attack is a push or nuke attack.

Average Attack: Here, the filler items are randomly selected but are assigned ratings close to the average rating of the item. This makes the fake profile appear more realistic, as the ratings are consistent with the general user behavior.

Bandwagon Attack: This attack leverages the popularity of certain items. The selected items are the most popular ones, which are given the highest ratings. The filler items are randomly selected and assigned random ratings. This strategy is effective because popular items are more likely to be rated by genuine users, making the fake profile less suspicious.

The division and construction strategies of the different attack models are shown in detail in Table 1.
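The profile structure described above can be made concrete with a small sketch. The following Python code builds one fake rating vector for the random, average, and bandwagon models from a matrix of genuine ratings. The function name `build_attack_profile`, its parameters, the 1 to 5 rating scale, the convention that 0 means "unrated", and the rounding of item means in the average attack are all assumptions made for illustration; they are not taken from the paper or from Table 1.

```python
import numpy as np

def build_attack_profile(ratings, target_item, attack="random", push=True,
                         filler_size=3, selected_size=2,
                         r_min=1, r_max=5, seed=0):
    """Return one fake rating vector (0 = unrated). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings, dtype=float)
    n_items = ratings.shape[1]
    counts = (ratings > 0).sum(axis=0)               # item popularity
    item_mean = ratings.sum(axis=0) / np.maximum(counts, 1)

    profile = np.zeros(n_items)                      # I_N: unrated by default
    profile[target_item] = r_max if push else r_min  # I_T: extreme rating

    candidates = np.setdiff1d(np.arange(n_items), [target_item])
    if attack == "bandwagon":
        # I_S: the most popular items receive the highest rating.
        by_popularity = candidates[np.argsort(-counts[candidates], kind="stable")]
        selected = by_popularity[:selected_size]
        profile[selected] = r_max
        candidates = np.setdiff1d(candidates, selected)

    # I_F: randomly chosen filler items.
    filler = rng.choice(candidates, size=min(filler_size, candidates.size),
                        replace=False)
    if attack == "average":
        values = np.rint(item_mean[filler])          # close to the item mean (assumed rounding)
    else:
        values = rng.integers(r_min, r_max + 1, size=filler.size)  # random ratings
    profile[filler] = np.clip(values, r_min, r_max)
    return profile
```

In this sketch, the same function covers push and nuke variants through the `push` flag, so the three models differ only in how the selected and filler sets are rated.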

2.2. Detection of Shilling Attacks in Recommendation Systems

The existing research on shilling attack detection has evolved through several paradigms, which can be broadly categorized into methods based on statistical analysis, graph-based modeling, sequential pattern mining, and more recently, multi-view integration. Each category captures a distinct facet of user behavior yet frequently proves insufficient for delivering comprehensive security protection.

Statistical and Behavior-Based Methods: Early detection approaches primarily relied on identifying statistical anomalies in user rating patterns. For instance, Mehta et al. proposed PCA-Based Detector [22], which utilizes dimensionality reduction techniques such as principal component analysis, combined with prior knowledge and observed scoring data, to calculate the abnormal probability of users. Similarly, Yang et al. proposed BayesDetector [23] based on the Bayesian inference idea, which identifies abnormal users by fusing prior distributions with observed data. While these methods are interpretable and computationally efficient for simple attack models, they often struggle to capture the complex, nonlinear patterns characteristic of sophisticated attacks. Their performance is highly dependent on hand-crafted features and strong assumptions about the distribution of genuine and attack profiles, which limits their adaptability and generalization to evolving attack strategies.

Graph-Based Methods: With the success of graph neural networks, numerous studies have modeled user–item interactions as bipartite or similarity graphs to capture high-order structural dependencies for attack detection. For example, Zhang et al. proposed GraphRFI [24], which combines GCNs with random forests, achieving both robust recommendation and fraud detection. Simultaneously, Wu et al. proposed USG-SAD [25], which utilizes the comprehensive score correlation and bias to construct a user similarity map, and it then uses a GCN to learn user embeddings for classification. These methods excel at modeling the global structural relationships and connectivity patterns within the system, which are crucial for identifying users who form abnormal clusters or have suspicious link structures. However, a significant limitation is that they often treat the user–item graph as static, overlooking the rich temporal dynamics and evolution inherent in user interaction sequences. This temporal information can be a critical signal for identifying malicious behavior, such as burst-rating activities.

Sequence-Based Methods: To capture the temporal evolution of user behavior, sequence-based models have been introduced, such as DL-DRA [26] proposed by Zhou et al., which employs deep neural networks with dynamic linear dimensionality reduction to learn from rating sequences efficiently. Recurrent neural networks and their variants can also be used to model user interaction sequences, preserving historical context to detect anomalies that change over time. While these approaches are effective at learning sequential patterns and identifying unusual time intervals or rating sequences, they typically operate in isolation. They fail to leverage the complementary, global structural information encapsulated in the user–item graph, which can provide crucial context about a user's role in the entire network.

Methods Focusing on Local Patterns: Convolutional Neural Networks (CNNs) have been adapted to analyze local rating patterns or the aggregation of user and item features. For example, Zhou et al. proposed a robust recommendation-oriented malicious attack detection method [27], which integrates multiple CNN models through bagging and combines their judgments by majority voting to obtain more stable and comprehensive detection results. This method treats user and item feature vectors as one-dimensional signals and uses convolutional layers to extract salient, localized features that might be indicative of anomalous rating behavior. However, these CNN-based detectors often focus exclusively on this local view, failing to integrate the broader temporal and structural context of user behavior. This narrow focus limits their holistic understanding and makes them vulnerable to attacks designed to appear normal in local feature windows. While these works establish the effectiveness of CNNs for local pattern analysis, they typically employ the CNN as the sole feature extractor. In contrast, our framework positions the CNN as one of three complementary views, specifically responsible for capturing local rating anomalies, while delegating temporal dynamics to the GRU and global structure to the GCN. This division of labor is motivated by the heterogeneous nature of shilling attack signatures: no single architectural inductive bias suffices for comprehensive detection.

Multi-View and Hybrid Methods: Recognizing the limitations of single-view models, some recent works have attempted to integrate multiple perspectives. CoDetector [28], proposed by Dou et al., integrates matrix factorization and word embedding techniques to uncover latent features from both user–user and user–item co-occurrence matrices. DGA-MFCA, proposed by Xu et al. [29], detects attackers within groups by grouping users based on their characteristics and applying Gaussian-RBF analysis. However, their integration strategies have inherent architectural limitations. CoDetector and DGA-MFCA are better described as multi-stage pipelines than as deeply integrated multi-view learning systems. They handle different feature spaces separately and combine them at the decision level or through shallow statistical measures, which prevents the models from capturing complex, nonlinear interactions such as the relationship between a user's structural role and temporal behavior. The ITRN attempts to combine temporal and structural cues, but its fusion mechanism performs a shallow concatenation of independently learned representations before the final classifier. This static aggregation lacks the ability to dynamically adjust the importance of each view based on the context of the other views.

Although these methods have achieved relatively good results, they often rely on a single perspective of user behavior, such as rating patterns, temporal dynamics or graph structures, which limits their ability to detect complex and evolving attack strategies. Moreover, many multi-perspective detection frameworks integrate heterogeneous features in a shallow or static manner and are unable to capture the deep interrelationships between different behavioral perspectives. This gap prompts us to propose MVCA, which aims to systematically and deeply integrate the local, sequential and structural perspectives of user behavior through a dynamic cross-attention mechanism.

The proposed MVCA is designed as an end-to-end multi-view collaborative learning framework. Its core difference from prior work is a bidirectional cross-attention mechanism that systematically and deeply integrates the local, sequential, and structural perspectives of user behavior. Unlike shallow fusion, this mechanism allows for deep and dynamic information interaction between different perspectives. Specifically, features from the sequential perspective (GRU) can guide the aggregation of the structural perspective (GCN), and vice versa, enabling the model to discover latent correlations that are hard to detect when each perspective is processed alone. This architecture enables MVCA to handle complex and constantly evolving attack strategies more effectively.

3. The Proposed Method

In this section, we present our MVCA framework, whose overall architecture is shown in Figure 1. The framework consists of several attack feature extraction components: an interactive attention network built on RNNs and GCNs, and a feature extraction network built on CNNs. These components model user–item interactions at different scales, learn different patterns and regularities in how users interact with items, and together capture potential attack behavior patterns. Finally, the user and item features obtained from the multiple views are combined and fed into a multi-layer perceptron to obtain the final user classification result. The following sections describe each component in detail.

3.1. CNN-Based Feature Extraction

To extract local, representative features from the original user and item data, we employ a CNN as the local feature extraction module. CNNs are chosen for two reasons. First, they capture local correlations and provide translation invariance, which enables them to identify local abnormal patterns in user–item interactions, such as atypical rating combinations or densely clustered anomalous regions. Second, compared to fully connected networks, CNNs substantially reduce model complexity through parameter sharing and local connections, which enhances generalization. This module first concatenates the user vector (u) and item vector (i) to obtain the initial input feature matrix (X), which is then processed as follows. The input features are first passed through two convolutional layers to detect local correlations and anomalous patterns. This is formulated as:

C^{(l)} = ReLU(Conv1D(C^{(l-1)}))    (1)

where C^{(0)} = X_{i}, and Conv1D denotes a one-dimensional convolution operation.

Subsequently, a max-pooling layer is applied for down-sampling, reducing feature dimensionality while retaining the most salient information:

P = MaxPool(C^{(L_{c})})    (2)

This is followed by a global average pooling layer to aggregate the features into a compact representation. Finally, a fully connected layer transforms the pooled features into the final local feature representations:

F_{uc}, F_{ui} = FC(GlobalAvgPool(P))    (3)

Here, F_{uc} and F_{ui} denote the extracted local features for the user and item, respectively, which are utilized in the subsequent fusion and classification stages. The architecture of this module is shown in Figure 2.
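As a rough illustration of Equations (1)–(3), the following NumPy sketch chains two 1-D convolutions with ReLU, a max-pooling step, global average pooling, and a fully connected layer. The helper names (`conv1d`, `max_pool1d`, `local_features`), the kernel sizes, the pooling width, and the choice to return a single feature vector (rather than a separate F_{uc} and F_{ui}) are assumptions for illustration; the paper does not specify these details here.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution (cross-correlation, as in deep learning libraries).

    x: (c_in, length), w: (c_out, c_in, k), b: (c_out,)
    returns: (c_out, length - k + 1)
    """
    c_out, _, k = w.shape
    out_len = x.shape[1] - k + 1
    out = np.empty((c_out, out_len))
    for t in range(out_len):
        # Contract each filter with the current window over channels and taps.
        out[:, t] = np.tensordot(w, x[:, t:t + k], axes=([1, 2], [0, 1])) + b
    return out

def max_pool1d(x, size):
    """Non-overlapping max pooling along the last axis; a ragged tail is dropped."""
    length = (x.shape[1] // size) * size
    return x[:, :length].reshape(x.shape[0], -1, size).max(axis=2)

def local_features(u, i, params, pool=2):
    c = np.concatenate([u, i])[None, :]        # C^(0): one input channel
    for w, b in params["conv"]:                # Eq. (1), applied once per layer
        c = np.maximum(conv1d(c, w, b), 0.0)   # ReLU
    p = max_pool1d(c, pool)                    # Eq. (2)
    g = p.mean(axis=1)                         # global average pooling
    return params["fc_w"] @ g + params["fc_b"] # Eq. (3), single output (assumed)
```

Because the same filters slide over every position of the concatenated vector, the number of convolutional parameters is independent of the input length, which is the parameter-sharing property the section refers to.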

3.2. Sequential Feature Extraction

To capture the temporal dynamics and sequential dependencies inherent in user–item interactions, we employ a Gated Recurrent Unit (GRU) network to model the historical behavior sequences of users. The GRU network is chosen for its ability to effectively capture long-range dependencies while mitigating the vanishing gradient problem commonly encountered in traditional RNNs. We construct the interaction sequence for each user by sorting their historical ratings in chronological order. Each interaction at time step t is represented as a feature vector (x_{t}), which is formed by concatenating the user embedding, item embedding, and corresponding rating value:

x_{t} = Concat(u_{t}, i_{t}, r_{t})    (4)

where u_{t} and i_{t} denote the user and item embeddings at time t, and r_{t} represents the rating value. For users with fewer than T interactions, we apply zero-padding to maintain a fixed sequence length. The GRU processes the input sequence X = {x_{1}, x_{2}, …, x_{T}} through iterative updates of its hidden state. The update mechanism at each time step is represented as:

z_{t} = σ(W_{z} x_{t} + U_{z} h_{t-1} + b_{z})    (5)

r_{t} = σ(W_{r} x_{t} + U_{r} h_{t-1} + b_{r})    (6)

\tilde{h}_{t} = \tanh(W_{h} x_{t} + U_{h} (r_{t} ⊙ h_{t-1}) + b_{h})    (7)

h_{t} = (1 - z_{t}) ⊙ h_{t-1} + z_{t} ⊙ \tilde{h}_{t}    (8)

where z_{t} and r_{t} represent the update and reset gates, respectively (in Equations (6) and (7), r_{t} is the reset gate, not the rating value of Equation (4)), σ denotes the sigmoid activation function, and ⊙ indicates element-wise multiplication. The matrices W_{*}, U_{*} and bias vectors (b_{*}) are learnable parameters.

After processing the entire sequence, we extract the final hidden state as the sequential feature representation:

F_{s} = h_{T}    (9)

where F_{s} \in ℝ^{d_{h}} encapsulates the temporal behavior patterns of the user and serves as input to the subsequent modules for attack detection. The architecture of this module is shown in Figure 3.
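The recurrence in Equations (5)–(9) can be sketched directly in NumPy. The function name `gru_last_state`, the dictionary layout of the parameters, and the zero initial hidden state are assumptions made for illustration; the paper does not state how h_{0} is initialized.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_last_state(X, p):
    """Run a GRU over X of shape (T, d_in) and return the final hidden state.

    p holds the matrices Wz, Wr, Wh of shape (d_h, d_in), Uz, Ur, Uh of shape
    (d_h, d_h), and biases bz, br, bh of shape (d_h,). Hypothetical layout.
    """
    h = np.zeros(p["Uz"].shape[0])                      # h_0 = 0 (assumed)
    for x in X:
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])              # Eq. (5)
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])              # Eq. (6)
        h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])  # Eq. (7)
        h = (1.0 - z) * h + z * h_tilde                               # Eq. (8)
    return h                                            # F_s = h_T, Eq. (9)
```

Each row of `X` corresponds to one x_{t} from Equation (4); zero-padded rows for short sequences would simply be passed through the same loop.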

3.3. GCN-Based Feature Extraction

To capture the similarities among users, the correlations among items, and the complex high-level interaction patterns between users and items, we propose an interactive feature learner for the GCN based on the correlation between users and items. It can comprehensively detect anomalous behavior within complex interaction networks, and the model integrates both localized and global relational patterns. Its architecture is shown in Figure 4 and consists of three main components: the embedding layer provides the initial embedding, the embedding propagation layer simulates the high-order connectivity within the bipartite graph through a stack of enhanced graph convolutional network layers, and the aggregation layer connects the embeddings of different layers to acquire a fine-grained representation of interaction features.

The GCN feature extractor we proposed is divided into an embedding layer and an embedding propagation layer. In the embedding layer, the initial embedding vectors for users and items are generated. The embedding vector e_{u}^{(0)} \in ℝ^{d} (e_{i}^{(0)} \in ℝ^{d}) represents a user u (an item i), and d represents the embedding size. Equation (10) shows the embedding matrix (E), where N and M indicate the total counts of users and items, respectively. This embedding matrix maps each user and item to an initialized embedding representation:

E = [e_{u_{1}}^{(0)}, \dots, e_{u_{N}}^{(0)}, e_{i_{1}}^{(0)}, \dots, e_{i_{M}}^{(0)}]    (10)

Within the embedding propagation framework, multiple enhanced GCN modules are sequentially stacked atop the interaction graph topology to facilitate the integration of high-order collaborative signals, thereby iteratively refining latent representations for both users and items. The GCN operation within this propagation schema comprises two fundamental phases: message generation and neighborhood aggregation. Specifically, by cascading k such propagation layers, each node becomes receptive to information emanating from its k-hop neighborhood, enabling the capture of multi-hop relational patterns. At the k-th order of the GCN, the message m_{u \leftarrow i}^{(k)} passed from node i to node u is defined by Equation (11). Considering the characteristics of node u itself, we also add a self-connection to the node, and the self-connection message m_{u \leftarrow u}^{(k)} of u is shown in Equation (12):

m_{u \leftarrow i}^{(k)} = p_{ui} (W_{a}^{(k)} e_{i}^{(k-1)} + W_{b}^{(k)} (e_{i}^{(k-1)} ⊙ e_{u}^{(k-1)}))    (11)

m_{u \leftarrow u}^{(k)} = W_{a}^{(k)} e_{u}^{(k-1)}    (12)

where W_{a}^{(k)} \in ℝ^{d' \times d} and W_{b}^{(k)} \in ℝ^{d' \times d} are the learnable transformation parameters employed at the k-th propagation stage for distilling salient features during information diffusion, d signifies the dimensionality of the latent representations, and d′ denotes the transformation size. e_{i}^{(k-1)} and e_{u}^{(k-1)} are the item and user embedding representations generated from the previous message propagation stage, and they store information about their (k−1)-order neighbors. Our approach extends beyond conventional GCNs by explicitly modeling the direct interaction between the user (u) and item (i), rather than relying solely on neighborhood information aggregation. The element-wise product ⊙ incorporates the relationship between users and items into the message generation process. p_{ui} is set to the graph Laplacian norm, as shown in Equation (13), where N_{u} and N_{i} denote the k-order neighbors of the user (u) and item (i), respectively. This normalization term is applied to the adjacency matrix and adaptively modulates the weighting contributions from neighboring nodes, thereby mitigating potential bias caused by the dominance of certain information flows during multi-layer convolutional aggregation:

p_{ui} = \frac{1}{\sqrt{|N_{u}| |N_{i}|}}    (13)
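A small worked example may help with Equation (13). The sketch below computes p_{ui} for every observed user–item pair of a binary interaction matrix. It assumes that |N_{u}| and |N_{i}| are the numbers of direct (one-hop) neighbors, i.e., node degrees in the bipartite graph, which is the usual reading of this normalization; the function name `laplacian_norm` is hypothetical.

```python
import numpy as np

def laplacian_norm(R):
    """Return P with P[u, i] = 1 / sqrt(|N_u| * |N_i|) where R[u, i] > 0, else 0.

    R: (N, M) interaction matrix; any positive entry counts as an interaction.
    Degrees are taken as one-hop neighbor counts (an assumption of this sketch).
    """
    R = (np.asarray(R) > 0).astype(float)
    deg_u = np.maximum(R.sum(axis=1, keepdims=True), 1.0)   # |N_u|, guarded against 0
    deg_i = np.maximum(R.sum(axis=0, keepdims=True), 1.0)   # |N_i|, guarded against 0
    return R / np.sqrt(deg_u * deg_i)

# Two users, three items: user 0 rated items 0 and 1, user 1 rated item 1.
P = laplacian_norm([[1, 1, 0],
                    [0, 1, 0]])
```

For this toy matrix, the pair (user 0, item 1) has |N_{u}| = 2 and |N_{i}| = 2, so its coefficient is 1/2, whereas (user 0, item 0) has |N_{i}| = 1 and receives 1/\sqrt{2}. Messages from popular items are therefore down-weighted relative to messages from rarely rated items.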

During the message aggregation stage, the representation of node u is refined by incorporating messages propagated from its k-th-order neighbors. The specific function used for this aggregation is provided in Equation (14):

e_{u}^{(k)} = LeakyReLU(m_{u \leftarrow u}^{(k)} + \sum_{i \in N_{u}} m_{u \leftarrow i}^{(k)})    (14)

Motivated by the inherent sparsity of user–item interactions in recommendation systems, we incorporate the LeakyReLU activation function to introduce necessary nonlinearity into the model. To capture higher-order dependencies, we utilize a k-layer GCN architecture. This design propagates and aggregates information across k hops, ultimately yielding the k-th-order representations for both user (e_{u}^{(k)}) and item (e_{i}^{(k)}):

e_{i}^{(k)} = LeakyReLU(m_{i \leftarrow i}^{(k)} + \sum_{u \in N_{i}} m_{i \leftarrow u}^{(k)})    (15)

In the aggregation layers, in contrast to standard GCNs that draw upon representations from only a single layer (typically either early or final), our approach integrates features across all layers to capture information at heterogeneous granularities. From the perspective of system architecture, the lower layer of the model is responsible for encoding the real-time interaction signals of users at the micro-level, while the upper layer is used to extract trends and patterns at the macro-level, including dynamically evolving user preferences and unconventional network structure features. By directly concatenating the vectors output by each layer, the system achieves the effective fusion of information of different granularities, and we obtain user embeddings (e_{u}^{*}) and item embeddings (e_{i}^{*}) that encode both local and global interaction information, as shown in Equations (16) and (17):

e_{u}^{*} = e_{u}^{(0)} ‖ \dots ‖ e_{u}^{(k)}    (16)

e_{i}^{*} = e_{i}^{(0)} ‖ \dots ‖ e_{i}^{(k)}    (17)

Algorithm 1 outlines the implementation of the interactive feature learner based on the GCN.

Then, the user embedding (e_{u}^{*}) and the item embedding (e_{i}^{*}) are input into the CNN to obtain the user–item behavior interaction feature (F_{b}).

Algorithm 1. GCN-based interaction feature learner

Input: User–item interaction graph G, user set U, item set I, embedding propagation depth K, GCN network parameters \{W_{a}^{(1)}, W_{a}^{(2)}, \dots, W_{a}^{(K)}, W_{b}^{(1)}, W_{b}^{(2)}, \dots, W_{b}^{(K)}\}, initialized user embedding e_{u}^{(0)}, initialized item embedding e_{i}^{(0)}

Output: Final user and item embedding representations e_{u}^{*} and e_{i}^{*}

e_{u}^{*} = e_{u}^{(0)}
e_{i}^{*} = e_{i}^{(0)}
for k = 1 to K do
    for u ∈ U do
        m_{u \leftarrow u}^{(k)} = W_{a}^{(k)} e_{u}^{(k-1)}
        for i ∈ N_{u} do
            m_{u \leftarrow i}^{(k)} = p_{ui} (W_{a}^{(k)} e_{i}^{(k-1)} + W_{b}^{(k)} (e_{i}^{(k-1)} ⊙ e_{u}^{(k-1)}))
        end for
        e_{u}^{(k)} = LeakyReLU(m_{u \leftarrow u}^{(k)} + \sum_{i \in N_{u}} m_{u \leftarrow i}^{(k)})
    end for
    for i ∈ I do
        m_{i \leftarrow i}^{(k)} = W_{a}^{(k)} e_{i}^{(k-1)}
        for u ∈ N_{i} do
            m_{i \leftarrow u}^{(k)} = p_{ui} (W_{a}^{(k)} e_{u}^{(k-1)} + W_{b}^{(k)} (e_{u}^{(k-1)} ⊙ e_{i}^{(k-1)}))
        end for
        e_{i}^{(k)} = LeakyReLU(m_{i \leftarrow i}^{(k)} + \sum_{u \in N_{i}} m_{i \leftarrow u}^{(k)})
    end for
    e_{u}^{*} = e_{u}^{*} ‖ e_{u}^{(k)}
    e_{i}^{*} = e_{i}^{*} ‖ e_{i}^{(k)}
end for
return e_{u}^{*}, e_{i}^{*}
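Algorithm 1 can also be written in vectorized form. The NumPy sketch below follows Equations (11)–(17) under a few stated assumptions: the graph is given as a binary matrix R of shape (N, M), neighbor sets are one-hop neighbors, and the LeakyReLU slope is 0.2 (the paper does not give a value). The function names `leaky_relu` and `propagate` are hypothetical. The sketch relies on the identity \sum_{i} p_{ui} (e_{i} ⊙ e_{u}) = (\sum_{i} p_{ui} e_{i}) ⊙ e_{u}, which holds because e_{u} does not depend on i.

```python
import numpy as np

def leaky_relu(x, slope=0.2):        # slope is an assumed value
    return np.where(x > 0, x, slope * x)

def propagate(R, E_u, E_i, Wa, Wb):
    """Layer-wise embedding propagation with concatenation over all layers.

    R: (N, M) interactions; E_u: (N, d); E_i: (M, d);
    Wa, Wb: lists of K matrices, each of shape (d_out, d_in).
    Returns e_u^* and e_i^* as row-wise concatenations of all layer outputs.
    """
    R = (np.asarray(R) > 0).astype(float)
    deg_u = np.maximum(R.sum(axis=1, keepdims=True), 1.0)
    deg_i = np.maximum(R.sum(axis=0, keepdims=True), 1.0)
    P = R / np.sqrt(deg_u * deg_i)                  # p_ui, Eq. (13)

    users, items = [E_u], [E_i]
    for W_a, W_b in zip(Wa, Wb):
        eu, ei = users[-1], items[-1]               # layer k-1 embeddings
        agg_u = P @ ei                              # sum_i p_ui e_i, per user
        agg_i = P.T @ eu                            # sum_u p_ui e_u, per item
        # Self message (Eq. 12) plus neighbor messages (Eq. 11), then Eq. (14)/(15).
        new_u = leaky_relu(eu @ W_a.T + agg_u @ W_a.T + (agg_u * eu) @ W_b.T)
        new_i = leaky_relu(ei @ W_a.T + agg_i @ W_a.T + (agg_i * ei) @ W_b.T)
        users.append(new_u)
        items.append(new_i)
    # Eqs. (16) and (17): concatenate layers 0..K.
    return np.concatenate(users, axis=1), np.concatenate(items, axis=1)
```

Both user and item embeddings are updated inside the same layer loop, so that layer k always reads the layer k−1 embeddings of the opposite node type.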

3.4. Interactive Cross-Attention Module

To enable the deep and dynamic fusion of these views, we design a bidirectional cross-attention module that facilitates interactive information exchange across feature representations. This module allows the model to adaptively highlight task-relevant cues from each view in relation to the others, thereby generating discriminative representations tailored for attack detection. The entire framework is trained end to end under a unified cross-entropy loss, ensuring that all feature extractors are co-optimized toward the ultimate goal of accurately classifying malicious users. On this basis, the interactive attention mechanism can assign different weights to the features extracted by the GCN and GRU according to different task requirements. Through the interactive attention mechanism, the model can selectively obtain key information from the features extracted by the GCN and GRU, effectively integrating structural information and sequence information. This enables the model to comprehensively utilize these two types of complementary information, thereby gaining a more comprehensive understanding of user behavior and item features.

First, the user–item behavior interaction feature (F_{b}) extracted by the GCN and the user–item sequence feature (F_{s}) extracted by the GRU are each passed through a self-attention mechanism, which captures the dependency relationships among their elements and enhances their internal representational capabilities. The sequence feature is F_{s} = [f_{s1}, f_{s2}, \dots, f_{sn}] \in ℝ^{n \times d_{s}}, where n denotes the sequence length and d_{s} denotes the feature dimension. Similarly, the behavior interaction feature is F_{b} = [f_{b1}, f_{b2}, \dots, f_{bm}] \in ℝ^{m \times d_{b}}, where m represents the number of related nodes in the graph, and d_{b} is the feature dimension.

For the sequence feature (F_{s}):

Q_{s} = F_{s} W_{q}^{s}, K_{s} = F_{s} W_{k}^{s}, V_{s} = F_{s} W_{v}^{s}

(18)

F_{s}^{self} = softmax (\frac{Q_{s} K_{s}^{T}}{\sqrt{d_{k}}}) V_{s}

(19)

For the behavioral interaction feature (F_{b}):

Q_{b} = F_{b} W_{q}^{b}, K_{b} = F_{b} W_{k}^{b}, V_{b} = F_{b} W_{v}^{b}

(20)

F_{b}^{self} = softmax (\frac{Q_{b} K_{b}^{T}}{\sqrt{d_{k}}}) V_{b}

(21)

where W_{q}^{s}, W_{k}^{s}, W_{v}^{s}, W_{q}^{b}, W_{k}^{b} and W_{v}^{b} denote the learnable parameters that map the original feature matrices to the query, key, and value matrices, respectively, and d_{k} is the dimension of the key vectors. F_{b}^{self} and F_{s}^{self} denote the self-attended versions of the behavior interaction feature (F_{b}) extracted by the GCN and the sequence feature (F_{s}) extracted by the GRU, respectively. Then, cross-attention is used to capture the interaction relationship between F_{s} and F_{b}, which can be defined as:

Q_{s b} = F_{s}^{self} W_{q}^{s b}, K_{s b} = F_{b}^{self} W_{k}^{s b}, V_{s b} = F_{b}^{self} W_{v}^{s b}

(22)

F_{s b} = softmax (\frac{Q_{s b} K_{s b}^{T}}{\sqrt{d_{s b}}}) V_{s b}

(23)

where F_{s}^{self} provides the query and F_{b}^{self} provides the key and value. The query (Q_{s b}) is generated from F_{s}^{self} through the weight matrix W_{q}^{s b} \in ℝ^{d_{s} \times d_{s b}}, and the key (K_{s b}) and value (V_{s b}) are generated from F_{b}^{self} through W_{k}^{s b} \in ℝ^{d_{b} \times d_{s b}} and W_{v}^{s b}, respectively.

In the same way, the cross-attention feature centered on the item (F_{b s}) is obtained:

Q_{b s} = F_{b}^{self} W_{q}^{b s}, K_{b s} = F_{s}^{self} W_{k}^{b s}, V_{b s} = F_{s}^{self} W_{v}^{b s}

(24)

F_{b s} = softmax (\frac{Q_{b s} K_{b s}^{T}}{\sqrt{d_{b s}}}) V_{b s}

(25)

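where d_{s b} denotes the interaction feature dimension (d_{b s} plays the same role in the reverse direction), and W_{v}^{s b} \in ℝ^{d_{b} \times d_{s b}} projects F_{b}^{self} to the values that are weighted by the attention score matrix. The first direction, Equations (22) and (23), thus yields the cross-attention feature centered on the user (F_{s b}).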

In this way, two cross-attention outputs are ultimately obtained: feature F_{s b}, centered on the sequence perspective and integrating structural information, and feature F_{b s}, centered on the structural perspective and integrating sequence information. These two features are used as the input to the multi-view fusion for the final attack detection classification.
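The data flow of Equations (18)–(25) can be sketched in plain Python as follows. This is a minimal single-head sketch, not the authors' implementation: the function names and the `params` dictionary layout are hypothetical, matrices are nested lists, and each scaling factor is taken to be the column dimension of the corresponding key matrix.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(X_q, X_kv, W_q, W_k, W_v):
    """Scaled dot-product attention; X_q is X_kv for self-attention."""
    Q, K, V = matmul(X_q, W_q), matmul(X_kv, W_k), matmul(X_kv, W_v)
    scale = math.sqrt(len(K[0]))
    scores = [[s / scale for s in row] for row in matmul(Q, transpose(K))]
    return matmul(softmax_rows(scores), V)

def bidirectional_cross_attention(F_s, F_b, params):
    """params maps 's', 'b', 'sb', 'bs' to (W_q, W_k, W_v) triples."""
    F_s_self = attention(F_s, F_s, *params["s"])          # Eqs. (18)-(19)
    F_b_self = attention(F_b, F_b, *params["b"])          # Eqs. (20)-(21)
    F_sb = attention(F_s_self, F_b_self, *params["sb"])   # Eqs. (22)-(23)
    F_bs = attention(F_b_self, F_s_self, *params["bs"])   # Eqs. (24)-(25)
    return F_sb, F_bs
```

With F_{s} of shape n × d_{s} and F_{b} of shape m × d_{b}, F_{s b} has n rows (one per sequence position) and F_{b s} has m rows (one per graph node), which reflects the sequence-centered and structure-centered roles described above.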

3.5. Feature Fusion for Attack Detection

After obtaining features from various sources, the local features of the CNN and the associated features of the interactive attention are concatenated by dimensions to form a multi-source feature joint matrix, which can be defined as:

F_{fusion} = F_{u c} \oplus F_{u i} \oplus F_{s b} \oplus F_{b s}

(26)

where ⊕ denotes the concatenation of the features; F_{u i} and F_{u c} denote the local features of the user and item processed by the CNN; and F_{s b} and F_{b s} denote the sequence-centered and structure-centered features produced by the cross-attention mechanism, respectively.

To further enhance the representation ability and reduce feature redundancy, we adopt a linear projection layer to map the fused features into a shared latent space and then input the result into a multi-layer perceptron to predict the attack classification. The loss function is the cross-entropy loss. The prediction and the loss can be defined as:

\hat{y} = softmax (W_{o} \cdot F_{fusion} + b_{o})

(27)

L = - \frac{1}{N} \sum_{i = 1}^{N} [y_{i} \log ({\hat{y}}_{i}) + (1 - y_{i}) \log (1 - {\hat{y}}_{i})]

(28)

where W_{o} and b_{o} denote the learnable parameters used to output a probability distribution of behavior for each user, {\hat{y}}_{i} is the predicted probability for the i-th user, y_{i} = 1 if the user is a genuine user and y_{i} = 0 otherwise, and N denotes the number of training users.
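Equations (26)–(28) can be illustrated with the short Python sketch below. It is a simplified sketch under stated assumptions: each of the four features is assumed to have already been pooled to a fixed-length vector, the intermediate projection layer and multi-layer perceptron are collapsed into the single linear map of Equation (27), and the function names are hypothetical.

```python
import math

def softmax(z):
    """Softmax over a list of logits, with max subtraction for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(F_uc, F_ui, F_sb, F_bs, W_o, b_o):
    """Eqs. (26)-(27): concatenate the features, then linear layer + softmax."""
    fused = F_uc + F_ui + F_sb + F_bs  # list concatenation plays the role of ⊕
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(W_o, b_o)]
    return softmax(logits)

def bce_loss(y_true, y_prob, eps=1e-12):
    """Eq. (28): mean binary cross-entropy over N users.

    y_prob[i] is the predicted probability that user i has label 1.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

For example, a classifier that outputs 0.5 for every user yields a loss of ln 2 ≈ 0.693 regardless of the labels, which is a convenient reference point when reading training curves.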

3.6. Complexity Analysis

To comprehensively evaluate the computational efficiency of the MVCA framework, this section analyzes the complexity and time consumption of the model algorithm. The main parameters are defined as follows: the total number of users (U), the total number of items (I), the number of interaction edges between users and items (E), the embedding dimension (d), the number of GCN propagation layers (K), the average length (L) of the user’s historical behavior sequence, and the number of heads in the multi-head cross-attention (H). The complexity analysis of each module is as follows:

The local feature extraction module performs one-dimensional convolution on the embeddings of users and items. Let the size of the convolution kernel be k and the number of filters be F. Then, the complexity of processing all users and items is approximately O((U + I)·k·F·d).

The sequence feature extraction module uses the GRU to model each user's behavioral sequence of average length L. The complexity of the GRU for one user is O(L·d²), and the total complexity over all users is O(U·L·d²).

The structure feature extraction module performs message passing and aggregation on the user–item bipartite graph through K-layer graph convolution. Each layer needs to traverse all interaction edges for neighbor aggregation, with a complexity of O(E·d); and it also performs linear transformation, with a complexity of O((U + I)·d²). Therefore, the total complexity of the GCN module is O(K·E·d + K·(U + I)·d²).

The cross-attention module conducts bidirectional interaction on the features extracted by the GCN and GRU. First, the sequence features (with dimension L × d) and the graph node features (with dimension |N| × d, where |N| is the average number of nodes related to the user, approximately E/U) are subjected to self-attention calculations, with complexities of O(L²·d) and O((E/U)²·d), respectively. Then, the mutual attention calculation is carried out, with a complexity of O(L·(E/U)·d). With the multi-head mechanism, the complexity is multiplied by the number of heads (H), where H is a constant. Therefore, the total complexity of the cross-attention module can be expressed as O(H·(L² + (E/U)² + L·(E/U))·d).

Therefore, the overall complexity of the proposed approach is O(K·E·d + K·(U + I)·d² + U·L·d² + (L² + (E/U)² + L·E/U)·d). In practical recommendation scenarios, the number of interaction edges (E) is usually much larger than the number of users and items (E ≫ U + I), while the number of propagation layers (K), the sequence length (L), and the embedding dimension (d) are all constants, so the overall cost grows quasi-linearly with the data scale. This indicates that MVCA has good scalability and can meet the detection requirements of large-scale recommendation systems. The theoretical analysis is consistent with the running times measured subsequently, which are roughly linear in the data volume, supporting the feasibility of the algorithm in practical deployment.
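To make the scaling behavior concrete, the per-module terms above can be evaluated numerically. The following Python sketch is only an operation-count estimate that ignores constant factors; the function name and all parameter values used with it are hypothetical.

```python
def mvca_cost_terms(U, I, E, d, K, L, H, k, F):
    """Evaluate the leading-order cost term of each module.

    U, I, E: numbers of users, items, and interaction edges.
    d: embedding dimension; K: GCN layers; L: average sequence length.
    H: attention heads; k: convolution kernel size; F: number of filters.
    """
    nodes_per_user = E / U  # |N|, the average number of related nodes
    return {
        "cnn": (U + I) * k * F * d,
        "gru": U * L * d ** 2,
        "gcn": K * E * d + K * (U + I) * d ** 2,
        # Attention term exactly as stated in the text.
        "attention": H * (L ** 2 + nodes_per_user ** 2
                          + L * nodes_per_user) * d,
    }
```

Doubling U, I, and E together doubles the CNN, GRU, and GCN terms while leaving E/U, and hence the attention term, unchanged, which matches the quasi-linear growth described above.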

4. Experiment Results and Analysis

4.1. Experimental Datasets

In our experiments, we used two common public datasets, MovieLens 10M [30] and Netflix [31], both of which contain genuine user profiles. Each record represents the attribute information corresponding to one click behavior. The original data include a user id, an item id, user features, item features, a timestamp, and the label of each record. The attack profiles for both datasets were generated by attack models. The details of each dataset are as follows:

MovieLens 10M: This dataset contains 1120 users, 10,657 movies and 199,040 ratings. Each user has at least 20 ratings. The rating values are between 1 and 5. The MovieLens dataset does not include any attacker profiles. For the purposes of this research, without loss of generality, fake profiles generated by attack models were injected into the dataset.

The Netflix dataset is a competition dataset covering 17,770 movies and 480,189 users, in which each record is a quadruple of the user ID, item ID, rating and timestamp. All rating values are integers between 1 and 5 and were collected between October 1998 and December 2005. We randomly selected 215,884 ratings from 2000 users for 4000 movies as the experimental dataset. For the purposes of this study, without loss of generality, fake profiles generated by attack models were injected into the dataset together with the corresponding labels.

In the attack detection phase, we split the original training set into a new training set and a validation set with a ratio of 8:2; the test set provided with the dataset was used for testing. We then injected attack profiles with the required attack sizes and filler sizes, producing datasets that cover different attack scales and filler scales. In addition, to ensure the stability and reliability of the results and to reduce the influence of random factors, we drew random samples without replacement from the MovieLens 10M dataset for each setting. All reported results are the average of five independent runs, and the standard deviations of the key metrics are also indicated to show the stability of the results.
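The evaluation protocol above (an 8:2 split followed by the mean and standard deviation over repeated runs) can be sketched as follows. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and any scores passed to `summarize_runs` in practice would come from real runs.

```python
import random
import statistics

def split_train_val(samples, val_ratio=0.2, seed=0):
    """Shuffle once and split into (train, validation) at an 8:2 ratio."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]

def summarize_runs(scores):
    """Return the mean and sample standard deviation of a metric over runs."""
    return statistics.mean(scores), statistics.stdev(scores)
```

Using a different `seed` for each of the five runs gives independent splits, and `summarize_runs` then provides the mean and standard deviation of each metric.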

4.2. Evaluation Metrics

We adopt the precision, recall, and F1-measure metrics to assess the performance of MVCA, defined as follows:

Precision = \frac{TP}{TP + FP}

(29)

Recall = \frac{TP}{TP + FN}

(30)

F1\text{-}measure = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

(31)

where TP denotes the number of attack users correctly identified, FN represents the number of attack users incorrectly classified as genuine users, and FP indicates the number of genuine users mistakenly classified as attack users.
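Equations (29)–(31) translate directly into code. The sketch below treats attack users as the positive class, as in the definitions of TP, FN, and FP above; the function name and the string labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive="attack"):
    """Compute precision, recall, and F1 with attack users as positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For instance, with three attack users of which two are detected and one genuine user wrongly flagged, TP = 2, FN = 1, and FP = 1, so precision, recall, and F1 all equal 2/3.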

4.3. Experiment Results and Analysis

To verify the validity of our method, we used the following works as comparative methods in our experiments:

(1): CoDetector [28]: The fundamental concept of this model is to integrate matrix factorization and word embedding techniques in order to uncover the latent features of users through both the user–user co-occurrence matrix and the user–item co-occurrence matrix. This approach facilitates the detection of attack behaviors within recommender systems.
(2): CNN-LSTM [32]: This model employs a hybrid CNN-LSTM deep learning model. It automatically extracts deep features from user ratings through CNNs and combines LSTM to learn sequence dependencies in order to accurately detect shilling attacks in recommendation systems.
(3): DGA-MFCA [29]: DGA-MFCA is a shilling attack detection approach. Users are first grouped based on their characteristics. Then, attackers are detected within these groups using Gaussian-RBF analysis of behavior patterns.
(4): USG-SAD [25]: This supervised learning method constructs a user relational graph by computing user similarity through an integrated measure of rating-behavior correlation and deviation. User embeddings are subsequently generated using the Node2Vec algorithm. For attack detection, the model leverages a graph convolutional network (GCN) designed to operate on the user similarity graph, where it learns to assign importance weights to users for classification.
(5): GraphRFI [23]: This model combines the GCN and NRF to realize recommendation and fraud detection in the recommendation system through graph structure modeling and random forest classification.
(6): ITRN [32]: This method divides the item scoring time series through key points, uses second-order difference to construct a cube for anomaly interval detection, and builds a bipartite graph of suspicious users–items. Combined with LightGCN to learn high-order neighbor features, it finally determines whether the user is an attacker through the linear layer and the Sigmoid function.

Figure 5 compares the detection performance of MVCA with that of the six baseline methods (CoDetector, CNN-LSTM, DGA-MFCA, USG-SAD, GraphRFI and ITRN) on the Netflix dataset.

As shown in Figure 5, MVCA detects attacking users in recommendation systems more effectively than the baselines, which demonstrates the effectiveness of the multi-view cross-attention framework for shilling attack detection. The MVCA framework integrates three complementary views: the local rating patterns extracted by the CNN, the temporal behavior sequence modeled by the GRU, and the high-order graph structure captured by the GCN. Its bidirectional cross-attention mechanism achieves deep information interaction and dynamic weight distribution among views, forming goal-oriented collaborative optimization in end-to-end training and thereby comprehensively depicting the multi-dimensional characteristics of attack behaviors. In contrast, CoDetector and DGA-MFCA, which rely on matrix factorization or statistical features, lack the ability to model temporal dynamics and structural correlations. Although CNN-LSTM can extract deep features, it ignores the high-order collaborative relationships among users. Pure graph methods such as USG-SAD and GraphRFI fail to utilize temporal bursts and local anomaly signals, and ITRN adopts only shallow concatenation and fusion and is unable to capture nonlinear dependencies across views. Because they are limited to a single perspective or to static fusion, the baseline methods have difficulty simultaneously identifying the local anomalies, temporal aggregation, and structural concealment of attacks. MVCA, however, adaptively correlates the key clues of different views through the attention mechanism, which enhances the discriminative power and robustness of detection and yields better generalization in complex attack scenarios.

Given that the objective of a recommendation attack is to raise or lower the rating of a target item and thereby increase or decrease how often it is recommended, we used the MovieLens 10M dataset and adopted three commonly used attack models, namely, the random attack, the average attack, and the popularity attack, to generate the attack data, because these models require little knowledge of the recommendation system. We injected the attack data into the system to simulate the shilling attack process. In contrast to standard shilling attack data, our setting also requires a trust relationship for each attacker, and newly injected attack users have no trust relationship data. As with the choice of the attacker's profile size, we therefore set the size of each attacker's trust list to the average length of the trust lists in the original system. To simulate real users and produce stronger attacks, we selected each attacker's trustees at random from users with high trust values.

To comprehensively evaluate the performance and impact of the attack methods under different intensities, we set the filler sizes of the three simulated shilling attacks to 3%, 5%, 7%, 10% and 15%, paired with attack sizes of 1%, 5%, 7%, 10% and 10%, respectively. The results of the experiment are shown in Table 2, Table 3 and Table 4.
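For readers unfamiliar with these attack models, the sketch below generates fake profiles from an attack size and a filler size. It follows the common textbook definitions of the random, average, and popularity attacks as push attacks and is an assumption-laden illustration, not the generator used in these experiments: the function name, the Gaussian noise level `sigma`, the number of popular items `n_popular`, and the rating bounds are all hypothetical, and trust relationships are not modeled.

```python
import random

def make_attack_profiles(genuine, n_items, attack_size, filler_size, target,
                         kind="random", n_popular=3, r_min=1, r_max=5,
                         sigma=1.0, seed=0):
    """Generate fake profiles; genuine maps user -> {item: rating}."""
    rng = random.Random(seed)
    sums, counts = {}, {}
    for profile in genuine.values():
        for item, r in profile.items():
            sums[item] = sums.get(item, 0.0) + r
            counts[item] = counts.get(item, 0) + 1
    global_mean = sum(sums.values()) / sum(counts.values())
    n_fake = max(1, round(attack_size * len(genuine)))  # attack size
    n_fill = max(1, round(filler_size * n_items))       # filler size
    candidates = [i for i in range(n_items) if i != target]
    popular = sorted(counts, key=counts.get, reverse=True)

    def clip(r):
        return min(r_max, max(r_min, round(r)))

    fakes = []
    for _ in range(n_fake):
        profile = {}
        for item in rng.sample(candidates, min(n_fill, len(candidates))):
            if kind == "average" and item in counts:
                mean = sums[item] / counts[item]  # per-item mean
            else:
                mean = global_mean                # system-wide mean
            profile[item] = clip(rng.gauss(mean, sigma))
        if kind == "popularity":
            # Assumed form: also rate the most-rated items at the maximum.
            for item in popular[:n_popular]:
                if item != target:
                    profile[item] = r_max
        profile[target] = r_max  # push attack on the target item
        fakes.append(profile)
    return fakes
```

With 100 genuine users and 20 items, an attack size of 5% and a filler size of 10% yield five fake profiles, each rating two filler items plus the target item.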

As can be seen from Table 2, Table 3 and Table 4, MVCA maintained stable and strong performance across different attack sizes and filler sizes under the average, random and popularity attacks. Moreover, as the attack size and filler size increased, the precision, recall and F1 score generally showed an upward trend. For instance, when the attack size was 15% and the filler size was 10%, the F1 scores of all attack types exceeded 0.94, reaching as high as 0.9783 under the popularity attack. This indicates that the model detects attacks more effectively as the scale of the attack expands: a larger attack makes the attack behavior more obvious, so the multi-view feature extraction and bidirectional cross-attention mechanism can more easily fuse the information of different views and capture the attack features more comprehensively. Furthermore, the model remains robust against the various attack types, which supports its ability to adapt to complex attack strategies by deeply integrating the structural, temporal, and local views.

To further demonstrate the robustness of MVCA, we compared it thoroughly with the six baseline methods in different scenarios using the MovieLens dataset. The experimental results are shown in Figure 6, Figure 7 and Figure 8.

As shown in Figure 6, MVCA achieves superior performance in detecting random attacks. While a baseline such as CNN-LSTM can extract deep features and learn sequential dependencies, it fails to model the high-order relationships within the user–item interaction graph and consequently cannot capture the collaborative patterns formed by attackers. USG-SAD and GraphRFI, which are pure graph methods, can identify structural anomalies, but they ignore bursts of ratings in the time dimension and have difficulty detecting attacks that are distributed over time. The success of MVCA can be explained as follows: when random attackers select filler items at random and assign random ratings, the CNN module detects the abnormality of the local rating distribution, the GRU module captures the bursty rating behavior in the time series, and the GCN module discovers the implicit connections between attackers through high-order neighbor aggregation. The cross-attention mechanism dynamically fuses these three complementary signals to form a more robust basis for discrimination than single-view methods. Failure cases mainly occur at the smallest attack size (1%) combined with a low filler size (3%). Here the attack signal is too weak for the three views to extract sufficient discriminative information, a signal-to-noise problem that methods based on statistical features, such as CoDetector and DGA-MFCA, also face under sparse attacks. The limitation of MVCA is particularly evident in this situation: when the number of attackers is very small and the rating behavior is highly dispersed, the cross-attention mechanism lacks sufficient evidence to establish effective associations between views, the attention weights tend toward a uniform distribution, and the model degenerates into simple feature concatenation, resulting in a significant decline in discriminative ability.

According to Figure 7, in the detection of popularity (bandwagon) attacks, MVCA demonstrates the highest robustness, maintaining an F1-score above 0.92 across all attack scales and significantly outperforming methods such as ITRN and CNN-LSTM. Although ITRN locates abnormal intervals by using second-order differences to build a time cube, its shallow concatenation fusion strategy fails to establish a nonlinear correlation between temporal features and graph structure features, resulting in a large number of false negatives when attackers imitate normal users by giving high ratings to popular items. CNN-LSTM, in contrast, lacks a global structural perspective and thus has difficulty identifying the group behavior patterns in which attackers deliberately establish connections with popular items to enhance credibility. The key to the success of MVCA lies in the following: in the GCN view, the popularity attackers form a clear structural aggregation; in the GRU view, they show temporal synchrony; and in the CNN view, they present specific local rating patterns. The cross-attention mechanism correlates these cross-view cues through bidirectional interaction; for example, when the GCN detects a user group that is closely connected in the graph structure, the attention mechanism enhances the temporal anomaly weight of this group in the GRU view. This dynamic collaboration is something that simple feature concatenation methods (such as DGA-MFCA) cannot achieve.

However, MVCA has specific limitations under such attacks. When attackers adopt a "slow popularity attack" strategy, that is, deliberately extending the attack time window, dispersing the rating moments to avoid the temporal detection of the GRU, and controlling the number of connections with popular items to reduce their structural significance in the GCN, it is difficult for the cross-attention mechanism to establish strong correlations between views, and the model may mistake the attackers for real users with normal preferences for popular items. Moreover, if the system itself has a large number of real enthusiasts of popular items, their structural aggregation and temporal activity are highly similar to those of attackers, and MVCA faces a higher risk of false positives.

According to Figure 8, in the average attack detection scenario, MVCA faces the greatest challenge, but it still outperforms all baseline methods overall. The underlying reason for this challenge lies in the "perfect camouflage" characteristic of the average attack: attackers set the filler item ratings close to the item average, making detection methods based on statistical anomalies (such as matrix factorization in CoDetector and Bayesian inference in BayesDetector) almost ineffective; at the same time, the decentralized time strategy weakens the detection capabilities of pure sequence models (such as CNN-LSTM); and imitating the rating patterns of normal users also makes it difficult for CNN methods that rely solely on local features to distinguish attackers. In this scenario, the advantage of MVCA over other methods mainly relies on the high-order structural information of the GCN view: when the attack scale exceeds 5%, the attackers act in concert on the target item to strengthen the effect, thereby generating detectable community structures in the user–item interaction graph. The cross-attention mechanism can then automatically reduce its reliance on the CNN and GRU views and increase the weight of the GCN view, achieving adaptive feature selection. In contrast, USG-SAD, although it uses a GCN, lacks deep interaction with sequence features and cannot utilize time information to help verify structural anomalies; GraphRFI combines the GCN with random forests, but its static cascading design limits the information flow between views. The failure cases of MVCA are concentrated in small-scale and low-filler-size combinations: the rating behavior of average attackers is highly similar to that of real users, and when the number of attackers is small, it is difficult for them to form significant clusters in the graph structure, so the discriminative ability of all three views is limited at the same time.

Here, the fundamental limitation of MVCA is clearly exposed. When an attacker carefully evades the detection of all three views simultaneously (that is, with a normal local rating distribution, dispersed temporal behavior, and sparse graph connections), the cross-attention mechanism breaks down because no single view can provide reliable query clues to guide the feature selection of the other views. In this case, even increasing the number of attention heads or adjusting the network depth can hardly overcome the discriminative limit at the information-theoretic level.

Overall, the advantage of MVCA lies in multi-view fusion. However, when a certain type of attack causes a sharp decline in the discriminative power of one or two views, the model performance is compromised. Although the cross-attention mechanism attempts to dynamically adjust the weights, the absence or weakening of a core information source inevitably makes the model less sensitive to this type of attack than to other types, thereby causing greater performance fluctuations. For instance, in average attacks, attackers greatly downplay the anomalies in local rating patterns and temporal behaviors by imitating the average ratings of normal users, so the model mainly relies on the GCN view to discover the "graph structure evidence" of collaborative attacks. Only when the scale of the attack reaches a certain level and the attackers form a sufficiently tight and identifiable cluster in the graph structure can the GCN module provide decisive features and the model performance improve sharply. Before this point, the performance improvement is relatively gentle.

Moreover, in this study, we utilized t-SNE, a technique proposed by Hinton [33], to visualize high-dimensional implicit features and provide an intuitive understanding of the distribution of original user features. Figure 9 presents a 3D t-SNE visualization that demonstrates the separation between genuine users and malicious attackers within our MVCA framework. The visualization employs a dual-marker scheme, with the blue spheres representing authentic users and the red tetrahedrons denoting attackers.

Notably, while attackers are distributed among legitimate users, they form distinct clusters with characteristic dispersion patterns, reflecting variations in their attack methodologies. These distinct clusters and dispersion patterns are crucial evidence of the effectiveness of our MVCA framework in identifying different types of attackers. The framework can capture the subtle differences in the high-dimensional feature space, which are not easily discernible through conventional methods.

The substantial overlap observed between certain attacker subgroups and genuine user populations suggests sophisticated behavioral emulation strategies employed by attackers. This overlap highlights cases where attackers successfully mimic normal user behavior patterns, thereby evading conventional detection mechanisms. However, despite this overlap, our MVCA framework is still able to identify these attackers by leveraging the unique dispersion patterns and subtle differences in their feature distributions. This demonstrates the robustness and effectiveness of our method in detecting malicious activities even in the presence of sophisticated attacks.

In summary, Figure 9 provides a clear and intuitive visualization of the separation between genuine users and malicious attackers, showcasing the effectiveness of our MVCA framework in identifying and distinguishing different types of attackers based on their high-dimensional implicit features.

4.4. Optimization of Model Hyperparameters and Setting of Key Parameters

To achieve the optimal performance of MVCA, we first conducted cross-validation experiments to tune the critical hyperparameters among the large number of hyperparameters in the model. After all the hyperparameters were determined, we trained the model with the selected combination and evaluated it on the two benchmark test sets. The parameter settings of the proposed method used in the following experiments are shown in Table 5.

When attackers carry out co-access injection attacks, they frequently click the target items, thereby increasing the exposure of those items. Fake users derived from the same group often launch attacks on the same target items, which induces abnormal connectivity in the user–item interaction graph. To capture the connectivity features in such interaction graphs, we stack multiple graph convolution embedding propagation layers. The number of propagation layers (k) in the GCN is therefore a key hyperparameter of the model. To explore its influence on the experimental results, we fixed the other parameters and set the embedding propagation depth (k) to 1, 2, 3, and 4. Figure 10 shows the influence of the number of propagation layers (k) on the classification performance. The experiments show that increasing the embedding propagation depth from 1 to 3 substantially enhances the model's efficacy. When k = 3, the overall performance of all indicators is the best, indicating that third-order embedding propagation can effectively reflect the interaction behavior information between users and items. However, when k = 4, some evaluation indicators decrease to a certain extent compared with k = 3. The reason is that an overly deep network structure may introduce noise, thereby leading to overfitting and a reduced model effect. In general, a three-layer embedding propagation structure is sufficient to obtain interaction information and identify abnormal behavioral patterns.

4.5. Ablation Study

To demonstrate the effectiveness of the proposed modules, we conducted four additional experiments to validate the efficacy of the MVCA. The experimental results on the Netflix dataset are presented in Table 6. In these experiments, we made the following modifications based on the MVCA approach:

(1): We removed the CNN module. Only the features fused by the GCN and RNN through cross-attention were used for classification.
(2): We removed the entire GCN module and its related cross-attention paths. Only the local features of the CNN and the temporal features of the RNN were used.
(3): We removed the entire RNN module and its related cross-attention paths. Only the features of the CNN and GCN were integrated.
(4): We removed the entire bidirectional cross-attention module. The features extracted by the GCN, RNN and CNN were directly concatenated and then input into the final MLP classifier.

As shown in Table 6, the ablation study clarifies the role of each module in the MVCA framework. The CNN-based local view captures fine-grained rating anomalies, such as atypical combinations of item ratings, but its limited receptive field makes it vulnerable to attacks that carefully imitate local statistical patterns. When this module is removed, the F1 score drops from 0.9322 to 0.8872. This is the smallest drop among the three views: the local view provides a basic ability to detect micro-level anomalies, but relying solely on local patterns is insufficient for robust detection. The GRU-based sequential view models temporal dynamics and can detect bursts of activity and synchronized attack windows; however, its discriminative ability is weakened by time-dispersion strategies, which dilute the sequential cues. Removing this view lowers the F1 score to 0.8386, highlighting the importance of capturing temporal patterns. The GCN-based structural view is crucial for revealing potential collusion communities and high-order interaction patterns, but its effectiveness is reduced when attackers deliberately limit their connectivity in the graph, or when the attack group is too sparse to form detectable clusters. Removing it lowers the F1 score to 0.8223, the largest drop of any variant, which indicates that it is the most critical view. Finally, the bidirectional cross-attention mechanism fuses these views adaptively by dynamically adjusting the weights of key signals. Its performance depends on the quality of the input representations: if every view provides only weak discriminative evidence, the attention distribution tends toward uniform and the mechanism reduces to a simple average. Without this fusion mechanism, the F1 score drops to 0.8617, indicating its key role as a coordinator. In summary, the ablation study confirms that all three views and the cross-attention mechanism contribute to the best detection performance.
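The remark that attention collapses to an average when every view is only weakly informative can be checked numerically. The sketch below is illustrative only: the three view vectors and the attention scores are made-up values, not quantities from the MVCA model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical 2-D representations of the three views (one row per view).
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

weak_scores = np.array([0.01, 0.0, -0.01])   # no view stands out
strong_scores = np.array([4.0, 0.0, -1.0])   # the first view carries clear evidence

w_weak = softmax(weak_scores)
w_strong = softmax(strong_scores)

fused_weak = w_weak @ values      # close to the plain mean of the rows
fused_strong = w_strong @ values  # dominated by the first row
plain_mean = values.mean(axis=0)

print("weak weights:  ", np.round(w_weak, 3))
print("strong weights:", np.round(w_strong, 3))
print("fused (weak):  ", np.round(fused_weak, 3), "vs mean", np.round(plain_mean, 3))
```

With nearly equal scores the softmax weights are close to 1/3 each, so the fused vector is almost indistinguishable from the unweighted mean; only when one view scores clearly higher does the fusion become selective.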

4.6. Confusion Matrix Analysis

To evaluate the proposed MVCA framework in more detail, we present the confusion matrix for a representative experimental setting. Since the average fill rate of the MovieLens 10M dataset is 1.34%, we used a subset of the MovieLens dataset with a fill rate of 1.34% for the experiment. This setting brings the data distribution of the attack profiles in line with that of the real profiles, which increases the difficulty of detection. The resulting confusion matrix is shown in Figure 11. The model produces large numbers of true positives and true negatives, indicating a strong ability to distinguish between real users and attackers.

Based on the confusion matrix, we further computed the True Positive Rate (TPR) and False Positive Rate (FPR), two critical metrics for imbalanced detection tasks. The TPR reaches 0.945, indicating that most attack users are correctly identified, while the FPR is as low as 0.038, indicating that few genuine users are misclassified as attackers. These results support the robustness and reliability of MVCA under this setting.
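For reference, both rates follow directly from the four cells of a confusion matrix. In the sketch below the counts are hypothetical, chosen only so that the rates match the values reported above; they are not the counts shown in Figure 11.

```python
def detection_rates(tp, fn, fp, tn):
    """Return (TPR, FPR) from confusion-matrix counts, with attackers as the positive class."""
    tpr = tp / (tp + fn)  # share of attackers that are detected
    fpr = fp / (fp + tn)  # share of genuine users wrongly flagged
    return tpr, fpr

# Hypothetical counts: 1000 attackers and 1000 genuine users.
tpr, fpr = detection_rates(tp=945, fn=55, fp=38, tn=962)
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")  # TPR = 0.945, FPR = 0.038
```

Because the TPR is normalized by the number of attackers and the FPR by the number of genuine users, neither rate depends on the class ratio, which is why they are preferred to accuracy when attackers are a small minority.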

4.7. Time Overhead

To characterize the computational cost of the model, we analyzed its runtime overhead module by module. We define graph structure construction and GCN feature extraction as Module A, sequence modeling and sequential feature extraction as Module B, CNN local feature extraction as Module C, the interactive attention fusion module as Module D, and detector training and prediction as Module E. The time costs of each module under the same system conditions are shown in Table 7.

The experimental results suggest that MVCA scales acceptably in terms of time consumption, with its computational bottlenecks concentrated in the sequence modeling module (Module B) and the cross-attention fusion module (Module D). Together these two modules account for roughly three quarters of the total running time on both datasets (21.0 of 28.4 min on Netflix and 97.3 of 129.8 min on MovieLens 10M). Although multi-view fusion introduces a higher computational cost, the resulting improvement in detection performance is significant. In future work, time efficiency could be further improved through methods such as model pruning, knowledge distillation, or lightweight attention mechanisms to meet the deployment requirements of larger-scale real-time recommender systems.
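The share of time spent in the two bottleneck modules can be derived from the Table 7 figures with a few lines of arithmetic. The variable and function names below are illustrative; the timings are copied from Table 7.

```python
# Per-module running time in minutes, as reported in Table 7.
minutes = {
    "Netflix": {"A": 3.1, "B": 12.3, "C": 2.5, "D": 8.7, "E": 1.8},
    "MovieLens 10M": {"A": 14.1, "B": 51.6, "C": 11.5, "D": 45.7, "E": 6.9},
}

def bottleneck_share(times, modules=("B", "D")):
    """Fraction of the total time spent in the given modules."""
    return sum(times[m] for m in modules) / sum(times.values())

totals = {name: round(sum(t.values()), 1) for name, t in minutes.items()}
shares = {name: round(bottleneck_share(t), 3) for name, t in minutes.items()}

for name in minutes:
    print(f"{name}: total {totals[name]} min, modules B+D take {shares[name]:.1%}")
```

On both datasets the sequence modeling and cross-attention modules take about 74 to 75 percent of the total, so they are the natural targets for pruning or lightweight attention.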

5. Conclusions

In this paper, we propose a multi-view cross-attention framework for shilling attack detection in recommender systems. The framework integrates three complementary views (user–item interaction graphs, temporal behavior sequences, and local rating patterns) and employs a bidirectional cross-attention mechanism to enable deep interaction among them. This design allows the model to capture complex and evolving attack behaviors more effectively than single-view approaches. Extensive experiments show that MVCA generally outperforms competitive baselines. On the MovieLens 10M dataset, it performs robustly under different attack scales and types, and the experiments on Netflix further verify its effectiveness. Ablation studies confirm the necessity of each module: removing any single view or the cross-attention mechanism leads to a noticeable drop in detection performance, highlighting the importance of multi-view synergy.

However, this method still has some limitations. The model complexity is relatively high and the computational overhead is large, so it may face efficiency challenges when deployed in large-scale systems. Moreover, the model depends heavily on the completeness of the multi-view data: if the information of a view is missing or of poor quality, the overall detection performance may degrade. Future work will optimize the model structure and introduce lightweight designs or knowledge distillation to improve efficiency and to further enhance the generalization ability and practicality of the model.

Author Contributions

Z.Z.: funding acquisition, supervision, project administration, resources, formal analysis, writing—review and editing. C.X.: writing—original draft, writing—review and editing, methodology, software, validation. Y.L.: writing—original draft, writing—review and editing, conceptualization, visualization, methodology, formal analysis. S.S.: data curation, investigation, validation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the National Natural Science Foundation of China under Grant 61906104 and Grant 61502262.

Data Availability Statement

Data is available upon request from the authors.

Acknowledgments

The authors greatly appreciate the comments from the reviewers, which helped improve the quality of this paper.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Figure 1. The detection framework of MVCA.

Figure 2. The architecture of the CNN-based feature module.

Figure 3. The GRU structure for information preservation.

Figure 4. The architectural design of the GCN-based interaction feature extractor.

Figure 5. Comparison of detection performances for different methods on Netflix dataset.

Figure 6. Five methods for detecting random attacks on MovieLens dataset.

Figure 7. Five methods for detecting bandwagon attacks on MovieLens dataset.

Figure 8. Five methods for detecting average attacks on MovieLens dataset.

Figure 9. 3D visualization of MVCA user behavior.

Figure 10. Analysis of influence of number of propagation layers (k) in GCN.

Figure 11. Confusion matrix of MVCA based on MovieLens dataset.

Table 1. Division and construction tactics of attack models.

$r_{\max}$ and $r_{\min}$ represent the highest and lowest rating values in the collaborative recommender system. $\mu$ and $\mu_{i}$ represent the mean ratings of all items and of item $i$, respectively. $\sigma$ and $\sigma_{i}$ represent the standard deviations of the ratings of all items and of item $i$, respectively. $N(\mu, \sigma)$ denotes a random number drawn from a normal distribution. $r_{i} = r_{\max}$ for push attacks and $r_{i} = r_{\min}$ for nuke attacks.

Attack Models	$I_{S}$	$I_{F}$	$I_{N}$	$I_{t}$
Random	$\emptyset$	Items are randomly selected from $I - I_{t}$. For $i \in I_{F}$, $r_{i} \sim N(\mu, \sigma)$	Do not rate items in $I_{N}$	For $i \in I_{t}$, $r_{i} = r_{\max}$ or $r_{\min}$
Average	$\emptyset$	Items are randomly selected from $I - I_{t}$. For $i \in I_{F}$, $r_{i} \sim N(\mu_{i}, \sigma_{i})$	Do not rate items in $I_{N}$	For $i \in I_{t}$, $r_{i} = r_{\max}$ or $r_{\min}$
Bandwagon	K most popular items. For $i \in I_{S}$, $r_{i} = r_{\max}$	Items are randomly selected from $I - I_{S} - I_{t}$. For $i \in I_{F}$, $r_{i} \sim N(\mu, \sigma)$
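The construction rules in Table 1 can be sketched as a small profile generator. This is a hedged illustration, not the authors' data pipeline: the function name, its parameters, the 1 to 5 rating scale, and the rounding and clipping of sampled ratings are assumptions, and the bandwagon target item is assumed to follow the same push/nuke rule as the other two attacks.

```python
import numpy as np

def make_attack_profile(kind, item_mean, item_std, global_mean, global_std,
                        target, n_filler, popular=(), push=True,
                        r_min=1.0, r_max=5.0, rng=None):
    """Build one fake user's rating vector; 0 marks an unrated item (I_N)."""
    rng = np.random.default_rng() if rng is None else rng
    n_items = len(item_mean)
    profile = np.zeros(n_items)

    # I_S: selected items, used only by the bandwagon attack (rated r_max).
    selected = np.array(popular if kind == "bandwagon" else (), dtype=int)
    profile[selected] = r_max

    # I_F: filler items drawn at random from I - I_S - I_t.
    candidates = np.setdiff1d(np.arange(n_items), np.append(selected, target))
    filler = rng.choice(candidates, size=n_filler, replace=False)
    if kind == "average":
        raw = rng.normal(item_mean[filler], item_std[filler])     # N(mu_i, sigma_i)
    else:
        raw = rng.normal(global_mean, global_std, size=n_filler)  # N(mu, sigma)
    # Assumption: sampled ratings are rounded and clipped to the rating scale.
    profile[filler] = np.clip(np.rint(raw), r_min, r_max)

    # I_t: the target item gets r_max for a push attack and r_min for a nuke attack.
    profile[target] = r_max if push else r_min
    return profile

rng = np.random.default_rng(42)
item_mean = rng.uniform(2.0, 4.5, size=20)  # hypothetical per-item statistics
item_std = rng.uniform(0.5, 1.2, size=20)
profile = make_attack_profile("bandwagon", item_mean, item_std, 3.5, 1.0,
                              target=19, n_filler=5, popular=(0, 1, 2), rng=rng)
print(profile)
```

Passing `kind="random"` or `kind="average"` leaves the selected set empty and changes only how the filler ratings are sampled, which mirrors the difference between the rows of Table 1.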

Table 2. The results of the MVCA under average attack with different attack sizes and filler sizes.

PFillerSize \ PAttackSize	Metric	3%	5%	7%	10%	15%
1%	Precision	0.7806 ± 0.0212	0.8051 ± 0.0237	0.8449 ± 0.0162	0.8751 ± 0.0085	0.9343 ± 0.0042
	Recall	0.8207 ± 0.0194	0.8430 ± 0.0164	0.8652 ± 0.0124	0.9051 ± 0.0073	0.9252 ± 0.0053
	F1-score	0.8382 ± 0.0221	0.7841 ± 0.0129	0.8741 ± 0.0133	0.9246 ± 0.0075	0.9148 ± 0.0081
5%	Precision	0.8112 ± 0.0152	0.8662 ± 0.0095	0.8817 ± 0.0094	0.9277 ± 0.0052	0.9707 ± 0.0043
	Recall	0.8013 ± 0.0183	0.8768 ± 0.0082	0.9121 ± 0.0087	0.9182 ± 0.0077	0.9816 ± 0.0034
	F1-score	0.8190 ± 0.0142	0.8778 ± 0.0063	0.9373 ± 0.0064	0.9403 ± 0.0081	0.9647 ± 0.0043
7%	Precision	0.8456 ± 0.0872	0.8631 ± 0.0085	0.8720 ± 0.0042	0.9316 ± 0.0054	0.9649 ± 0.0024
	Recall	0.8157 ± 0.0103	0.8939 ± 0.0072	0.9023 ± 0.0086	0.9417 ± 0.0032	0.9749 ± 0.0048
	F1-score	0.8256 ± 0.0076	0.8906 ± 0.0058	0.8509 ± 0.0052	0.9213 ± 0.0041	0.9647 ± 0.0039
10%	Precision	0.8328 ± 0.0033	0.8566 ± 0.0042	0.9052 ± 0.0026	0.9410 ± 0.0062	0.9776 ± 0.0025
	Recall	0.8231 ± 0.0027	0.8772 ± 0.0035	0.9253 ± 0.0028	0.9613 ± 0.0074	0.9877 ± 0.0031
	F1-score	0.8720 ± 0.0041	0.8350 ± 0.0038	0.9392 ± 0.0031	0.9496 ± 0.0030	0.9772 ± 0.0015

Table 3. The results of the MVCA under random attack with different attack sizes and filler sizes.

PFillerSize \ PAttackSize	Metric	3%	5%	7%	10%	15%
1%	Precision	0.7742 ± 0.0223	0.7982 ± 0.0194	0.8424 ± 0.0125	0.9372 ± 0.0077	0.9542 ± 0.0032
	Recall	0.7903 ± 0.0205	0.8132 ± 0.0204	0.8631 ± 0.0096	0.9451 ± 0.0062	0.9553 ± 0.0047
	F1-score	0.7967 ± 0.0274	0.7942 ± 0.0155	0.8422 ± 0.0168	0.9346 ± 0.0058	0.9448 ± 0.0074
5%	Precision	0.8122 ± 0.0116	0.8334 ± 0.0087	0.8962 ± 0.0092	0.9582 ± 0.0052	0.9701 ± 0.0051
	Recall	0.7922 ± 0.0134	0.8432 ± 0.0072	0.9021 ± 0.0073	0.9701 ± 0.0057	0.9613 ± 0.0042
	F1-score	0.7809 ± 0.0098	0.8613 ± 0.0063	0.9073 ± 0.0042	0.9649 ± 0.0043	0.9627 ± 0.0026
7%	Precision	0.8073 ± 0.0078	0.8342 ± 0.0076	0.8891 ± 0.0043	0.9245 ± 0.0025	0.9515 ± 0.0032
	Recall	0.8057 ± 0.0087	0.8239 ± 0.0054	0.8934 ± 0.0058	0.9117 ± 0.0056	0.9448 ± 0.0056
	F1-score	0.7956 ± 0.0065	0.8406 ± 0.0049	0.8902 ± 0.0061	0.9213 ± 0.0024	0.9435 ± 0.0033
10%	Precision	0.8461 ± 0.0032	0.9042 ± 0.0054	0.9212 ± 0.0028	0.9432 ± 0.0045	0.9712 ± 0.0018
	Recall	0.8531 ± 0.0062	0.9172 ± 0.0042	0.9123 ± 0.0035	0.9321 ± 0.0031	0.9607 ± 0.0034
	F1-score	0.8720 ± 0.0048	0.9030 ± 0.0023	0.9226 ± 0.0029	0.9366 ± 0.0028	0.9783 ± 0.0022

Table 4. The results of the MVCA under bandwagon attack with different attack sizes and filler sizes.

PFillerSize \ PAttackSize	Metric	3%	5%	7%	10%	15%
1%	Precision	0.8270 ± 0.0174	0.8721 ± 0.0142	0.9142 ± 0.0080	0.9423 ± 0.0056	0.9751 ± 0.0044
	Recall	0.8312 ± 0.0213	0.8804 ± 0.0165	0.9104 ± 0.0098	0.9364 ± 0.0041	0.9633 ± 0.0075
	F1-score	0.8256 ± 0.0142	0.8672 ± 0.0107	0.9230 ± 0.0072	0.9421 ± 0.0049	0.9721 ± 0.0053
5%	Precision	0.8421 ± 0.0108	0.8843 ± 0.0084	0.9012 ± 0.0069	0.9477 ± 0.0054	0.9761 ± 0.0042
	Recall	0.8322 ± 0.0086	0.8921 ± 0.0063	0.9033 ± 0.0082	0.9508 ± 0.0058	0.9724 ± 0.0065
	F1-score	0.8107 ± 0.0112	0.8732 ± 0.0054	0.8938 ± 0.0064	0.9402 ± 0.0071	0.9807 ± 0.0042
7%	Precision	0.8251 ± 0.0058	0.8723 ± 0.0048	0.9212 ± 0.0043	0.9326 ± 0.0031	0.9584 ± 0.0039
	Recall	0.7923 ± 0.0061	0.8932 ± 0.0061	0.9137 ± 0.0050	0.9287 ± 0.0024	0.9532 ± 0.0053
	F1-score	0.7944 ± 0.0046	0.8955 ± 0.0052	0.9205 ± 0.0033	0.9343 ± 0.0036	0.9621 ± 0.0045
10%	Precision	0.8451 ± 0.0041	0.8963 ± 0.0041	0.9211 ± 0.0062	0.9421 ± 0.0037	0.9801 ± 0.0042
	Recall	0.8424 ± 0.0025	0.8762 ± 0.0038	0.9301 ± 0.0043	0.9361 ± 0.0051	0.9789 ± 0.0029
	F1-score	0.8122 ± 0.0039	0.8745 ± 0.0047	0.9250 ± 0.0055	0.9452 ± 0.0028	0.9773 ± 0.0038

Table 5. Hyperparameters used in this study.

Hyperparameter	Setting
Optimizer	RMSProp
Learning rate	0.0002
Dropout	0.5
Epoch	500
Batch size	1024
Reg.lambda -u and -i	0.8/0.4
Gamma	1
Kernel size	5
Kernel number	64
Dimension of hidden layer in GRU	128
Length of behavioral sequence	50
Number of propagation layers (k)	3
Cross-attention dimension	128
Number of attention heads	[4, 8]

Table 6. Comparison of experiment results between MVCA and its variants.

Method	Precision	Recall	F1-Measure
w/o Local View	0.9025 ± 0.0148	0.8951 ± 0.0134	0.8872 ± 0.0147
w/o Sequential View	0.8380 ± 0.0152	0.8305 ± 0.0140	0.8386 ± 0.0095
w/o Structural View	0.7958 ± 0.0126	0.8106 ± 0.0134	0.8223 ± 0.0121
w/o Cross-Attention	0.8758 ± 0.0053	0.8825 ± 0.0062	0.8617 ± 0.0074
MVCA	0.9341 ± 0.0052	0.9205 ± 0.0041	0.9322 ± 0.0038

The best results for each metric are highlighted in bold.
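The contribution of each component can be read off Table 6 as the F1 lost when that component is removed. The short script below performs this subtraction on the mean F1 values from the table; the variable names are illustrative.

```python
# Mean F1 values copied from Table 6 (Netflix dataset).
f1 = {
    "MVCA": 0.9322,
    "w/o Local View": 0.8872,
    "w/o Sequential View": 0.8386,
    "w/o Structural View": 0.8223,
    "w/o Cross-Attention": 0.8617,
}

# F1 lost relative to the full model when each component is removed.
drops = {name: round(f1["MVCA"] - score, 4)
         for name, score in f1.items() if name != "MVCA"}

ranking = sorted(drops, key=drops.get, reverse=True)
for name in ranking:
    print(f"{name}: F1 drop = {drops[name]:.4f}")
```

The ranking places the structural view first (0.1099), followed by the sequential view (0.0936), cross-attention (0.0705), and the local view (0.0450).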

Table 7. Computational time of different modules on two datasets.

Module	Netflix	MovieLens 10M
A	3.1 min	14.1 min
B	12.3 min	51.6 min
C	2.5 min	11.5 min
D	8.7 min	45.7 min
E	1.8 min	6.9 min

Share and Cite

MDPI and ACS Style

Zhai, Z.; Xu, C.; Li, Y.; Su, S. MVCA: Multi-View Cross-Attention Framework for Robust Shilling Attack Detection. Symmetry 2026, 18, 497. https://doi.org/10.3390/sym18030497