4.1. Experimental Datasets
In our experiments, we used two widely adopted public datasets to validate our method: MovieLens 10M [30] and Netflix [31], both of which contain real user profiles. Each record holds the attribute information for one user–item interaction. The original data include a user ID, an item ID, user features, item features, a timestamp, and a label for each record. The attack profiles added to both datasets were generated by attack models. The details of each dataset are as follows:
MovieLens 10M: This dataset contains 1120 users, 10,657 movies and 199,040 ratings. Each user has at least 20 ratings. The rating values are between 1 and 5. The MovieLens dataset does not include any attacker profiles. For the purposes of this research, without loss of generality, fake profiles generated by attack models were injected into the dataset.
Netflix: This is a competition dataset covering 17,770 movies and 480,189 users; each record is a quadruple of user ID, item ID, rating and timestamp. All ratings are integers between 1 and 5 and were collected between October 1998 and December 2005. We randomly selected 215,884 ratings from 2000 users on 4000 movies as the experimental dataset. As with MovieLens, and without loss of generality, labeled fake profiles generated by the attack models were injected into the dataset.
In the attack detection phase, we split the original training set into a new training set and a validation set at a ratio of 8:2; the test set is the one provided with the dataset. We processed the datasets to generate attack profiles with the attack sizes and filler sizes we needed, yielding experimental datasets at different attack scales and fill scales. To ensure stable and reliable results and to reduce the influence of random factors, we also performed random sampling without replacement on the MovieLens 10M dataset, repeating this operation for each set. All reported results are averages over five independent runs, and the standard deviations of the key metrics are also reported to indicate the stability of the results.
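The splitting and averaging protocol described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `split_train_validation` and `summarize_runs`, the toy samples, and the per-run scores are all hypothetical placeholders.

```python
import random
import statistics

def split_train_validation(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of `samples` and split it into training and validation parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def summarize_runs(scores):
    """Return the mean and the sample standard deviation of per-run scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical labelled samples: (user_id, label), where label 1 marks an attacker.
samples = [(u, int(u % 10 == 0)) for u in range(100)]
train, val = split_train_validation(samples, train_ratio=0.8, seed=42)

# Placeholder scores for five independent runs (illustrative values, not results).
f1_runs = [0.95, 0.96, 0.94, 0.95, 0.95]
mean_f1, std_f1 = summarize_runs(f1_runs)
print(f"train={len(train)} val={len(val)} F1={mean_f1:.4f} +/- {std_f1:.4f}")
```

Fixing the seed per run makes each of the five runs reproducible while still letting the runs differ from one another.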
4.3. Experiment Results and Analysis
To verify the validity of our method, we used the following works as comparative methods in our experiments:
- (1) CoDetector [28]: This model integrates matrix factorization and word embedding techniques to uncover latent user features from both the user–user co-occurrence matrix and the user–item co-occurrence matrix, which facilitates the detection of attack behaviors in recommender systems.
- (2) CNN-LSTM [32]: This hybrid deep learning model automatically extracts deep features from user ratings with CNNs and uses an LSTM to learn sequence dependencies in order to detect shilling attacks in recommender systems.
- (3) DGA-MFCA [29]: This shilling attack detection approach first groups users based on their characteristics and then detects attackers within these groups using Gaussian-RBF analysis of behavior patterns.
- (4) USG-SAD [25]: This supervised learning method constructs a user relation graph by computing user similarity through an integrated measure of rating-behavior correlation and deviation. User embeddings are then generated using the Node2Vec algorithm. For attack detection, the model applies a graph convolutional network (GCN) to the user similarity graph, where it learns to assign importance weights to users for classification.
- (5) GraphRFI [23]: This model combines a GCN and an NRF to perform recommendation and fraud detection jointly through graph structure modeling and random forest classification.
- (6) ITRN [32]: This method segments each item's rating time series at key points, uses second-order differences to construct a cube for anomalous interval detection, and builds a bipartite graph of suspicious users and items. It then uses LightGCN to learn high-order neighbor features and determines whether a user is an attacker through a linear layer and the Sigmoid function.
Figure 5 compares the detection performance of MVCA with that of the six baseline methods (CoDetector, CNN-LSTM, DGA-MFCA, USG-SAD, GraphRFI and ITRN) on the Netflix dataset.
As shown in Figure 5, MVCA detects attacking users in recommender systems more effectively than the baselines, which demonstrates the effectiveness of the multi-view cross-attention framework for shilling attack detection. The MVCA framework integrates three complementary views: local rating patterns modeled by a CNN, temporal behavior sequences modeled by an RNN, and the high-order graph structure modeled by a GCN. A bidirectional cross-attention mechanism provides deep information interaction and dynamic weight allocation among the views, so that the views are optimized jointly toward the detection objective during end-to-end training and the multi-dimensional characteristics of attack behaviors are captured comprehensively. In contrast, CoDetector and DGA-MFCA, which rely on matrix factorization or statistical features, cannot model temporal dynamics or structural correlations. Although CNN-LSTM can extract deep features, it ignores the high-order collaborative relationships among users. Pure graph methods such as USG-SAD and GraphRFI fail to exploit temporal bursts and local anomaly signals, and ITRN fuses features only by shallow concatenation and so cannot capture nonlinear dependencies across views. Because they are limited to a single view or to static fusion, the baseline methods have difficulty identifying the local anomalies, temporal aggregation, and structural concealment of attacks at the same time. MVCA, by contrast, adaptively correlates the key cues from the different views through the attention mechanism, which improves the discriminative power and robustness of detection and yields better generalization in complex attack scenarios.
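To make the idea of bidirectional cross-attention between two views concrete, the following minimal sketch applies standard scaled dot-product attention in both directions. It rests on several assumptions: the source does not give the exact formulation used in MVCA, learned projection matrices are omitted, the residual addition is a common design choice rather than a confirmed one, and the names `cross_attend`, `bidirectional_cross_attention`, `gcn_view` and `gru_view` as well as the feature values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Scaled dot-product attention: queries come from one view, keys/values from another.
    Learned projections are omitted (identity) to keep the sketch short."""
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, keys_values)) for c in range(d)])
    return outputs, weights

def bidirectional_cross_attention(view_a, view_b):
    """Let each view attend to the other and add the result back as a residual."""
    a_ctx, w_ab = cross_attend(view_a, view_b)
    b_ctx, w_ba = cross_attend(view_b, view_a)
    fused_a = [[x + y for x, y in zip(row, ctx)] for row, ctx in zip(view_a, a_ctx)]
    fused_b = [[x + y for x, y in zip(row, ctx)] for row, ctx in zip(view_b, b_ctx)]
    return fused_a, fused_b, w_ab, w_ba

# Hypothetical feature tokens (dimension 4) produced by two of the views.
gcn_view = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
gru_view = [[0.9, 0.1, 0.4, 0.0], [0.0, 0.2, 0.1, 0.9], [0.1, 0.8, 0.0, 0.3]]

fused_gcn, fused_gru, w_gcn_to_gru, w_gru_to_gcn = bidirectional_cross_attention(gcn_view, gru_view)
for row in w_gcn_to_gru:
    print([round(w, 3) for w in row])
```

With three views, the same routine could be applied to each pair of views; how MVCA combines the pairwise outputs is not specified in this section.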
Because the objective of a recommendation attack is to raise or lower the rating of a target item and thereby increase or decrease how often it is recommended, we used the MovieLens 10M dataset and adopted three commonly used attack models (the random attack, the average attack, and the popularity attack) to generate the attack data, since these models require little knowledge of the recommender system. We injected the generated attack data into the system to simulate the shilling attack process. Unlike ordinary shilling attack data, our setting also required trust relationships for the attackers, and newly injected attack users have no trust relationship data. We therefore randomly selected trustees for each attacker. Just as the size of the attacker's profile is determined from system averages, we set the size of each attacker's trust list to the average size of the users' trust lists in the original system. To simulate real users and produce more effective attacks, the trustees were drawn from users with high trust values.
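The three attack models named above can be sketched as follows, using their common textbook definitions: random attacks rate filler items around the global mean, average attacks rate them around each item's mean, and popularity attacks additionally give the maximum rating to the most-rated items. The function `make_profile`, the unit standard deviation of the rating noise, the number of popular items, and the toy rating data are assumptions for illustration; the generation of trust lists is not covered.

```python
import random
from collections import defaultdict

R_MIN, R_MAX = 1, 5  # rating scale used by both datasets

def clip_rating(x):
    """Round to an integer rating and clip it to the valid scale."""
    return int(min(R_MAX, max(R_MIN, round(x))))

def item_stats(ratings):
    """ratings: list of (user, item, rating). Returns item means, item counts, global mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, r in ratings:
        sums[item] += r
        counts[item] += 1
    means = {i: sums[i] / counts[i] for i in sums}
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    return means, counts, global_mean

def make_profile(kind, ratings, target, n_filler, rng, n_popular=2, push=True):
    """Build one fake profile {item: rating} for a 'random', 'average' or 'popularity' attack."""
    means, counts, global_mean = item_stats(ratings)
    candidates = [i for i in means if i != target]
    profile = {target: R_MAX if push else R_MIN}  # push or nuke the target item
    if kind == "popularity":
        # Selected items: the most frequently rated items receive the maximum rating.
        popular = sorted(candidates, key=lambda i: counts[i], reverse=True)[:n_popular]
        for i in popular:
            profile[i] = R_MAX
        candidates = [i for i in candidates if i not in popular]
    for i in rng.sample(candidates, min(n_filler, len(candidates))):
        centre = means[i] if kind == "average" else global_mean
        profile[i] = clip_rating(rng.gauss(centre, 1.0))  # assumed unit standard deviation
    return profile

# Toy rating data (hypothetical): items 0 and 1 are rated by every user, so they are popular.
data_rng = random.Random(7)
toy = [(u, i, data_rng.randint(1, 5))
       for u in range(20) for i in range(8) if (u + i) % 3 != 0 or i < 2]

profiles = {kind: make_profile(kind, toy, target=7, n_filler=3, rng=random.Random(0))
            for kind in ("random", "average", "popularity")}
for kind, profile in profiles.items():
    print(kind, dict(sorted(profile.items())))
```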
To comprehensively evaluate performance under attacks of different intensities, we set the filler sizes of the three simulated shilling attacks to 3%, 5%, 7%, 10% and 15%, corresponding to attack sizes of 1%, 5%, 7%, 10% and 10%, respectively. The experimental results are shown in Table 2, Table 3 and Table 4.
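To give a sense of scale, the percentages above can be converted into counts. The sketch below assumes the common convention that attack size is the ratio of injected profiles to genuine users and filler size is the ratio of filler items to all items; the source does not state its definitions explicitly. The user and item counts are those reported for the MovieLens data in Section 4.1, and `attack_budget` is a hypothetical helper.

```python
def attack_budget(n_genuine_users, n_items, attack_pct, filler_pct):
    """Convert attack size and filler size (in percent) into profile and filler-item counts."""
    n_profiles = round(n_genuine_users * attack_pct / 100)
    n_fillers = round(n_items * filler_pct / 100)
    return n_profiles, n_fillers

N_USERS, N_ITEMS = 1120, 10657       # MovieLens figures from Section 4.1
filler_pcts = [3, 5, 7, 10, 15]      # filler sizes as listed in the text
attack_pcts = [1, 5, 7, 10, 10]      # attack sizes as listed in the text

for attack_pct, filler_pct in zip(attack_pcts, filler_pcts):
    profiles_needed, fillers_needed = attack_budget(N_USERS, N_ITEMS, attack_pct, filler_pct)
    print(f"attack {attack_pct:>2}% / filler {filler_pct:>2}%: "
          f"{profiles_needed} fake profiles, {fillers_needed} filler items each")
```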
As can be seen from Table 2, Table 3 and Table 4, MVCA maintained stable and strong performance across the different attack sizes and filler sizes under the average, random and popularity attacks. Moreover, as the attack size and filler size increased, the accuracy, recall and F1 score generally trended upward. For instance, when the attack size was 15% and the filler size was 10%, the F1 scores for all attack types exceeded 0.94, reaching 0.9783 under the popularity attack. This indicates that detection improves as the attack scale grows: a larger attack makes the attack behavior more conspicuous, so the multi-view feature extraction and the bidirectional cross-attention mechanism can more easily fuse information from the different views and capture the attack features more comprehensively. Furthermore, the model remains robust across attack types, which supports its ability to adapt to complex attack strategies by deeply integrating the structural, temporal and local views.
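For reference, the metrics discussed above can be computed from binary labels as in the sketch below, where label 1 denotes an attacker. Precision is included alongside accuracy because F1 is defined from precision and recall; the function name `detection_metrics` and the toy labels are hypothetical.

```python
def detection_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary attacker-detection task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: four attackers among ten users, one missed and one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = detection_metrics(y_true, y_pred)
print(metrics)
```

Because attackers are a small minority of users, accuracy alone can look high even when many attackers are missed, which is why recall and F1 are reported as well.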
To further demonstrate the robustness of MVCA, we compared its performance with that of the six baseline methods in different scenarios using the MovieLens dataset. The experimental results are shown in Figure 6, Figure 7 and Figure 8.
As shown in Figure 6, MVCA achieves superior performance in detecting random attacks. Although a baseline such as CNN-LSTM can extract deep features and learn sequential dependencies, it does not model the high-order relationships in the user–item interaction graph and therefore cannot capture the collaborative patterns formed by attackers. USG-SAD and GraphRFI, which are pure graph methods, can identify structural anomalies, but they ignore bursts of ratings in the time dimension and have difficulty detecting attacks that are dispersed over time. MVCA succeeds for the following reason: when random attackers select filler items at random and assign them random ratings, the CNN module detects the abnormal local rating distribution, the GRU module captures the bursty rating behavior in the time series, and the GCN module uncovers the implicit connections between attackers through high-order neighbor aggregation. The cross-attention mechanism dynamically fuses these three complementary signals into a more robust basis for discrimination than single-view methods provide. Failure cases occur mainly at the smallest attack size (1%) combined with a low filler size (3%). In this setting, the attack signal is too weak for the three views to extract sufficient discriminative information. Statistical-feature methods such as CoDetector and DGA-MFCA face the same signal-to-noise problem under sparse attacks. The limitation of MVCA is particularly evident here: when attackers are very few and their rating behavior is highly dispersed, the cross-attention mechanism lacks sufficient evidence to establish effective associations between views, the attention weights tend toward a uniform distribution, and the model degenerates into simple feature concatenation, which significantly reduces its discriminative ability.
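The degeneration toward uniform attention weights can be made explicit under the assumption, not confirmed by the source, that the cross-attention uses standard scaled dot-product weights. Here $q_i$ is a query from one view, $k_j$ and $v_j$ are the keys and values from another view, $n$ is the number of keys, and $d$ is the feature dimension; all symbols are introduced for this illustration.

```latex
\alpha_{ij} = \frac{\exp\!\left(q_i^{\top} k_j / \sqrt{d}\right)}
                   {\sum_{l=1}^{n} \exp\!\left(q_i^{\top} k_l / \sqrt{d}\right)},
\qquad
o_i = \sum_{j=1}^{n} \alpha_{ij}\, v_j .

% If the scores carry no discriminative signal, i.e. q_i^T k_j \approx c_i for every j:
\alpha_{ij} \approx \frac{e^{c_i/\sqrt{d}}}{n\, e^{c_i/\sqrt{d}}} = \frac{1}{n},
\qquad
o_i \approx \frac{1}{n} \sum_{j=1}^{n} v_j .
```

In this limit every query receives the same average-pooled summary of the other view, so the fused representation no longer depends on which cues the querying view found suspicious. This matches the behavior of a fixed concatenation of pooled features.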
According to Figure 7, MVCA is the most robust method in detecting bandwagon (popularity) attacks, maintaining an F1-score above 0.92 across all attack scales and significantly outperforming methods such as ITRN and CNN-LSTM. Although ITRN uses second-order differences to build a time cube and locate abnormal intervals, its shallow concatenation fusion strategy fails to establish a nonlinear correlation between temporal features and graph structure features, which leads to many false negatives when attackers imitate normal users by giving high ratings to popular items. CNN-LSTM, in contrast, lacks a global structural view and thus has difficulty identifying the group behavior in which attackers deliberately connect to popular items to enhance their credibility. The key to the success of MVCA is that popularity attackers form a clear structural aggregation in the GCN view, show temporal synchrony in the GRU view, and present specific local rating patterns in the CNN view. The cross-attention mechanism correlates these cross-view cues through bidirectional interaction; for example, when the GCN detects a group of users that is closely connected in the graph structure, the attention mechanism increases the weight of that group's temporal anomalies in the GRU view. Simple feature concatenation methods (such as DGA-MFCA) cannot achieve this dynamic collaboration.
However, MVCA has specific limitations against such attacks. When attackers adopt a “slow popularity attack” strategy, that is, deliberately extending the attack time window, dispersing their rating times to evade the temporal detection of the GRU, and limiting their connections to popular items to reduce their structural significance in the GCN, the cross-attention mechanism has difficulty establishing strong correlations between views, and the model may mistake the attackers for genuine users with normal preferences for popular items. Moreover, if the system itself has many genuine enthusiasts of popular items, their structural aggregation and temporal activity closely resemble those of attackers, and MVCA faces a higher risk of false positives.
According to Figure 8, the average attack poses the greatest challenge for MVCA, although it still outperforms all baseline methods overall. The underlying reason is the “perfect camouflage” of the average attack: attackers set the ratings of filler items close to each item's average, which renders detection methods based on statistical anomalies (such as matrix factorization in CoDetector and Bayesian inference in BayesDetector) almost ineffective. At the same time, dispersing ratings over time weakens pure sequence models (such as CNN-LSTM), and imitating the rating patterns of normal users makes it hard for CNN methods that rely solely on local features to distinguish attackers. In this scenario, the advantage of MVCA over the other methods comes mainly from the high-order structural information of the GCN view: when the attack size exceeds 5%, the attackers act in concert on the target item to strengthen the attack, which produces detectable community structures in the user–item interaction graph. The cross-attention mechanism can then automatically reduce its reliance on the CNN and GRU views and increase the weight of the GCN view, achieving adaptive feature selection. In contrast, USG-SAD uses a GCN but lacks deep interaction with sequence features and cannot use temporal information to corroborate structural anomalies, and GraphRFI combines the GCN with random forests, but its static cascading design limits the information flow between views. The failure cases of MVCA are concentrated in combinations of small attack size and low filler size: the rating behavior of average attackers closely resembles that of real users, and when attackers are few they do not form significant clusters in the graph structure, so the discriminative ability of all three views is limited simultaneously.
Here, the fundamental limitation of MVCA is clearly exposed. When an attacker carefully designs profiles that evade all three views simultaneously (a normal local rating distribution, dispersed temporal behavior, and sparse graph connections), the cross-attention mechanism reaches an impasse because no single view can provide reliable query cues to guide feature selection in the other views. In this case, even increasing the number of attention heads or adjusting the network depth is unlikely to overcome what is an information-theoretic limit on discrimination.
Overall, the advantage of MVCA lies in multi-view fusion. However, when a certain type of attack sharply reduces the discriminative power of one or two views, the model's performance is compromised. Although the cross-attention mechanism attempts to adjust the weights dynamically, the absence or weakening of a core information source inevitably makes the model less sensitive to this type of attack than to others, causing greater performance fluctuations. For instance, in average attacks, attackers greatly reduce the anomalies in local rating patterns and temporal behavior by imitating the average ratings of normal users, so the model relies mainly on the GCN view to find the “graph structure evidence” of collaborative attacks. Only when the attack reaches a certain scale and the attackers form a sufficiently tight and identifiable cluster in the graph structure can the GCN module provide decisive features, at which point the model's performance improves sharply. Before this point, the performance improves only gradually.
Moreover, in this study, we used t-SNE, a technique proposed by van der Maaten and Hinton [33], to visualize the high-dimensional latent features and provide an intuitive view of the distribution of the original user features.
Figure 9 presents a 3D t-SNE visualization that illustrates the separation between genuine users and malicious attackers within our MVCA framework. The visualization uses a dual-marker scheme, with blue spheres representing authentic users and red tetrahedrons denoting attackers.
Notably, while attackers are distributed among legitimate users, they form distinct clusters with characteristic dispersion patterns, reflecting variations in their attack methodologies. These distinct clusters and dispersion patterns are crucial evidence of the effectiveness of our MVCA framework in identifying different types of attackers. The framework can capture the subtle differences in the high-dimensional feature space, which are not easily discernible through conventional methods.
The substantial overlap observed between certain attacker subgroups and genuine user populations suggests sophisticated behavioral emulation strategies employed by attackers. This overlap highlights cases where attackers successfully mimic normal user behavior patterns, thereby evading conventional detection mechanisms. However, despite this overlap, our MVCA framework is still able to identify these attackers by leveraging the unique dispersion patterns and subtle differences in their feature distributions. This demonstrates the robustness and effectiveness of our method in detecting malicious activities even in the presence of sophisticated attacks.
In summary, Figure 9 provides a clear and intuitive visualization of the separation between genuine users and malicious attackers, showcasing the effectiveness of our MVCA framework in identifying and distinguishing different types of attackers based on their high-dimensional implicit features.
4.4. Optimization of Model Hyperparameters and Setting of Key Parameters
To obtain the best performance from MVCA, we first conducted cross-validation experiments and optimized a subset of the hyperparameters, since only a few of the many hyperparameters are critical. After all the hyperparameters were determined, we trained the model with the selected combination and evaluated it on the two benchmark test sets. The parameter settings of the proposed method used in the following experiments are shown in Table 5.
When attackers carry out co-visitation injection attacks, they repeatedly click on the target items, thereby increasing the exposure of those items. Fake users from the same group often attack the same target items, which induces abnormal connectivity in the user–item interaction graph. To capture the connectivity features of such interaction graphs, we stack multiple graph convolution embedding propagation layers. The number of propagation layers (k) in the GCN is therefore a key hyperparameter of the model. To explore its influence on the experimental results, we fixed the other parameters and set the embedding propagation depth (k) to 1, 2, 3, and 4 in the parameter optimization experiments.
Figure 10 shows the influence of the number of propagation layers (k) on the classification performance. The experiments show that increasing the embedding propagation depth up to three layers substantially improves the model's performance. When k = 3, all indicators reach their best overall values, indicating that third-order embedding propagation effectively captures the interaction information between users and items. With k = 4, however, some evaluation indicators decrease to a certain extent relative to k = 3. The reason is that an overly deep network may introduce noise, leading to overfitting and reduced performance. In general, a three-layer embedding propagation structure is sufficient to capture interaction information and identify abnormal behavioral patterns.
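The role of the propagation depth k can be illustrated with a minimal sketch of embedding propagation on a user–item graph. It is a simplification under stated assumptions: it uses LightGCN-style propagation (symmetric degree normalization, no learned weights or nonlinearities, layer outputs averaged), which is not necessarily the GCN used in MVCA, and the function `propagate`, the toy graph and the one-hot embeddings are hypothetical.

```python
import math

def propagate(edges, user_emb, item_emb, k):
    """Run k rounds of normalized neighbour aggregation on a bipartite user-item graph
    and return the average of the layer-0..k embeddings for users and items."""
    deg_u = {u: 0 for u in user_emb}
    deg_i = {i: 0 for i in item_emb}
    for u, i in edges:
        deg_u[u] += 1
        deg_i[i] += 1
    dim = len(next(iter(user_emb.values())))
    layers_u, layers_i = [user_emb], [item_emb]
    for _ in range(k):
        prev_u, prev_i = layers_u[-1], layers_i[-1]
        new_u = {u: [0.0] * dim for u in user_emb}
        new_i = {i: [0.0] * dim for i in item_emb}
        for u, i in edges:
            w = 1.0 / math.sqrt(deg_u[u] * deg_i[i])  # symmetric normalization
            for d in range(dim):
                new_u[u][d] += w * prev_i[i][d]
                new_i[i][d] += w * prev_u[u][d]
        layers_u.append(new_u)
        layers_i.append(new_i)

    def average(layers, key):
        return [sum(layer[key][d] for layer in layers) / len(layers) for d in range(dim)]

    return ({u: average(layers_u, u) for u in user_emb},
            {i: average(layers_i, i) for i in item_emb})

# Toy chain graph: u0 - i0 - u1 - i1 - u2 - i2 (hypothetical interactions).
edges = [("u0", "i0"), ("u1", "i0"), ("u1", "i1"), ("u2", "i1"), ("u2", "i2")]
item_emb = {"i0": [1.0, 0.0, 0.0], "i1": [0.0, 1.0, 0.0], "i2": [0.0, 0.0, 1.0]}
user_emb = {u: [0.0, 0.0, 0.0] for u in ("u0", "u1", "u2")}

for k in (1, 2, 3, 4):
    users_k, _ = propagate(edges, user_emb, item_emb, k)
    print(k, [round(x, 3) for x in users_k["u0"]])
```

With one-hot item embeddings, each coordinate of a user's output shows how much a given item influences that user. In this toy graph, item i1 is three hops from user u0, so it only begins to influence u0 at k = 3, which mirrors the intuition that third-order propagation captures the shared-target connectivity between co-attacking users.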