1. Introduction
Recommender systems have become a foundational component of modern digital ecosystems, shaping user experiences across e-commerce platforms, streaming services, social networks, and online marketplaces [1]. By leveraging historical user–item interaction data, these systems aim to provide personalized recommendations that enhance engagement, satisfaction, and platform retention. As recommender systems increasingly influence user decisions and market visibility, their robustness and trustworthiness have become critical concerns.
However, the widespread adoption of recommender systems has also exposed them to adversarial manipulation. One of the most extensively studied threats is the presence of malicious or strategic users who attempt to influence recommendation outcomes by injecting biased interactions, commonly referred to as shilling or profile injection attacks [2,3]. Such attacks can distort ranking algorithms, unfairly promote specific items, and degrade the reliability of collaborative filtering models. Early empirical studies demonstrated that even a small fraction of malicious profiles can significantly alter recommendation outputs, particularly in sparse data environments [2,4]. These findings established the vulnerability of collaborative recommender systems and motivated the development of defensive mechanisms.
Traditional countermeasures against shilling attacks have primarily relied on rule-based heuristics or supervised classification approaches that assume the availability of labeled attack data [4]. While these methods can be effective in controlled experimental settings, they face substantial limitations in real-world deployments. Explicit annotations identifying malicious users are rarely available, attack strategies evolve over time, and adversarial behavior often overlaps with legitimate but atypical user activity [3]. These challenges highlight the difficulty of maintaining static defense mechanisms in dynamic recommendation environments.
In response, anomaly detection has emerged as a promising paradigm for recommender system security [5,6]. Rather than modeling predefined attack classes, anomaly detection approaches conceptualize malicious behavior as deviations from typical interaction patterns. Tree-based, kernel-based, and neural network-based unsupervised models have demonstrated the ability to capture irregular rating distributions, abnormal temporal activity, and unusual interaction patterns without requiring labeled training data [5]. Nevertheless, much of the existing literature evaluates detection techniques in isolation or under synthetic attack injection scenarios, limiting understanding of how heterogeneous detection paradigms compare and interact under consistent feature representations.
Despite the availability of numerous anomaly detection techniques, there is a lack of standardized evaluation frameworks that enable consistent comparison across models under realistic constraints, such as the absence of labeled attack data. In practice, recommender systems must operate under label-scarce conditions, where both the nature and prevalence of adversarial behavior are uncertain. This gap motivates the need for systematic and reproducible evaluation methodologies that go beyond isolated model performance and instead assess detection behavior across multiple dimensions.
A further challenge arises from the absence of ground-truth labels. In unsupervised settings, traditional performance metrics such as precision and recall cannot be computed directly. Researchers must instead rely on indirect evaluation indicators, including score distributions, threshold stability, and inter-model agreement [3]. This methodological constraint underscores the importance of comparative and ensemble-based analyses to assess robustness and reduce model-specific biases in label-free environments.
Motivated by these challenges, this study conducts a systematic comparative analysis of unsupervised machine learning and deep learning techniques for anomaly detection in recommender systems. Using the MovieLens 1M dataset as a representative benchmark, we construct a user-level behavioral representation derived from statistical, temporal, and interaction-based features. We evaluate three widely adopted detection paradigms—Isolation Forest, One-Class Support Vector Machine, and an autoencoder-based neural network—and investigate their individual and collective behavior through an ensemble scoring strategy.
The contributions of this work are threefold. First, we provide a unified experimental framework for comparing classical machine learning and deep learning-based unsupervised anomaly detection models under a consistent behavioral feature representation. Second, we introduce a comprehensive label-free evaluation protocol that combines score distribution analysis, percentile-based thresholding, ranking stability, and inter-model agreement. Third, we complement this evaluation with controlled experiments using synthetic attack profiles to systematically assess detection behavior across different adversarial strategies. Rather than proposing a novel detection algorithm, this study aims to provide a rigorous and reproducible comparative analysis that bridges methodological gaps in the evaluation of anomaly detection approaches for recommender systems.
The remainder of this paper is organized as follows. Section 2 reviews related work on recommender system security and anomaly detection. Section 3 describes the materials and methods, including the dataset, preprocessing strategy, feature engineering pipeline, and unsupervised detection models. Section 4 presents the experimental setup, including hyperparameter configuration and label-free evaluation protocols. Section 5 reports the experimental results, while Section 6 discusses their implications. Section 7 concludes the paper and outlines directions for future research.
2. Related Work
Research on security in recommender systems has primarily focused on the detection and mitigation of shilling or profile injection attacks, where malicious users attempt to manipulate recommendation outcomes by injecting biased ratings. Early foundational work by Lam and Riedl [2] demonstrated that collaborative filtering systems are vulnerable even to relatively small numbers of injected profiles. Subsequent studies expanded on attack modeling and defensive strategies, including feature-based classification approaches for identifying malicious users [3,4]. These works established the fundamental taxonomy of attack models and defensive mechanisms in collaborative recommender systems.
Several surveys have systematically analyzed vulnerabilities and countermeasures in recommender systems. Gunes et al. [7] provided a comprehensive overview of shilling attacks and detection strategies, highlighting the limitations of rule-based and heuristic methods. More recent studies have proposed increasingly sophisticated shilling attack strategies, including graph convolution-based generative models designed to bypass traditional detection mechanisms [8]. These developments illustrate the evolving nature of adversarial behavior and the difficulty of maintaining robust detection mechanisms in dynamic environments.
Traditional attack detection methods often rely on supervised learning frameworks, assuming the availability of labeled attack data [4]. However, in real-world deployments, explicit ground-truth labels identifying malicious users are rarely accessible. Moreover, legitimate users may exhibit behavioral patterns similar to attack profiles, further complicating classification. This limitation has motivated the exploration of anomaly detection approaches that treat malicious behavior as deviations from normal user activity rather than predefined classes.
Anomaly detection in recommender systems has been explored using statistical and machine learning methods. Yang et al. [9] proposed detecting abnormal user behavior through feature-based modeling of rating patterns. More broadly, unsupervised anomaly detection techniques have been extensively studied in the data mining literature [10,11], including tree-based methods such as Isolation Forest [12], kernel-based approaches such as One-Class Support Vector Machines [13], and neural network models such as autoencoders [14,15]. While these models have demonstrated effectiveness in general anomaly detection tasks, comparative evaluations within recommender system contexts remain limited.
Recent advances have also considered ensemble-based detection strategies to improve robustness. Ensemble approaches aim to aggregate heterogeneous anomaly signals and mitigate model-specific biases [16]. However, most existing studies either focus on synthetic attack injection scenarios or evaluate a single detection paradigm in isolation. Comparative analyses of classical machine learning and deep learning approaches under a unified behavioral representation remain relatively scarce.
In contrast to prior work that assumes labeled attacks or relies exclusively on synthetic injection experiments, this study treats detection as a purely unsupervised problem and evaluates multiple detection paradigms under a consistent feature extraction framework. Model agreement, threshold stability, and ensemble integration are analyzed without ground-truth labels, and synthetic attack profiles are used only as a complementary check. In this way, the work helps bridge the gap between theoretical attack modeling and practical anomaly detection in real-world recommender system settings.
While ensemble-based anomaly detection approaches have been explored in several machine learning contexts, many existing studies rely on supervised or semi-supervised settings where labeled attack profiles are available. In recommender system environments, however, such labeled data are rarely accessible due to the evolving nature of adversarial strategies and the difficulty of identifying malicious users with certainty.
The framework proposed in this study differs from prior ensemble-based approaches in three important aspects. First, it adopts a fully unsupervised perspective, integrating heterogeneous anomaly detection paradigms—including tree-based, kernel-based, and neural models—without requiring labeled attack data. Second, the proposed approach relies on a unified behavioral representation derived exclusively from rating interactions, combining statistical, temporal, and interaction-based signals that can be computed directly from standard recommender system logs. Third, the evaluation methodology emphasizes label-free validation indicators, including anomaly score distributions, percentile-based threshold stability, and cross-model agreement. Together, these elements provide a practical and interpretable framework for anomaly detection in recommender systems operating under realistic data constraints.
4. Experimental Setup
All experiments were conducted on the MovieLens 1M dataset using a user-level representation derived from explicit rating interactions. Following the preprocessing and filtering steps described in the previous section, each user was represented by a fixed-length feature vector capturing rating statistics, temporal activity patterns, and item popularity characteristics. All detection models were applied to this representation after standardization.
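As a rough illustration of what such a user-level vector might contain, the sketch below derives a few rating, temporal, and popularity features from `(user, item, rating, timestamp)` tuples. The feature names loosely follow those listed in Table 6, but the definitions are assumptions made for illustration (for example, the burst ratio is taken here to be the largest share of a user's ratings falling inside any window of the given length); they are not the definitions used in the study.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev


def burst_ratio(timestamps, window_s):
    """Largest fraction of ratings that fall inside any window of window_s seconds."""
    ts = sorted(timestamps)
    best, left = 0, 0
    for right in range(len(ts)):
        while ts[right] - ts[left] > window_s:
            left += 1
        best = max(best, right - left + 1)
    return best / len(ts)


def rating_entropy(ratings):
    """Shannon entropy (bits) of the user's empirical rating distribution."""
    counts = Counter(ratings)
    n = len(ratings)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def user_features(interactions):
    """Map (user, item, rating, timestamp) tuples to one feature dict per user."""
    item_pop = Counter(item for _, item, _, _ in interactions)
    by_user = defaultdict(list)
    for user, item, rating, ts in interactions:
        by_user[user].append((item, rating, ts))

    features = {}
    for user, rows in by_user.items():
        ratings = [r for _, r, _ in rows]
        stamps = [t for _, _, t in rows]
        pops = [item_pop[i] for i, _, _ in rows]
        features[user] = {
            "num_ratings": len(rows),
            "mean_rating": mean(ratings),
            "std_rating": pstdev(ratings),
            "rating_entropy": rating_entropy(ratings),
            "profile_span_s": max(stamps) - min(stamps),
            "burst_ratio_10min": burst_ratio(stamps, 600),  # hypothetical definition
            "mean_item_pop": mean(pops),
            "min_item_pop": min(pops),
        }
    return features
```

Each resulting dictionary can be flattened into a fixed-length vector by reading the keys in a fixed order.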
Three unsupervised detection models were evaluated: Isolation Forest, One-Class Support Vector Machine, and an autoencoder-based neural network. All models were trained on the same standardized feature set to ensure comparability across methods. Feature standardization was applied using z-score normalization, which is particularly important for distance- and kernel-based models such as One-Class SVM, as well as for neural network optimization [11].
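The z-score step can be sketched in a few lines. The version below is a minimal illustration, not the study's code: it standardizes each column with the population standard deviation and, as an added assumption, leaves constant columns at zero instead of dividing by zero.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column of a row-major feature matrix to zero mean, unit variance."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # Guard against constant columns, which would otherwise divide by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [
        [(x - mu) / sigma for x, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]
```

Without this step, a feature measured in the hundreds (such as a rating count) would dominate the distances inside an RBF kernel over a feature that varies by about one unit (such as a rating standard deviation).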
5. Results
5.6. Detection Performance Across Attack Types
To complement the label-free evaluation presented above, controlled experiments using the synthetic attack profiles described in Section 4.2 were conducted.
These experiments aim to evaluate the ability of the proposed framework to identify anomalous user behavior under different adversarial strategies.
Three commonly studied attack strategies in recommender systems were simulated: random, average, and bandwagon attacks.
In the random attack, filler ratings are assigned randomly across items, producing profiles that deviate strongly from typical user behavior. The average attack generates ratings that approximate the average rating of each item, making malicious profiles resemble normal users more closely. In the bandwagon attack, injected users rate a set of highly popular items in addition to target items in order to camouflage their behavior within the natural popularity structure of the system.
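The three strategies can be sketched as a single profile generator. This is an illustrative construction under stated assumptions, not the generator used in the study: the function name, the profile sizes (`n_filler`, `n_popular`), the uniform distribution for random fillers, and the choice to push the target item with the maximum rating are all hypothetical.

```python
import random


def make_attack_profile(kind, target, item_means, item_pop,
                        n_filler=30, n_popular=10, r_min=1, r_max=5, rng=None):
    """Build one injected push-attack profile as {item: rating}.

    kind is "random", "average", or "bandwagon"; all sizes are illustrative.
    """
    if kind not in {"random", "average", "bandwagon"}:
        raise ValueError(f"unknown attack kind: {kind}")
    rng = rng or random.Random()
    candidates = [i for i in item_means if i != target]
    profile = {}

    if kind == "bandwagon":
        # Selected items: the most popular ones, rated at the maximum.
        popular = sorted(candidates, key=item_pop.get, reverse=True)[:n_popular]
        profile.update({i: r_max for i in popular})
        candidates = [i for i in candidates if i not in profile]

    for item in rng.sample(candidates, min(n_filler, len(candidates))):
        if kind == "average":
            # Filler rating close to the item's own mean, clipped to the scale.
            profile[item] = min(r_max, max(r_min, round(item_means[item])))
        else:
            # Random and bandwagon fillers: uniform over the rating scale (assumed).
            profile[item] = rng.randint(r_min, r_max)

    profile[target] = r_max  # push the target item
    return profile
```

Passing a seeded `random.Random` instance as `rng` makes the injected profiles reproducible across runs.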
For each attack strategy, a one-vs-rest evaluation was performed in which injected users were treated as the positive class and all remaining users were treated as negatives. Detection performance was evaluated using ROC-AUC and PR-AUC metrics.
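Both metrics can be computed directly from binary labels (1 for injected users, 0 for the rest) and anomaly scores. The sketch below is a minimal pure-Python version, assuming that higher scores mean more anomalous and that at least one user of each class is present; PR-AUC is approximated by average precision, and ties between scores are not treated specially in that function.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """PR-AUC estimated as the mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits
```

Because injected users are a small minority, average precision is usually the more demanding of the two numbers: a ROC-AUC near 0.9 can still correspond to many genuine users ranked above the attackers.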
Table 5 summarizes the ROC-AUC scores obtained for each attack type across the evaluated detection models. The results reveal that detection difficulty varies substantially depending on the attack strategy.
Random attacks are the easiest to detect, with Isolation Forest achieving the highest ROC-AUC score (0.936). This behavior is expected since random attacks produce rating patterns that strongly deviate from the behavioral distributions observed in genuine users.
Bandwagon attacks exhibit intermediate detection difficulty. Although attackers attempt to camouflage their profiles by interacting with popular items, deviations in temporal activity patterns and interaction statistics still allow anomaly detection models to identify suspicious users. In this scenario, the One-Class SVM achieves the highest ROC-AUC score (0.903).
In contrast, average attacks are the most challenging to detect. These attacks assign each filler item a rating close to that item's average, making injected profiles more similar to genuine users. Nevertheless, the behavioral features used in the proposed framework still enable meaningful discrimination for most models, with Isolation Forest achieving the highest ROC-AUC score (0.876), although the autoencoder falls below chance level on this attack (0.452).
These results highlight that the detectability of malicious behavior depends not only on the anomaly detection algorithm but also on the behavioral characteristics of the attack strategy. Attacks that produce strong deviations from normal interaction patterns, such as random attacks, are detected more easily. In contrast, attacks designed to mimic the statistical properties of genuine user behavior, such as average attacks, present greater detection difficulty.
Overall, the results demonstrate that the proposed behavioral feature framework is capable of identifying anomalous interaction patterns across multiple adversarial strategies. This controlled evaluation complements the label-free analyses presented earlier and provides additional empirical evidence supporting the robustness of the proposed anomaly detection approach.
It is important to note that the experimental setup assumes a fixed proportion of injected attack profiles relative to genuine users. While this controlled ratio facilitates consistent comparison across attack types, detection performance may vary under different adversarial densities. In particular, higher attack injection rates may increase detectability due to stronger deviations in global behavioral distributions, whereas lower ratios may result in more subtle anomalies that are harder to distinguish from natural user variability. Nevertheless, the relative performance trends observed across models are expected to remain consistent, as they primarily reflect the structural characteristics of each attack strategy rather than the absolute proportion of adversarial profiles.
Author Contributions
Conceptualization, R.B.; methodology, R.B. and M.O.; software, R.H. and M.A.-A.; validation, R.B., M.O., R.H. and M.A.-A.; formal analysis, R.B.; investigation, R.B., R.H. and M.A.-A.; data curation, R.B.; writing—original draft preparation, R.B.; writing—review and editing, R.B., R.H. and M.A.-A. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Acknowledgments
During the preparation of this manuscript, the authors used ChatGPT (OpenAI, GPT-5.3) for the purposes of language refinement, including improving clarity, grammar, and readability. The authors have reviewed and edited the output and take full responsibility for the content of this publication.
Conflicts of Interest
The authors declare no conflicts of interest.
References
1. Chalco, C.I.; Chasi, R.B.; Ortiz, R.H. Hierarchical Clustering for Collaborative Filtering Recommender Systems. In Proceedings of the Advances in Artificial Intelligence, Software and Systems Engineering; Ahram, T.Z., Ed.; Springer International Publishing: Cham, Switzerland, 2019; pp. 346–356.
2. Lam, S.K.; Riedl, J. Shilling recommender systems for fun and profit. In Proceedings of the 13th International Conference on World Wide Web, New York, NY, USA, 17–20 May 2004; pp. 393–402.
3. Mobasher, B.; Burke, R.; Bhaumik, R.; Williams, C. Toward trustworthy recommender systems: An analysis of attack models and algorithm robustness. ACM Trans. Internet Technol. 2007, 7, 23-es.
4. Burke, R.; Mobasher, B.; Williams, C.; Bhaumik, R. Classification features for attack detection in collaborative recommender systems. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 542–547.
5. Zhang, K.; Cao, Q.; Sun, F.; Wu, Y.; Tao, S.; Shen, H.; Cheng, X. Robust Recommender System: A Survey and Future Directions. ACM Comput. Surv. 2025, 58, 1–38.
6. Rahmatikargar, B.; Zadeh, P.M.; Kobti, Z. Enhancing Recommender Systems with Anomaly Detection: A Graph Neural Network Approach. In Proceedings of the Complex Networks & Their Applications XIII; Cherifi, H., Donduran, M., Rocha, L.M., Cherifi, C., Varol, O., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 16–28.
7. Gunes, I.I.; Kaleli, C.; Bilge, A.; Polat, H. Shilling Attacks against Recommender Systems: A Comprehensive Survey. Artif. Intell. Rev. 2014, 42, 767–799.
8. Wu, F.; Gao, M.; Yu, J.; Wang, Z.; Liu, K.; Wang, X. Ready for emerging threats to recommender systems? A graph convolution-based generative shilling attack. Inf. Sci. 2021, 578, 683–701.
9. Yang, Z.; Xie, Y.; Zeng, Y.; Zhang, Z.; Yang, J. Detecting abnormal profiles in collaborative filtering recommender systems. J. Intell. Inf. Syst. 2016, 47, 211–234.
10. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. 2009, 41, 1–58.
11. Aggarwal, C.C. An Introduction to Outlier Analysis; Springer International Publishing: Cham, Switzerland, 2017; pp. 1–34.
12. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422.
13. Schölkopf, B.; Platt, J.C.; Shawe-Taylor, J.; Smola, A.J.; Williamson, R.C. Estimating the Support of a High-Dimensional Distribution. Neural Comput. 2001, 13, 1443–1471.
14. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
15. Chalapathy, R.; Chawla, S. Deep Learning for Anomaly Detection: A Survey. arXiv 2019, arXiv:1901.03407.
16. Campos, G.O.; Zimek, A.; Sander, J.; Campello, R.J.G.B.; Micenková, B.; Schubert, E.; Assent, I.; Houle, M.E. On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Min. Knowl. Discov. 2016, 30, 891–927.
17. Harper, F.M.; Konstan, J.A. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 2015, 5, 1–19.
18. Koren, Y.; Bell, R.; Volinsky, C. Matrix Factorization Techniques for Recommender Systems. Computer 2009, 42, 30–37.
19. Ricci, F.; Rokach, L.; Shapira, B. Recommender Systems Handbook; Springer: New York, NY, USA, 2015.
20. Aggarwal, C.C. Recommender Systems: The Textbook; Computer Science; Springer International Publishing: Cham, Switzerland, 2016.
Figure 1. Distribution of ensemble anomaly scores across users in the MovieLens 1M dataset. The heavy-tailed distribution indicates that the majority of users exhibit typical behavioral patterns with low anomaly scores, while a small subset of users forms an upper tail with substantially higher scores. This separation suggests that the proposed framework is able to identify potentially deviant user profiles without relying on labeled attack data.
Figure 2. Ensemble anomaly score distribution with percentile-based thresholds corresponding to different alerting rates. The smooth progression of threshold values indicates stable ranking behavior across operating points, supporting the use of percentile-based thresholds as a practical mechanism for anomaly monitoring in label-free settings. Vertical lines indicate the corresponding threshold values used to flag anomalous users.
Figure 3. Relationship between extreme rating behavior and ensemble anomaly scores. Users with higher anomaly scores tend to exhibit a larger proportion of extreme ratings (e.g., minimum or maximum values), suggesting that unusual rating intensity may contribute to the identification of anomalous behavioral patterns.
Figure 4. Relationship between temporal burstiness and ensemble anomaly scores. Profiles with highly concentrated rating activity within short time intervals tend to receive higher anomaly scores, indicating that temporal burst patterns are a relevant behavioral signal for detecting suspicious user activity.
Figure 5. Top behavioral features ranked by Gini importance using a supervised proxy model trained on synthetic attack labels. Interaction intensity, temporal activity patterns, and item popularity statistics emerge as the most influential signals for distinguishing anomalous user profiles.
Figure 6. Top behavioral features ranked by Spearman correlation with the ensemble anomaly score. Temporal burstiness and interaction frequency features show the strongest relationships with anomalous behavior.
Figure 7. Distribution of anomaly scores produced by mean and max ensemble aggregation strategies on the original dataset. Mean aggregation produces a smoother score distribution and a more stable anomaly ranking compared to max aggregation.
Figure 8. Comparison of detection performance across models using synthetic attack experiments. Ensemble aggregation achieves strong and stable performance across different detection paradigms.
Table 1. Hyperparameter settings for the unsupervised detection models.
| Model | Hyperparameter | Value |
|---|---|---|
| Isolation Forest | Number of trees | 300 |
| | Subsample size | auto |
| | Contamination rate | 0.02 |
| | Parallel jobs | −1 |
| | Random seed | 42 |
| One-Class SVM | Kernel | RBF |
| | ν (upper bound on anomalies) | 0.02 |
| | γ | scale |
| | Shrinking | True |
| Autoencoder | Encoder layers | [input dimension, 64, 16] |
| | Decoder layers | [16, 64, input dimension] |
| | Activation function | ReLU |
| | Loss function | MSE |
| | Optimizer | Adam |
| | Learning rate | |
| | Training epochs/batch size | 25/256 |
Table 2. Percentile-based anomaly score thresholds using the ensemble model.
| Top Percentile (%) | Threshold Value | Flagged Users |
|---|---|---|
| 1 | 0.4617 | 61 |
| 2 | 0.4149 | 121 |
| 5 | 0.3477 | 302 |
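The flagged-user counts in Table 2 are consistent with rounding up p% of 6040 users, the size of the full MovieLens 1M user set (60.4 → 61, 120.8 → 121, 302). Both that user count and the rounding rule are inferred from the table, not stated by the study. Under that assumption, percentile-based flagging can be sketched as follows, with `flag_top_percent` a hypothetical helper name.

```python
import math


def flag_top_percent(scores, percent):
    """Return (threshold, flagged_ids) for the top `percent` of users by score.

    `scores` maps user id to anomaly score; higher means more anomalous.
    Assumes percent > 0 so that at least one user is flagged.
    """
    k = math.ceil(len(scores) * percent / 100)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    flagged = [user for user, _ in ranked[:k]]
    threshold = ranked[k - 1][1]  # lowest score that is still flagged
    return threshold, flagged
```

Because the rule is rank-based, the number of alerts is fixed by the chosen percentile and only the threshold value moves with the score distribution.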
Table 3. Agreement between unsupervised detection models measured by Jaccard similarity on the top-K ranked users.
| Model Pair | Jaccard Similarity |
|---|---|
| Isolation Forest vs. One-Class SVM | 0.282 |
| Isolation Forest vs. Autoencoder | 0.163 |
| One-Class SVM vs. Autoencoder | 0.282 |
| Ensemble vs. Isolation Forest | 0.471 |
| Ensemble vs. One-Class SVM | 0.449 |
| Ensemble vs. Autoencoder | 0.408 |
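The agreement measure in Table 3 can be sketched as the Jaccard similarity between two models' top-K user sets. The helper names are hypothetical, the value of K is left as a parameter because it is not stated alongside the table, and scores are assumed to be oriented so that higher means more anomalous.

```python
def top_k(scores, k):
    """IDs of the k users with the highest anomaly scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def jaccard_at_k(scores_a, scores_b, k):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two models' top-k sets."""
    a, b = top_k(scores_a, k), top_k(scores_b, k)
    return len(a & b) / len(a | b)
```

For two top-K sets of equal size that share m users, the value is m / (2K − m), so a Jaccard similarity of 1/3 already means that half of each list is shared.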
Table 4. Spearman rank correlation between anomaly score rankings across detection models.
| Model Pair | Spearman Correlation |
|---|---|
| Isolation Forest vs. Ensemble Mean | 0.863 |
| One-Class SVM vs. Ensemble Mean | 0.778 |
| Autoencoder vs. Ensemble Mean | 0.662 |
| Isolation Forest vs. One-Class SVM | 0.726 |
| Isolation Forest vs. Autoencoder | 0.601 |
| One-Class SVM vs. Autoencoder | 0.589 |
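Whereas Jaccard similarity looks only at the top of each ranking, the Spearman correlation in Table 4 compares the full rankings. A minimal sketch is given below, assuming two score lists in the same user order; it assigns average ranks to ties and then takes the Pearson correlation of the ranks, and it is undefined when either list is constant.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```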
Table 5. ROC-AUC performance across different attack types.
| Model | Random | Bandwagon | Average |
|---|---|---|---|
| Isolation Forest | 0.936 | 0.753 | 0.876 |
| Ensemble Mean | 0.927 | 0.863 | 0.860 |
| Ensemble Weighted (Oracle) | 0.929 | 0.860 | 0.864 |
| Ensemble Max | 0.922 | 0.842 | 0.844 |
| One-Class SVM | 0.877 | 0.903 | 0.794 |
| Autoencoder | 0.658 | 0.859 | 0.452 |
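The ensemble rows in Table 5 combine the three base models' scores per user. One plausible way to do this is sketched below; it is an assumption-laden illustration, not the study's implementation. Each model's scores are min-max rescaled to [0, 1] (the normalization actually used is not stated here), all scores are assumed to be oriented so that higher means more anomalous, and the optional `weights` argument stands in for the weighted variant.

```python
def min_max(scores):
    """Rescale one model's raw scores to [0, 1] so models are comparable."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # constant scores map to 0 instead of dividing by zero
    return [(s - lo) / span for s in scores]


def ensemble(score_lists, how="mean", weights=None):
    """Combine per-model score lists (one score per user, same user order)."""
    normed = [min_max(s) for s in score_lists]
    per_user = list(zip(*normed))
    if how == "max":
        return [max(u) for u in per_user]
    if how == "mean":
        weights = weights or [1.0] * len(normed)
        total = sum(weights)
        return [sum(w * v for w, v in zip(weights, u)) / total for u in per_user]
    raise ValueError(f"unknown aggregation: {how}")
```

Max aggregation flags a user as soon as any single model is alarmed, while mean aggregation requires broader agreement, which is one reading of the smoother score distribution reported for the mean ensemble in Figure 7.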
Table 6. Top behavioral features most strongly correlated with the ensemble anomaly score (Spearman correlation).
| Feature | Spearman Correlation |
|---|---|
| burst_ratio_10 min | −0.262 |
| delta_mean_s | 0.192 |
| ratings_per_day | −0.189 |
| mean_item_pop | −0.181 |
| min_item_pop | −0.172 |
| profile_span_s | 0.154 |
| std_rating | −0.146 |
| rating_entropy | −0.139 |
| num_ratings | −0.133 |
| burst_ratio_1 h | −0.127 |