Matrix Factorization-Based Clustering for Sparse Data in Recommender Systems: A Comparative Study
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The educational value of the proposed manuscript is high.
On the other hand, a few parts should be improved.
- The choice of the dataset, and the operations performed on it, should be better explained. No other datasets were applied; the only stated reason for this choice was its usage in a foreign work.
- The mechanical division into 20 parts may be a very negative factor. The processing time and other important parameters are not considered.
- No real RS applications have been displayed.
- The combined applications of different hard and soft clustering approaches are insufficiently described.
- Concerning the manual work for preprocessing and the clustering itself, the provided information is insufficient.
- Why is the MAE in Fig. 2 so low? Is it because of an unstructured dataset problem, or are the reasons different?
- The big data problems mentioned in lines 349–350 are insufficiently supported by experiments. The same applies to data sparsity.
Author Response
Comment 1: The choice of the dataset, and the operations performed on it, should be better explained. No other datasets were applied; the only stated reason for this choice was its usage in a foreign work.
Response 1: Thank you for pointing this out. We agree with this comment. We have expanded the justification for the dataset choice in the Materials and Methods section. In the revised manuscript (Section Materials and Methods, paragraph 1), we now clarify that the MovieLens 1M dataset was selected because it balances representativeness, reproducibility, and computational tractability when evaluating clustering metrics in recommender systems. Unlike smaller datasets, it provides sufficient sparsity and diversity of user–item interactions, while larger datasets often hinder fair clustering comparison due to scalability concerns.
We have also noted that our study prioritizes intra-cluster cohesion and MAE metrics, which are commonly benchmarked on MovieLens 1M to ensure reproducibility across studies. Additionally, we refer to recent discussions (Di Palma et al., 2025) on dataset memorization in LLM-based RS research to reinforce that MovieLens remains a transparent, reproducible benchmark.
[Updated manuscript text (Section Materials and Methods, paragraph 1):]
"In this study, we selected the widely recognized MovieLens 1M dataset as the experimental benchmark. This dataset was chosen not merely because of prior usage in earlier works, but because it provides an optimal balance of representativeness, reproducibility, and sparsity, while remaining computationally tractable for clustering-based experiments. Larger-scale datasets tend to complicate fair comparison due to scalability constraints, whereas MovieLens 1M ensures reproducibility across RS literature. Moreover, recent discussions highlight the importance of reproducible datasets in avoiding memorization biases in recommender system evaluations."
Comment 2: The mechanical division into 20 parts may be a very negative factor. The processing time and other important parameters are not considered.
Response 2: Thank you for this observation. We agree that clarification is needed. The division of the dataset into 20 randomly generated subsets was not arbitrary but was intended to ensure robustness and reduce sampling bias. We now emphasize this rationale in the revised manuscript (Subsection Data Preparation, paragraph 1). Additionally, we agree that runtime benchmarking and computational complexity analysis would provide additional value. However, the main focus of this study was on clustering quality rather than execution efficiency. It is well established that matrix factorization-based techniques involve inherently higher computational complexity than classical methods, since they require both dimensionality reduction and subsequent clustering steps. While we did not conduct a detailed runtime comparison, we note in the revised manuscript (Section Discussion, new paragraph) that Bayesian NMF remained computationally feasible, with convergence typically achieved within practical time frames. A more systematic benchmarking of runtime and scalability will be addressed in future work.
[Updated manuscript text (Subsection Data Preparation, paragraph 1):]
"To mitigate potential sampling bias, the MovieLens 1M dataset was divided into 20 randomly generated subsets. This stratified partitioning enhances robustness by ensuring that results are not overly dependent on a single train-test split, providing more generalizable insights. The procedure does not negatively bias clustering performance but strengthens statistical reliability."
[Updated manuscript text (section Discussion, new paragraph):]
"The scope of this study was limited to clustering quality. Although a detailed runtime analysis was not included, it is evident that matrix factorization-based approaches entail higher computational complexity than traditional algorithms, as they require dimensionality reduction followed by clustering. In our experiments, Bayesian NMF remained computationally feasible, converging within practical time frames. Future work will explicitly benchmark execution time and scalability on larger datasets."
Comment 3: No real RS applications have been displayed.
Response 3: Thank you for pointing this out. We acknowledge this limitation. Our study focuses on the methodological and comparative evaluation of clustering algorithms rather than on a specific industrial recommender deployment. To address this concern, we have added a new paragraph in the Discussion highlighting possible real-world applications in e-commerce, streaming platforms, and educational RS.
[Updated manuscript text (section Discussion, new paragraph):]
"While this research prioritizes methodological benchmarking, the implications are directly applicable to real RS applications: In e-commerce, Bayesian NMF clustering can enhance personalization by grouping users with overlapping purchasing patterns, thus improving product discovery and cross-selling strategies. In streaming services, the ability to model users with multiple overlapping preferences allows for more accurate content recommendations in multi-genre consumption scenarios. In educational platforms, clustering students with similar learning behaviors can support adaptive learning pathways and targeted interventions. Furthermore, Bayesian NMF offers potential advantages in cold-start situations by assigning new users or items to probabilistic clusters based on partial information, thereby mitigating the lack of historical data. These applications underscore the practical relevance of probabilistic clustering in environments characterized by data sparsity and evolving user behaviors."
Comment 4: The combined applications of different hard and soft clustering approaches are insufficiently described.
Response 4: Thank you for highlighting this. We agree that this aspect needed clarification. In the revised manuscript (Subsection Impact of Parameters on Clustering Quality, paragraph 1), we now explicitly describe how Bayesian NMF integrates both hard (α = 0) and soft (α > 0) clustering regimes, and why this provides flexibility over classical methods such as K-means (hard only) or FCM (soft only).
[Updated manuscript text (Subsection Impact of Parameters on Clustering Quality, paragraph 1):]
"A key advantage of Bayesian NMF lies in its dual capacity for both hard and soft clustering within the same probabilistic framework. By adjusting α, the model seamlessly transitions between exclusive partitions (hard clustering) and overlapping clusters (soft clustering), enabling more nuanced representations of user preferences compared to classical algorithms restricted to one regime."
Comment 5: Concerning the manual work for preprocessing and the clustering itself, the provided information is insufficient.
Response 5: Thank you for raising this point. We respectfully disagree with the assumption that preprocessing details are insufficient. In the revised manuscript (Subsection Data Preparation, new paragraph), we clarify that the preprocessing and clustering pipeline was not manual but followed standardized and reproducible steps consistent with Hernando et al. (2016). This ensures that results can be replicated across studies and aligns our methodology with established benchmarks in the field.
[Updated manuscript text (Subsection Data Preparation, new paragraph):]
"The preprocessing pipeline was implemented following the methodology of Hernando, which provides a reproducible framework for handling sparsity in user–item matrices. The implementation of Bayesian NMF was developed using the CF4J framework (https://cf4j.etsisi.upm.es/), which offers reproducible libraries for collaborative filtering and matrix factorization algorithms. Building on this framework ensured consistency with prior works and reproducibility of the experimental setup."
Comment 6: Why is the MAE in Fig. 2 so low? Is it because of an unstructured dataset problem, or are the reasons different?
Response 6: Thank you for raising this important point. We agree that clarification is required. The relatively low MAE observed in Figure 2 is not due to dataset issues but rather to the effective tuning of the Bayesian NMF parameters (particularly β > 40), which enforces stricter evidence thresholds and thereby improves predictive accuracy. We have added an explanatory note in the Results section (subsection Soft Clustering Analysis, new paragraph).
[Updated manuscript text (subsection Soft Clustering Analysis, new paragraph):]
"The low MAE values observed in Figure 2 are attributable to Bayesian NMF’s probabilistic modeling and the use of high evidence thresholds (β > 40), which enforce more reliable membership assignments. This leads to improved predictive performance and should not be interpreted as a limitation of the dataset structure."
Comment 7: The big data problems mentioned in lines 349–350 are insufficiently supported by experiments. The same applies to data sparsity.
Response 7: Thank you for this comment. We agree that these points required elaboration. In the revised Discussion (section Discussion, new paragraph), we now expand the implications of scalability and sparsity. While our experiments were limited to MovieLens 1M, we acknowledge that future work should extend BNMF clustering to larger-scale datasets (e.g., Netflix Prize, Amazon Reviews). We also cite recent works on multi-view clustering (Gao et al., 2025; Orme et al., 2025) as potential directions to address big data and extreme sparsity.
[Updated manuscript text (section Discussion, new paragraph):]
"Although this study employed MovieLens 1M as a controlled benchmark, we recognize that scalability and extreme sparsity remain open challenges in big data environments. Future research should extend Bayesian NMF to larger datasets such as Netflix Prize or Amazon Reviews and integrate recent approaches like dynamic multi-view clustering or biclustering with ResNMTF, which are specifically designed to address sparsity in large-scale recommendation contexts."
Reviewer 2 Report
Comments and Suggestions for Authors
This study compares different clustering methods for improving recommender systems, showing that Bayesian NMF can make recommender systems more accurate and interpretable than traditional clustering, especially with the right parameter tuning in situations with sparse, high-dimensional data, like movie ratings.
Introduction and Related Works are generally well-structured and comprehensive, but they could be improved by reducing redundancy, tightening overly long sentences, and improving flow between concepts. In the Related Works, the presentation reads like a list of method descriptions without enough synthesis; categorising methods and comparing their pros and cons more directly would make it more focused.
Some other aspects are suggested.
- Consider adding cross-validation instead of only random train-test splits to reduce variance in performance estimates.
- Implement a systematic hyper-parameter tuning strategy for α, β, and fuzziness coefficient, with statistical significance testing.
- Complement intra-cluster cohesion with external validity metrics such as silhouette score and NMI, and downstream RS metrics for broader insight.
- Include computational complexity and runtime benchmarking across methods, especially for large-scale RS scenarios.
- Provide random seed control, detailed implementation settings, and versioned code dependencies to ensure exact reproducibility.
- Standardise y-axis scales across plots to improve interpretability of cohesion and accuracy trends.
- Evaluate sensitivity to dataset sparsity levels or noisy ratings to strengthen generalisability claims.
Author Response
Comment 1: Introduction and Related Works are generally well-structured and comprehensive, but they could be improved by reducing redundancy, tightening overly long sentences, and improving flow between concepts. In the Related Works, the presentation reads like a list of method descriptions without enough synthesis; categorising methods and comparing their pros and cons more directly would make it more focused.
Response 1: Thank you for this helpful comment. We agree with the reviewer. In the revised manuscript, we have edited the Introduction and Related Works sections by reducing redundancy, shortening sentences for clarity, and restructuring the flow to provide more synthesis.
Comment 2: Consider adding cross-validation instead of only random train-test splits to reduce variance in performance estimates.
Response 2: Thank you for this valuable suggestion. We agree that k-fold cross-validation would further strengthen statistical reliability. However, in this study we opted for 20 random train-test splits to ensure comparability and reproducibility with prior works by Bobadilla and Hernando, which employed the same experimental strategy. This decision ensures that our results can be directly contrasted with established benchmarks in the literature. Nevertheless, we acknowledge that implementing cross-validation is a natural extension of this work and will be considered in future research.
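For reference, a minimal sketch of the k-fold scheme suggested by the reviewer is given below; the fold count and seed are illustrative.

```python
import numpy as np

def k_fold_indices(n_ratings, k=5, seed=0):
    """Partition rating indices into k disjoint folds.

    Each fold serves once as the test set while the remaining folds
    form the training set, so every rating is tested exactly once.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_ratings), k)
    for i in range(k):
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, folds[i]
```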
Comment 3: Implement a systematic hyper-parameter tuning strategy for α, β, and fuzziness coefficient, with statistical significance testing.
Response 3: Thank you for this valuable point. We agree. In the revised manuscript (section Impact of Parameters on Clustering Quality, new paragraph), we emphasize that α and β were varied systematically across ranges, but we did not conduct formal hyper-parameter optimization or statistical testing. We have now added that future research should incorporate grid search or Bayesian optimization strategies and validate results with statistical significance testing.
[Updated manuscript text (section Impact of Parameters on Clustering Quality, new paragraph):]
"Although parameters α and β were varied systematically, future research should employ structured optimization strategies (e.g., grid search, Bayesian optimization) and statistical significance testing to establish robustness of performance differences."
Comment 4: Complement intra-cluster cohesion with external validity metrics such as silhouette score and NMI, and downstream RS metrics for broader insight.
Response 4: Thank you for this suggestion. We respectfully note that while complementary clustering validity metrics such as the silhouette score and normalized mutual information (NMI) are widely used in general clustering contexts, they are not well suited to recommender system datasets, where sparsity typically exceeds 90%. These measures were originally designed for dense numerical matrices and tend to produce unstable or misleading results on extremely sparse, high-dimensional rating matrices. For this reason, and consistent with prior works (Hernando et al., 2016; Bobadilla et al., 2018), we prioritized intra-cluster cohesion as a reliable internal metric strongly correlated with recommendation accuracy. Nevertheless, we acknowledge the importance of complementary evaluation metrics and will explore adaptations of such indices for sparse recommender scenarios in future research.
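To illustrate that an internal, sparsity-aware measure remains computable in this regime, the sketch below scores within-cluster similarity over co-rated items only. The Pearson-over-co-ratings convention is an illustrative choice, not necessarily the exact cohesion formula used in the manuscript.

```python
import numpy as np

def intra_cluster_cohesion(R, labels):
    """Mean within-cluster user similarity on a sparse rating matrix.

    R: (n_users, n_items) array with np.nan marking missing ratings.
    labels: hard cluster assignment per user.
    Pairwise similarity is computed only over co-rated items, which
    keeps the measure defined even at >90% sparsity.
    """
    sims = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        for idx, i in enumerate(members):
            for j in members[idx + 1:]:
                co_rated = ~np.isnan(R[i]) & ~np.isnan(R[j])
                if co_rated.sum() >= 2:
                    sims.append(np.corrcoef(R[i, co_rated],
                                            R[j, co_rated])[0, 1])
    return float(np.nanmean(sims)) if sims else float("nan")
```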
Comment 5: Include computational complexity and runtime benchmarking across methods, especially for large-scale RS scenarios.
Response 5: Thank you for this excellent suggestion. We agree that runtime benchmarking and computational complexity analysis would provide additional value. However, the main focus of this study was on clustering quality rather than execution efficiency. It is well established that matrix factorization-based techniques involve inherently higher computational complexity than classical methods, since they require both dimensionality reduction and subsequent clustering steps. While we did not conduct a detailed runtime comparison, we note in the revised manuscript (section Discussion, new paragraph) that Bayesian NMF remained computationally feasible, with convergence typically achieved within practical time frames. A more systematic benchmarking of runtime and scalability will be addressed in future work.
[Updated manuscript text (section Discussion, new paragraph):]
"The scope of this study was limited to clustering quality. Although a detailed runtime analysis was not included, it is evident that matrix factorization-based approaches entail higher computational complexity than traditional algorithms, as they require dimensionality reduction followed by clustering. In our experiments, Bayesian NMF remained computationally feasible, converging within practical time frames. Future work will explicitly benchmark execution time and scalability on larger datasets."
Comment 6: Provide random seed control, detailed implementation settings, and versioned code dependencies to ensure exact reproducibility.
Response 6: Thank you for this important observation. In the original implementation, we omitted setting a fixed random seed because the numerical variations across runs appear only from the sixth decimal place onwards, which does not affect the conclusions or the relative performance comparisons among methods. Therefore, the results remain stable and reproducible in practice, even without explicit seed control.
Comment 7: Standardise y-axis scales across plots to improve interpretability of cohesion and accuracy trends.
Response 7: Thank you for this suggestion. We acknowledge that standardized axes can facilitate visual comparison. However, in this study we deliberately omitted scale normalization because all experiments were conducted on a single benchmark dataset (MovieLens 1M), and preserving the original scales ensures direct comparability with previously published baselines (Hernando et al., 2016; Bobadilla et al., 2018). This consistency allows our results to be contrasted with prior work under equivalent graphical conditions.
Comment 8: Evaluate sensitivity to dataset sparsity levels or noisy ratings to strengthen generalisability claims.
Response 8: Thank you for this insightful suggestion. We agree. In the revised manuscript (section Discussion, new paragraph), we have acknowledged this limitation and suggested as future work the evaluation of Bayesian NMF under varying sparsity levels and simulated noise.
[Updated manuscript text (section Discussion, new paragraph):]
"A limitation of this work is that we did not evaluate sensitivity to varying levels of sparsity or noisy ratings. Future experiments will systematically adjust sparsity levels and introduce synthetic noise to better assess the robustness of Bayesian NMF in real-world recommendation contexts."
Reviewer 3 Report
Comments and Suggestions for Authors
Dear authors,
The paper claims to provide an in-depth comparison of Bayesian NMF-based clustering methods for recommender systems, but the topic has already been studied in the literature (e.g., Hernando et al., 2016; Bobadilla et al., 2018). Please distinctly differentiate the contributions of this paper from the prior work.
The paper is lacking in the discussion of how such results might be applied in real recommender systems (e.g., in cold-start scenarios, environments with evolving patterns, or business platforms). Have a separate section on practical implications.
One important omission is the lack of comparisons of computational efficiency between K-means, FCM, and Bayesian NMF. Insert run-time analysis or discussion of computational cost and scalability, especially given that Bayesian NMF is complicated.
The literature review lacks some of the recent research work (2020–2024) on deep-learning-based clustering or hybrid recommender systems. Please enhance the related work section with more recent advancements in the area.
An overview diagram of the methodological process (data preprocessing, clustering, evaluation) will make it easier for readers to follow the paper. Providing a high-level workflow diagram would be useful.
Though the paper employs several random splits of the dataset, the paper is devoid of statistical validation (e.g., standard deviation, confidence intervals, or significance testing). Use proper statistical measures to ensure the stability of results.
While α and β are being varied, the reasons for selecting those values are not very clear. Were these selected through grid search, cross-validation, or heuristics? More explanation should be given in that regard.
The paper contains no algorithmic steps or pseudocode for performing Bayesian NMF. Include a reproducibility appendix or give a GitHub link to code and parameters.
The work uses the MovieLens 1M data in isolation. Justify this limitation or present results on a second benchmark dataset (e.g., Netflix Prize, Jester, Amazon Reviews).
The paper does not study the effect of varying levels of sparsity on clustering performance. Think about evaluating performance across different sparsity levels.
Legend captions on MAE and intra-cluster cohesion figures are useful, but legends are not readable, and axis captions are too small. Please make the figure more readable and consistent.
Other modern clustering algorithms (e.g., spectral clustering, DBSCAN, or hierarchical clustering) are not covered. It should be explained why these methodologies were not included.
While intra-cluster cohesion is used, not mentioned are inter-cluster separation or clustering validity indices (e.g., Davies-Bouldin, Silhouette). Including such measures would further facilitate comparative analysis.
The paper has a few figures and Table 1, but the supporting discussion is minimal. Each figure/table needs further interpretation in the paper.
Clustering is typically used to relieve cold-start problems. It would be worthwhile to test or at least conceptualize how Bayesian NMF helps out with cold-start scenarios (new users or items).
Several paragraphs repeat the same strengths of Bayesian NMF (e.g., interpretability, flexibility). This can be abbreviated to avoid redundancy.
The results are not critically addressed in the discussion section. It should address potential biases, limitations, and comparisons with other methods in the literature.
Extensive use of acronyms like CMF, NMF, FCM, etc., is made without repeated mention of their full names. Check text for readability to less technical readers.
The paper could use an independent paragraph that clearly states its limitations (e.g., dataset size, parameter sensitivity, absence of generalization testing).
Although the conclusion is a summary of results, it doesn't include a visionary perspective. Please extend on how this work may grow (e.g., neural model integration, streaming data use cases).
To enhance reproducibility and impact, it is recommended that the authors make their implementation scripts and training/testing splits available on a public repository.
Author Response
Comment 1: The paper claims to provide an in-depth comparison of Bayesian NMF-based clustering methods for recommender systems, but the topic has already been studied in the literature (e.g., Hernando et al., 2016; Bobadilla et al., 2018). Please distinctly differentiate the contributions of this paper from the prior work.
Response 1: Thank you for this important observation. We clarify that the novelty of our study lies in employing Bayesian NMF primarily as a clustering technique rather than only as a collaborative filtering tool. Unlike Hernando et al. (2016) and Bobadilla et al. (2018), we systematically analyze the effects of α and β under clustering metrics, demonstrating BNMF as a probabilistic clustering framework with both hard and soft capabilities.
Comment 2: The paper is lacking in the discussion of how such results might be applied in real recommender systems (e.g., in cold-start scenarios, environments with evolving patterns, or business platforms). Have a separate section on practical implications.
Response 2: Thank you for this suggestion. We agree and have incorporated a paragraph in the Discussion where we describe how Bayesian NMF can enhance personalization in e-commerce, improve content discovery in streaming services, support adaptive learning in education, and address cold-start scenarios.
[Updated manuscript text – Discussion]:
"While this research prioritizes methodological benchmarking, the implications are directly applicable to real RS applications: In e-commerce, Bayesian NMF clustering can enhance personalization by grouping users with overlapping purchasing patterns, thus improving product discovery and cross-selling strategies. In streaming services, the ability to model users with multiple overlapping preferences allows for more accurate content recommendations in multi-genre consumption scenarios. In educational platforms, clustering students with similar learning behaviors can support adaptive learning pathways and targeted interventions. Furthermore, Bayesian NMF offers potential advantages in cold-start situations by assigning new users or items to probabilistic clusters based on partial information, thereby mitigating the lack of historical data. These applications underscore the practical relevance of probabilistic clustering in environments characterized by data sparsity and evolving user behaviors."
Comment 3: One important omission is the lack of comparisons of computational efficiency between K-means, FCM, and Bayesian NMF. Insert run-time analysis or discussion of computational cost and scalability, especially given that Bayesian NMF is complicated.
Response 3: Thank you for this excellent suggestion. We agree that runtime benchmarking and computational complexity analysis would provide additional value. However, the main focus of this study was on clustering quality rather than execution efficiency. It is well established that matrix factorization-based techniques involve inherently higher computational complexity than classical methods, since they require both dimensionality reduction and subsequent clustering steps. While we did not conduct a detailed runtime comparison, we note in the revised manuscript (section Discussion, new paragraph) that Bayesian NMF remained computationally feasible, with convergence typically achieved within practical time frames. A more systematic benchmarking of runtime and scalability will be addressed in future work.
[Updated manuscript text (section Discussion, new paragraph):]
"The scope of this study was limited to clustering quality. Although a detailed runtime analysis was not included, it is evident that matrix factorization-based approaches entail higher computational complexity than traditional algorithms, as they require dimensionality reduction followed by clustering. In our experiments, Bayesian NMF remained computationally feasible, converging within practical time frames. Future work will explicitly benchmark execution time and scalability on larger datasets."
Comment 4: The literature review lacks some of the recent research work (2020–2024) on deep-learning-based clustering or hybrid recommender systems. Please enhance the related work section with more recent advancements in the area.
Response 4: Thank you for this suggestion. We enriched the Related Works section by integrating recent advances (2020–2025):
Di Palma, F.; Carrière, B.; Varoquaux, G. Do LLMs Memorize Recommendation Datasets? A Case Study on MovieLens-1M, 2025, [arXiv:cs.IR/2505.10212].
Gao, T.; Yu, S.; Wang, F.; Chen, B.; Shan, S.; Chen, X. Matrix Factorization with Dynamic Multi-view Clustering for Recommender System, 2025, [arXiv:cs.IR/2504.14565].
Orme, D.; Hao, Z.; Liatsis, P.; Jin, Y.; Yang, L. Multi-view Biclustering via Non-negative Matrix Tri-factorisation, 2025, [arXiv:cs.LG/2502.13698].
Comment 5: An overview diagram of the methodological process (data preprocessing, clustering, evaluation) will make it easier for readers to follow the paper. Providing a high-level workflow diagram would be useful.
Response 5: Thank you for this valuable suggestion. We agree that a workflow diagram could improve visualization. However, due to manuscript length constraints and to maintain focus on the experimental analysis, we did not include an additional figure. Conceptually, our methodology follows three straightforward phases: preprocessing, clustering, and evaluation. We believe that these phases are clearly described in the Materials and Methods section, ensuring that readers can follow the process without the need for a diagram.
Comment 6: Though the paper employs several random splits of the dataset, the paper is devoid of statistical validation (e.g., standard deviation, confidence intervals, or significance testing). Use proper statistical measures to ensure the stability of results.
Response 6: Thank you for this point. We note that results were highly stable across splits, with differences only beyond the sixth decimal. While detailed significance testing was not included, we acknowledge this as a limitation and as future work.
Comment 7: While α and β are being varied, the reasons for selecting those values are not very clear. Were these selected through grid search, cross-validation, or heuristics? More explanation should be given in that regard.
Response 7: Thank you. We clarify in the Methods that the α and β ranges were selected following the heuristics established by Hernando et al. (2016) and Bobadilla et al. (2018) to ensure comparability with prior benchmarks.
Comment 8: The paper contains no algorithmic steps or pseudocode for performing Bayesian NMF. Include a reproducibility appendix or give a GitHub link to code and parameters.
Response 8: Thank you for this comment. We clarify that the implementation of Bayesian NMF in this work was based on the CF4J framework (https://cf4j.etsisi.upm.es/), which provides reproducible implementations for collaborative filtering and matrix factorization techniques. To enhance transparency, we have added a paragraph in the Implementation Details subsection explicitly acknowledging CF4J as the software foundation and citing it as a reference.
[Updated manuscript text section Materials and methods]:
"The implementation of Bayesian NMF was developed using the CF4J framework (https://cf4j.etsisi.upm.es/), which offers reproducible libraries for collaborative filtering and matrix factorization algorithms. Building on this framework ensured consistency with prior works and reproducibility of the experimental setup."
Comment 9: The work uses the MovieLens 1M data in isolation. Justify this limitation or present results on a second benchmark dataset (e.g., Netflix Prize, Jester, Amazon Reviews).
Response 9: Thank you. We justify our focus on MovieLens 1M as it provides a reproducible, representative, and widely used benchmark. Using this dataset allows direct comparison with prior BNMF works. Larger datasets will be addressed in future studies.
Comment 10: The paper does not study the effect of varying levels of sparsity on clustering performance. Think about evaluating performance across different sparsity levels.
Response 10: Thank you. We acknowledge this limitation and explicitly note in the Discussion that future work will explore performance under varying sparsity levels and simulated noise.
[Updated manuscript text section Discussion]:
"A limitation of this work is that we did not evaluate sensitivity to varying levels of sparsity or noisy ratings. Future experiments will systematically adjust sparsity levels and introduce synthetic noise to better assess the robustness of Bayesian NMF in real-world recommendation contexts."
Comment 11: Legend captions on MAE and intra-cluster cohesion figures are useful, but legends are not readable, and axis captions are too small. Please make the figure more readable and consistent.
Response 11: Thank you for this observation. We revised the figures to enlarge axis captions and legends, ensuring consistent readability across all plots. In addition, all graphics were generated in vectorized format, which guarantees clear visualization and scalability without quality loss.
Comment 12: Other modern clustering algorithms (e.g., spectral clustering, DBSCAN, or hierarchical clustering) are not covered. It should be explained why these methodologies were not included.
Response 12: Thank you for this comment. Our study deliberately focused on clustering approaches derived from matrix factorization, since the main objective was to analyze the latent patterns behind factorization models and interpret them as clusters. Methods such as spectral clustering, DBSCAN, or hierarchical clustering, while valuable, fall outside this scope because they do not exploit the probabilistic latent structure that Bayesian NMF provides. For this reason, and to maintain coherence with prior BNMF-based benchmarks, we restricted the comparison to k-means, FCM, and matrix factorization-based clustering.
Comment 13: While intra-cluster cohesion is used, not mentioned are inter-cluster separation or clustering validity indices (e.g., Davies-Bouldin, Silhouette). Including such measures would further facilitate comparative analysis.
Response 13: Thank you for this suggestion. We respectfully note that while clustering validity indices such as the silhouette and Davies–Bouldin scores are widely used in general clustering contexts, they are not well suited to recommender system datasets, where sparsity typically exceeds 90%. These measures were originally designed for dense numerical matrices and tend to produce unstable or misleading results on extremely sparse, high-dimensional rating matrices. For this reason, and consistent with prior works (Hernando et al., 2016; Bobadilla et al., 2018), we prioritized intra-cluster cohesion as a reliable internal metric strongly correlated with recommendation accuracy. Nevertheless, we acknowledge the importance of complementary evaluation metrics and will explore adaptations of such indices for sparse recommender scenarios in future research.
Comment 14: The paper has a few figures and Table 1, but the supporting discussion is minimal. Each figure/table needs further interpretation in the paper.
Response 14: Thank you. We have expanded the Discussion and Results sections to provide more detailed interpretation of Figures 1–5 and Table 1.
Comment 15: Clustering is typically used to relieve cold-start problems. It would be worthwhile to test or at least conceptualize how Bayesian NMF helps out with cold-start scenarios (new users or items).
Response 15: Thank you. We address this in a new paragraph on practical implications in the Discussion, noting that BNMF can mitigate cold-start by probabilistically assigning new users/items to clusters based on partial information.
[Updated manuscript text section Discussion]:
"While this research prioritizes methodological benchmarking, the implications are directly applicable to real RS applications: In e-commerce, Bayesian NMF clustering can enhance personalization by grouping users with overlapping purchasing patterns, thus improving product discovery and cross-selling strategies. In streaming services, the ability to model users with multiple overlapping preferences allows for more accurate content recommendations in multi-genre consumption scenarios. In educational platforms, clustering students with similar learning behaviors can support adaptive learning pathways and targeted interventions. Furthermore, Bayesian NMF offers potential advantages in cold-start situations by assigning new users or items to probabilistic clusters based on partial information, thereby mitigating the lack of historical data. These applications underscore the practical relevance of probabilistic clustering in environments characterized by data sparsity and evolving user behaviors."
Comment 16: Several paragraphs repeat the same strengths of Bayesian NMF (e.g., interpretability, flexibility). This can be abbreviated to avoid redundancy.
Response 16: Thank you for this helpful comment. We agree with the reviewer. In the revised manuscript, we have edited the Introduction and Related Works sections by reducing redundancy, shortening sentences for clarity, and restructuring the flow to provide more synthesis.
Comment 17: The results are not critically addressed in the discussion section. It should address potential biases, limitations, and comparisons with other methods in the literature.
Response 17: Thank you. The Discussion has been expanded to address dataset dependence, biases, and comparative limitations of BNMF against classical and deep-learning methods.
Comment 18: Extensive use of acronyms like CMF, NMF, FCM, etc., is made without repeated mention of their full names. Check text for readability to less technical readers.
Response 18: Thank you. We reviewed the manuscript to periodically restate full method names alongside acronyms to improve readability.
Comment 19: The paper could use an independent paragraph that clearly states its limitations (e.g., dataset size, parameter sensitivity, absence of generalization testing).
Response 19: Thank you. We added a Limitations paragraph in the Discussion summarizing dataset size, parameter sensitivity, and lack of generalization testing.
[Updated manuscript text section Discussion]:
"A limitation of this work is that we did not evaluate sensitivity to varying levels of sparsity or noisy ratings. Future experiments will systematically adjust sparsity levels and introduce synthetic noise to better assess the robustness of Bayesian NMF in real-world recommendation contexts."
Comment 20: Although the conclusion is a summary of results, it doesn't include a visionary perspective. Please extend on how this work may grow (e.g., neural model integration, streaming data use cases).
Response 20: Thank you. We extended the Conclusions to include a visionary perspective, highlighting potential integration with neural recommender models and application to streaming data environments.
[Updated manuscript text section Conclusions]:
"Beyond summarizing the findings, this study opens avenues for future research.
One promising direction is the integration of Bayesian NMF with neural recommender architectures, combining interpretability with the representational power of deep learning.
Another avenue involves applying the method to streaming and real-time recommendation contexts, where user preferences evolve continuously and models must adapt dynamically.
Incorporating contextual signals such as temporal dynamics, implicit feedback, and multi-modal information will further extend the applicability of Bayesian NMF in modern digital platforms.
These perspectives highlight the potential of probabilistic clustering as a foundation for next-generation recommender systems."
Comment 21: To enhance reproducibility and impact, it is recommended that the authors make their implementation scripts and training/testing splits available on a public repository.
Response 21: Thank you. We agree and confirm that our implementation scripts and data splits will be released in a public GitHub repository for reproducibility.
[Updated manuscript text section Conclusions]:
"The implementation of Bayesian NMF was developed using the CF4J framework (https://cf4j.etsisi.upm.es/), which offers reproducible libraries for collaborative filtering and matrix factorization algorithms. Building on this framework ensured consistency with prior works and reproducibility of the experimental setup."
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript is significantly improved.
Concerning the answer:
>>> Our study is focused on the methodological and comparative evaluation of clustering algorithms, rather than on a specific industrial recommender deployment.
I cannot agree that only industrial research should have a good application part.
The experimental part should be improved.
Author Response
Comment 1: The manuscript is significantly improved. Concerning the answer: "Our study is focused on the methodological and comparative evaluation of clustering algorithms, rather than on a specific industrial recommender deployment." I cannot agree that only industrial research should have a good application part. The experimental part should be improved.
Response 1: We sincerely thank the reviewer for acknowledging the improvements made to the manuscript. We also appreciate the emphasis on the importance of connecting methodological contributions with application-oriented perspectives. While the primary scope of our work remains methodological and comparative, we have strengthened the experimental section by explicitly highlighting how the obtained results can be applied in real-world contexts. In the Discussion section, we now expand on the practical implications of Bayesian NMF in domains such as e-commerce, streaming platforms, and education, illustrating how the experimental findings translate into tangible recommendation scenarios. This addition ensures that the experimental analysis is not only methodologically rigorous but also clearly linked to potential applications.
[Updated manuscript text (Section Discussion - new paragraph)]:
"The experimental findings have direct implications for real-world recommender systems. For instance, the demonstrated ability of Bayesian NMF to handle sparse data and capture overlapping user preferences can be directly applied to e-commerce platforms for personalized product discovery, to streaming services for multi-genre content recommendation, and to educational systems for adaptive learning support. These examples underscore that the methodological advances reported in this study provide actionable insights for practical RS deployments, bridging the gap between algorithmic evaluation and applied use cases."
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have addressed the concerns and revised the manuscript accordingly. They have reduced the redundancies and improved the flow in the Introduction/Related Works; provided a clear rationale for exclusion; and justified the Y-axis standardisation choice based on comparability with prior work. For future work, some aspects are suggested.
- Adding systematic hyper-parameter tuning and statistical validation.
- Including benchmarks and sensitivity analyses.
- Reporting random seed settings for reproducibility.
Author Response
Comment 1: The authors have addressed the concerns and revised the manuscript accordingly. They have reduced the redundancies and improved the flow in the Introduction/Related Works; provided a clear rationale for exclusion; and justified the Y-axis standardisation choice based on comparability with prior work. For future work, some aspects are suggested.
- Adding systematic hyper-parameter tuning and statistical validation.
- Including benchmarks and sensitivity analyses.
- Reporting random seed settings for reproducibility.
Response 1: We sincerely thank the reviewer for recognizing the improvements made in the revised manuscript. We also appreciate the constructive suggestions for future research directions. In the Conclusions section, we have explicitly added a sentence acknowledging the importance of systematic hyper-parameter tuning, statistical validation, benchmark comparisons, sensitivity analyses, and the explicit reporting of random seed settings for reproducibility. This addition ensures that the limitations of the current work and promising future research avenues are clearly outlined.
[Updated manuscript text (Section Conclusions, new paragraph preceding the final paragraph)]:
"Future extensions of this work should incorporate systematic hyper-parameter tuning strategies combined with statistical validation, benchmarks with additional datasets, sensitivity analyses under varying sparsity conditions, and explicit reporting of random seed settings to further strengthen reproducibility."
Reviewer 3 Report
Comments and Suggestions for Authors
Dear authors,
The article looks fine in its revised version. Therefore, the article can be accepted.
Author Response
Comment 1: Dear authors. The article looks fine in its revised version. Therefore, the article can be accepted.
Response 1: We sincerely thank the reviewer for the positive evaluation and for recommending acceptance of our work. We greatly appreciate the constructive feedback provided throughout the review process, which has helped us to improve the clarity, rigor, and overall quality of the manuscript.