Enhancing Sequence Movie Recommendation System Using Deep Learning and KMeans

The flood of available information makes it challenging for people to find and filter their favorite items. Recommendation systems (RSs) have emerged as a solution to this problem; however, traditional recommendation systems, including collaborative filtering and content-based filtering, face significant challenges such as data scalability, data scarcity, and the cold-start problem, all of which require advanced solutions. Therefore, we propose a ranking and enhanced sequence movie recommendation system that combines deep learning models to resolve these issues. To mitigate these challenges, we design an RS model that utilizes user information (age, gender, occupation) to analyze new users and match them with others who have similar preferences. Initially, we construct sequences of user behavior to effectively predict the potential next target movie of users. We then incorporate user information and movie sequence embeddings as input features to reduce the dimensionality, before feeding them into a transformer architecture and multilayer perceptron (MLP). Our model integrates a transformer layer with positional encoding for user behavior sequences and multi-head attention mechanisms to enhance prediction accuracy. Furthermore, the system applies KMeans clustering to movie genre embeddings, grouping similar movies and integrating this clustering information with predicted ratings to ensure diversity in the personalized recommendations for target users. Evaluating our model on two MovieLens datasets (100 K and 1 M) demonstrated significant improvements, achieving RMSE, MAE, precision, recall, and F1 scores of 1.0756, 0.8741, 0.5516, 0.3260, and 0.4098 for the 100 K dataset, and 0.9927, 0.8007, 0.5838, 0.4723, and 0.5222 for the 1 M dataset, respectively.
This approach not only effectively mitigates cold-start and scalability issues but also surpasses baseline techniques in Top-N item recommendations, highlighting its efficacy in the contemporary environment of abundant data.


Introduction
Internet technology and artificial intelligence have attracted researchers, industries, and markets to the Internet of Things (IoT) in recent decades. IoT connects heterogeneous physical objects to collect data from smart gadget sensors [1]. Smartphones, smartwatches, laptops, sensors, and their internet connectivity generate big data, both structured and unstructured. Big data dominates research in business, healthcare, monitoring systems, transportation, smart homes, and more [2][3][4]. Patient monitoring systems use IoT devices to capture real-time data for personalized therapy and early disease detection. Businesses use big data to improve their decision-making, streamline operations, and customize consumer experiences, enhancing performance and competitiveness. IoT gadgets in smart homes use data for convenience, security, and energy management, demonstrating how big data can improve daily life.
• We model recommendations using the temporal sequence of user interactions (movie ratings) to deliver a dynamic and contextual understanding of user preferences based on the MovieLens dataset. Our approach adopts a Transformer architecture, integrating multi-head attention with user demographic and movie embeddings, which allows the model to weigh various aspects of a user's movie-watching history differently when making predictions for the next target movie. Sequential recommendation is regarded as an advanced model-based CF approach, as it is more effective in tackling the issues of existing traditional techniques.
• The model then applies KMeans clustering as a post-processing step to group movies into clusters according to their embeddings, which aids in diversifying recommendations. It also integrates movie genres as extra attributes, boosting the model's capacity to represent varied movie qualities. The algorithm is designed to produce Top-N recommendations for users, employing clustering to ensure a mix of genres and preferences. The evaluation measures extend beyond typical loss functions to include precision, recall, and F1 scores, offering a comprehensive view of the model's performance.
The rest of the paper is organized as follows: Section 2 reviews related work on movie recommendation. Section 3 presents our proposed framework, including the system architecture and the theoretical transformer model. Section 4 reports the experimental results and data analysis and compares our model with baseline methods. Finally, Section 5 concludes the work.

Related Work
This section introduces traditional recommendation systems and recent advanced recommendation systems, which form the core of our model. RSs have been widely investigated in various disciplines of artificial intelligence, with different machine learning techniques applied to solve specific difficulties.

Traditional-Based Recommendation Systems
Several studies have investigated the evolution of recommendation systems, which traditionally comprise two parts: content-based filtering and collaborative filtering [20]. Content-based filtering (CBF) focuses on recommending new items that are similar to those a user liked in the past, relying on item features and user profiles. Pazzani, Michael J., and Daniel Billsus [21] emphasized the importance of item features and user profile creation for effective recommendations. The strength of CBF lies in its ability to recommend items similar to those a user preferred in the past, utilizing detailed descriptions and metadata of items. However, it often suffers from a lack of diversity in its recommendations and struggles with the new item (cold start) problem, where new items have not yet been rated sufficiently to be recommended.
On the other hand, collaborative filtering (CF), the more prevalent approach, relies on both implicit and explicit user-item interactions [22]. CF utilizes the user-item interaction matrix and introduced foundational algorithms such as matrix factorization. Despite their effectiveness, challenges such as cold start, scalability, and sparsity persist. Sarwar, Badrul et al. [23] addressed scalability with "item-based collaborative filtering", proposing algorithms that improved recommendation efficiency and performance. Takács, Gábor et al. [24] furthered CF with matrix factorization techniques, enhancing its ability to deal with large datasets and improving recommendation accuracy. Barathy, R., and P. Chitra [25] applied matrix factorization (MF) and Singular Value Decomposition (SVD) to decompose the user-item interaction matrix into latent factors, capturing underlying patterns in user preferences and item characteristics to improve recommendation accuracy. Rendle, Steffen et al. [26] utilized Bayesian Personalized Ranking (BPR) to optimize personalized ranking item predictions by extracting latent features representing user preferences and factorizing the user-item interaction matrix into a lower dimension based on explicit or implicit feedback. To overcome the limitations inherent in CBF and CF, researchers explored hybrid models. Sun, Chang et al. [27] presented hybrid news recommendation algorithms that combine content-based methods (using TF-IDF and K-means clustering) with SVD-based collaborative filtering to improve overall performance. Further, Patoulia, Agori Argyro et al. [28] conducted a comparative study of collaborative filtering in product recommendation, which demonstrated that the LightFM library outperforms the Surprise library in handling foodservice transactional data, emphasizing LightFM's ability to deal with sparse data and establish personalized recommendations. Their study emphasizes the importance of hyperparameter tuning for optimal algorithm performance and represents a significant development in collaborative filtering techniques.
Initially, recommendation systems relied heavily on content-based and collaborative filtering techniques. Although they are effective, they are limited in their ability to handle complex patterns and large-scale data, particularly when the rating matrix is sparse.

Advanced-Based Recommendation Systems
The advent of deep learning has led to a paradigm shift in recommendation systems. These models excel at capturing complex user-item interactions and integrating diverse data types, such as textual and visual information, into the recommendation process.
Zhang, Shuai et al. [29] surveyed and offered new perspectives on deep-learning-based recommender systems. Neural collaborative filtering (NCF) demonstrated how deep neural networks can learn user-item interaction patterns more efficiently than traditional matrix factorization methods. This approach leverages a multi-layer perceptron to learn the non-linear interactions between user and item, significantly enhancing the recommendation accuracy. While neural-network-based models and the application of Restricted Boltzmann Machines have offered a nuanced understanding of user-item interactions, deep learning has also enhanced content analysis capabilities, with convolutional neural networks (CNNs) being applied for feature extraction from non-textual content, such as images and videos, significantly improving content-based recommendations. AutoRec and CDAE utilize an autoencoder framework to extract latent vectors for user-item dynamics, predicting user ratings and generating high-quality movie recommendations. Xinchang, Khamphaphone, Phonexay Vilakone, and Doo-Soon Park [30] proposed using social network analysis and collaborative filtering to form the recommendation system, and their method solved the cold-start problem of the traditional approach. Moreover, they applied a community detection method to cluster similar users and recommend a movie list to a target user based on similar preferences. Aiming to develop an improved collaborative movie recommendation system that combines K-means clustering with neural networks, Jing and Hui [31] introduced a hybrid approach that applies K-means to cluster movies into groups and then trains a neural network model to learn users' preferences based on the clusters, providing more accurate and personalized recommendations, especially for new users with sparse data. Wang, Kai et al. [32] presented a novel model that combines k-means clustering with a deep neural network to generate personalized recommendations for users in e-commerce applications, which can effectively solve the problems of sparse data and information overload. Chen, Jianguo et al. [33] applied K-means clustering in healthcare recommendation systems to group patients with similar medical histories. Clustering patients based on demographic and medical data facilitated personalized treatment recommendations, improving healthcare outcomes.
The traditional domain of collaborative filtering has undergone substantial evolution with the incorporation of deep learning models, particularly matrix factorization techniques combined with neural networks that can capture temporal dynamics in user-item interactions to predict the next items. Early work on sequential recommendation systems was generally based on Markov Chains (MC). For instance, FPMC, proposed by Rendle, Steffen, Christoph Freudenthaler, and Lars Schmidt-Thieme [34], was the most commonly used technique; it combines the power of matrix factorization and MC to make next-basket recommendations, encoding users' short-term interests, and demonstrated good performance on sparse datasets. Tang, Jiaxi, and Ke Wang [35] proposed the convolutional sequence embedding model (Caser), a Top-N model that treats the embedded sequential patterns as local features of an image. Both horizontal and vertical convolutional filters were applied to the embedding matrix of high-order Markov chains, treated as the image, in order to capture high-level sequential patterns. Hidasi, Balázs et al. [36] developed a novel CF model that incorporates recurrent neural networks (RNNs) to track changes in user preferences over time (as in the case of Netflix). This approach addresses the traditional limitations of CF, such as cold start and data sparsity problems, by providing a dynamic representation of user interests, leading to more accurate and timely recommendations. RNNs can capture the sequential dependence among user-item interactions in a behavior sequence to predict the possible interactions with the next item. However, when RNNs deal with long sequences, backpropagation faces acute problems such as vanishing gradients. In addition, it only handles point-wise dependency. Choe, Byeongjin, Taegwan Kang, and Kyomin Jung [37] used gated recurrent units (GRU), and Duan, Jiasheng et al. [38] used long short-term memory (LSTM); these popular recurrent architectures encode items into dense vectors that reflect users' interests and further improve Top-N recommendation performance. However, such complex models require large amounts of data to capture long-term patterns, so overfitting easily occurs in high-sparsity settings. They also cannot be run in parallel, which makes them time-consuming.
Even though other state-of-the-art recommendation technologies can provide satisfactory results, recommendation models remain an active research topic, and their performance continues to improve. The integration of deep learning into RSs using a transformer model has been a significant area of innovation. The transformer architecture, originally used for natural language processing, has led to a significant leap forward in creating adaptive systems that can capture long-range dependencies in sequences of user-item interactions, improving recommendation performance and interpretability. Kang, Wang-Cheng, and Julian McAuley [39] introduced self-attentive sequential recommendation (SASRec), while Yu, Saisai et al. [40] proposed personalized movie recommendation algorithms that fuse visual and textual features using multi-head attention with neural networks to address data sparsity and cold start problems. Chen, Qiwei et al. [41] presented the Behavior Sequence Transformer for E-commerce Recommendation in Alibaba, a model designed to enhance recommendation systems within the e-commerce domain. It is built upon the transformer architecture, a popular deep-learning architecture that is effective in sequence-to-sequence tasks. Wang, Dongjing et al. [42] proposed a transformer-based RS that adapts to both sequential and contextual information in user interactions. By employing self-attention mechanisms, the system can dynamically weigh the importance of different items in a user's history, leading to highly personalized and context-aware recommendations. In addition, existing works focus on understanding users' playlists or listening sequences, which have inspired many sequential recommendation models. Chen, Quanzhen et al. [43] introduced a hybrid model incorporating GNNs to leverage both user-item interactions and content features. The approach allows for a more comprehensive understanding of user preferences and item characteristics, leading to improved recommendation diversity and accuracy. Additionally, hybrid systems employing the transformer architecture have been developed to better capture sequential user behavior and item attributes to make dynamic recommendations.

Methodology of Movie Recommendation Systems
In this section, we provide an overview of the recommendation architecture and the important requirements for predictive tasks. Additionally, we introduce the process, from data processing to the point where the model can generate a list of movie recommendations for the target users.

Data Processing and Sequence Creation
First, we explain the preprocessing steps required in our implementation. We utilize users' ratings from the MovieLens dataset to construct a recommendation system. Initially, we merge all the rating information for an individual user into the input format required by our transformer model [44]. We then construct a vocabulary for movie IDs and user IDs and create sequences of user interactions in chronological order. All user interactions are first sorted by their interaction timestamp and then divided into subsequences for model training. To facilitate subsequent calculations, we convert the lists of movie IDs and movie ratings to a fixed length and retain the position of each item within its subsequence. This yields a structured input format that encapsulates a user's interaction history up to a fixed length, thereby preparing the data for the transformer model. After processing, the input is generated for each individual user in a sequential manner and includes the user ID, the movie ID sequence, the rating sequence, the target (label) that the model attempts to predict (the target movie ID), and the rating of the last item in the sequence, as shown in Table 1. To enhance the input data, we integrate additional attributes, such as the movie genres in the sequences and the demographic information (age, gender, occupation) of the users, into our final dataset. The final dataset was then subjected to a series of cleaning procedures, including removing duplicates, handling missing values, and addressing outliers, before being finalized for analysis. Furthermore, some data had to be converted from categorical to numerical formats to be compatible with the transformer model and k-means clustering in the embedding layer. Variables such as gender and occupation were encoded from categorical to numerical formats through factorization. The age feature was normalized using the MinMaxScaler, which scales the data to fall within a specified range. This enhances model performance by ensuring that numerical features are on a similar scale. Movie genres were expanded into binary features, each indicating the presence of a specific genre, thus enriching the dataset with explicit genre information, as shown in step 1 of Figure 1 and in Figure 2.
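The preprocessing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names (`build_subsequences`, `factorize`, `min_max_scale`, `multi_hot`), the toy interactions, and the choice of the last element of each window as the target are hypothetical, and the standard library stands in for the pandas and scikit-learn utilities that a real implementation would probably use.

```python
from typing import Dict, List, Sequence, Tuple


def build_subsequences(items: Sequence[int], length: int, step: int = 1) -> List[List[int]]:
    """Slide a fixed-length window over one user's chronologically sorted items."""
    return [list(items[i:i + length]) for i in range(0, len(items) - length + 1, step)]


def factorize(values: Sequence[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct category to an integer code, in order of first appearance."""
    codes: Dict[str, int] = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [(v - lo) / span for v in values]


def multi_hot(genres: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """One binary feature per genre in the vocabulary."""
    present = set(genres)
    return [1 if g in present else 0 for g in vocabulary]


# Hypothetical (movie_id, rating, timestamp) rows for a single user.
interactions = [(50, 4.0, 300), (12, 5.0, 100), (7, 3.0, 200), (31, 2.0, 400), (9, 4.0, 500)]
interactions.sort(key=lambda row: row[2])          # chronological order
movie_ids = [movie for movie, _, _ in interactions]

windows = build_subsequences(movie_ids, length=4)  # sequence length N = 4
examples = [(w[:-1], w[-1]) for w in windows]      # (history, target movie)

gender_codes, gender_map = factorize(["M", "F", "M"])
scaled_ages = min_max_scale([18, 25, 56])
genre_vector = multi_hot(["Action", "Drama"], ["Action", "Comedy", "Drama"])
```

Each window keeps the positions of its items implicitly through list order, which is what the positional encoding later relies on.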

Model Architecture
This section introduces the model architecture of the entire recommendation system, as detailed in Figure 1. There are two main processes in our system. The first process applies deep learning and key transformer features to predict the next movie based on the user's sequence and information. The transformer model was implemented with a multi-head attention layer and positional embedding, which are adept at capturing the complex viewing patterns of users. The multi-head attention mechanism allowed the model to focus on different parts of the user's movie history to provide a comprehensive understanding of their preferences. The positional embeddings inform the model of the order in which movies were watched, which is crucial for predicting future interests. We fed the embedding layer of the user behavior sequence into the transformer architecture and then combined its output with the user demographics, followed by a multi-layer perceptron (MLP), to predict the rating. In the second process, after training the transformer model, we integrated the predicted ratings with K-means clustering to generate Top-N recommendations for the target user. K-means clustering is a popular method in cluster analysis that partitions a set of objects into K clusters such that the sum of the squared distances between the objects and their assigned cluster mean is minimized. Our objective was not only to develop recommendation algorithms that personalize the results based on historical preferences but also to dynamically adjust the diversity of the recommendation list according to the user's interests. To achieve this, we utilized the K-means approach for personalized diversity, which automatically segments movies into distinct groups based on user preferences (the target next movie) according to predefined categories (movie embeddings associated with the movie's genre). When creating the recommendation list, we examined these clusters to ensure a diverse selection of Top-N movies when a user interacts with our system. Similarly, for new movies with few or no ratings, clustering helps us identify and recommend unseen movies to target users based on their similarity score with other movies. The K-means clusters help diversify the selected Top-N by ensuring that movies from different clusters are included in the recommendations. Our approach therefore provides a variety of recommendations that are not limited to a specific genre or type of movie [45,46].

Process ONE: Predicting User Ratings for Movies
This section introduces the first process, which is used for rating prediction. The primary purpose of our work was to predict how a particular user might rate a given movie.
The process includes the elements required by our predictive model. The model took in features such as user ID, movie ID, the user's demographic data, and historical movie ratings (the sequence experiences of the target user), and output predicted ratings.

(a) User-Movie Embedding Layer
We divided the learning data into two parts: user demographics and movie sequence embeddings. User demographic data consisted of the user ID, gender, age, and occupation. Each category of input features first passed through the embedding layer to become a dense feature vector. An embedding matrix $u_j \in \mathbb{R}^{u \times d}$ was created to transform the integer index into a dense vector of fixed size, where $u$ represents the vocabulary size (the total number of unique categorical elements) and $d$ is the embedding dimension. These dense vectors reduce the dimensionality of the input features while capturing the underlying relationships and characteristics of each categorical feature, and they are fed directly into the multi-layer perceptron (MLP).
Movie features contained the movie IDs in sequence and the movie genres. First, the movie IDs in the sequence were transformed into dense feature vectors through the embedding layer. The input movies' genre features were transformed into feature vectors through multi-hot encoding. We also obtained embeddings for each movie ID in the behavior sequence, including the target movie, which was essential for the transformer model to understand the sequence's temporal dynamics. Furthermore, because the transformer architecture lacks inherent positional information, we incorporated a learnable position encoding matrix to encode the order of the input sequence. For each movie, we then concatenated the movie ID sequence with the positional encoding matrix and created an embedding matrix $\hat{E}_i \in \mathbb{R}^{n \times d}$, which represents the embedding vector of the i-th movie in each behavior sequence of user u after adding the position vector. In our process, we fed the sequence of movie embeddings into a single transformer layer before concatenating the output with the user features.
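To make the embedding step concrete, the following minimal sketch looks up a dense vector for each movie ID and adds a position vector of the same size. It rests on several assumptions: the tables are randomly initialized lists rather than trained parameters, the position vector is added element-wise (one common reading of "after adding the position vector"), and the names `make_embedding_table` and `embed_sequence` and the sizes used are hypothetical.

```python
import random
from typing import List

Table = List[List[float]]


def make_embedding_table(vocab_size: int, dim: int, rng: random.Random) -> Table:
    """One dim-sized row per integer index; a trained model would learn these values."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed_sequence(movie_ids: List[int], movie_table: Table, position_table: Table) -> Table:
    """Return an n x d matrix: movie embedding plus the embedding of its position."""
    return [
        [m + p for m, p in zip(movie_table[movie_id], position_table[pos])]
        for pos, movie_id in enumerate(movie_ids)
    ]


rng = random.Random(0)
VOCAB_SIZE, SEQ_LEN, DIM = 100, 4, 8               # illustrative sizes only
movie_table = make_embedding_table(VOCAB_SIZE, DIM, rng)
position_table = make_embedding_table(SEQ_LEN, DIM, rng)

embedded = embed_sequence([12, 7, 50, 31], movie_table, position_table)
repeated = embed_sequence([7, 7], movie_table, position_table)
```

Because the position vector differs per slot, the same movie appearing at two positions (as in `repeated`) receives two different input vectors, which is how order information reaches the attention layer.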

(b) Model Training
The transformer is a deep learning model primarily used for processing sequential data such as natural language. It can capture the dynamic influence of users' recent activities. In our study, we analyzed the structure of the transformer layer, which enhanced the model's ability to capture the long-range dependencies and relationships between movies in a sequence, allowing it to make more accurate and personalized recommendations. The transformer layer comprised a multi-head attention layer and a Position-Wise Feed-Forward Neural Network (FFN).
The multi-head attention mechanism allowed the model to focus on different sections of the input sequence in different ways (e.g., longer-term versus shorter-term dependencies) when predicting the rating of each target movie. The module processes the information multiple times in parallel via an attention mechanism. This mechanism allows the model to jointly attend to information from different representation subspaces at different positions, providing a richer understanding of the input sequence. The outputs of the individual attention heads are then concatenated and linearly transformed into the expected dimension. In our case, the multi-head attention operation takes the embedding $\hat{E}$ as input. The input is passed through three separate linear layers to produce the Q, K, and V matrices, as presented below:
$$Q = \hat{E} W^{Q}, \quad K = \hat{E} W^{K}, \quad V = \hat{E} W^{V}$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$$
$$\mathrm{MH}(\hat{E}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$
where each $\mathrm{head}$ is the output of one attention operation, $W^{Q}$, $W^{K}$, and $W^{V} \in \mathbb{R}^{d \times d}$ are the projection matrices that make the model more flexible, $\hat{E}$ is the embedding matrix of all movies, $W^{O}$ is a learnable weight matrix that forms the final representation of the movie at position i, and h is the number of heads.
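The attention computation described above can be illustrated with a small pure-Python sketch of scaled dot-product attention with several heads. It is not the authors' TensorFlow code. The helper names, the random weights, and the toy sizes are hypothetical, and two details are assumptions: each head uses its own projection of size d/h, and the scores are scaled by the square root of that per-head size.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def multi_head_attention(e: Matrix, w_q: List[Matrix], w_k: List[Matrix],
                         w_v: List[Matrix], w_o: Matrix) -> Matrix:
    """Run one attention per head, concatenate the heads, then project with w_o."""
    heads = [
        attention(matmul(e, wq), matmul(e, wk), matmul(e, wv))
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    concat = [sum((head[i] for head in heads), []) for i in range(len(e))]
    return matmul(concat, w_o)


rng = random.Random(0)


def random_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


N, D, H = 4, 8, 2                      # sequence length, model size, heads (toy values)
D_HEAD = D // H
e_hat = random_matrix(N, D)            # stands in for the position-aware embeddings
w_q = [random_matrix(D, D_HEAD) for _ in range(H)]
w_k = [random_matrix(D, D_HEAD) for _ in range(H)]
w_v = [random_matrix(D, D_HEAD) for _ in range(H)]
w_o = random_matrix(H * D_HEAD, D)

output = multi_head_attention(e_hat, w_q, w_k, w_v, w_o)
```

The output keeps the input shape (one d-sized vector per sequence position), so it can be passed on to the feed-forward sublayer unchanged.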
In our transformer layer, a point-wise feed-forward network (FFN) gave the model the flexibility needed to understand complex movie interactions within the sequence. These networks had hidden layers and non-linear activation functions, which allowed the model to learn complex patterns in the movie ID sequence. The FFN consisted of two fully connected layers. We also utilized dropout and layer normalization to optimize the model, avoiding overfitting and speeding up convergence, with LeakyReLU as the activation in the FFN. These operations were applied to each position, separately and identically, during training. The overall output of the transformer layer, which combines the multi-head attention and FFN sublayers and represents the candidate movie together with the characteristics of the user's behavior sequence, can be defined as follows:
$$S' = \mathrm{LayerNorm}\left(\hat{E} + \mathrm{Dropout}\left(\mathrm{MH}(\hat{E})\right)\right)$$
$$F = \mathrm{LayerNorm}\left(S' + \mathrm{Dropout}\left(\mathrm{LeakyReLU}(S' W_1 + b_1)\, W_2 + b_2\right)\right)$$
where $\mathrm{MH}(\hat{E})$ is the multi-head attention output for the embedding matrix $\hat{E}$, and $W_1, W_2 \in \mathbb{R}^{d \times d}$ and $b_1, b_2 \in \mathbb{R}^{d}$ are the learnable weight matrices and bias parameters. Next, we concatenated the user demographic embedding $u_j$ with the output of the transformer for the target item, $T_i$, and passed them together through a multilayer perceptron (MLP) network. The output layer computed the predicted rating of the next target movie using a dense layer, which applies a linear transformation to all flattened and concatenated features, as expressed below:
$$\hat{y}_{i,j} = \sigma\left(W \cdot x_{i,j} + B\right)$$
where $\hat{y}_{i,j}$ is the predicted rating of user j for the i-th movie, $W$ is the weight matrix of the output layer, $(\cdot)$ is the dot product between the weight matrix $W$ and the input vector $x_{i,j}$, $B$ is the bias vector of the output layer, and $\sigma$ is the sigmoid activation function.
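A minimal sketch of the feed-forward sublayer and the output layer follows, written for a single position vector. Several choices are assumptions made only for illustration: dropout is omitted (as at inference time), layer normalization has no learnable gain or bias, the LeakyReLU slope of 0.01 is arbitrary, and the sigmoid output is rescaled to the 1-5 rating range, which the text does not specify. All function names and weights are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Identity for positive inputs, small linear slope for negative ones."""
    return x if x > 0 else slope * x


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """x (length n) times weights (n x m) plus bias (length m)."""
    return [sum(xi * w for xi, w in zip(x, col)) + b for col, b in zip(zip(*weights), bias)]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Normalize one vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def ffn_block(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """LayerNorm(x + LeakyReLU(x W1 + b1) W2 + b2), with dropout left out."""
    hidden = [leaky_relu(h) for h in dense(x, w1, b1)]
    out = dense(hidden, w2, b2)
    return layer_norm([xi + oi for xi, oi in zip(x, out)])


def predict_rating(features: Vector, w: Vector, b: float,
                   min_rating: float = 1.0, max_rating: float = 5.0) -> float:
    """Sigmoid of a linear layer, rescaled to the rating range (an assumption)."""
    z = sum(f * wi for f, wi in zip(features, w)) + b
    prob = 1.0 / (1.0 + math.exp(-z))
    return min_rating + prob * (max_rating - min_rating)


# Hypothetical weights for a 3-dimensional position vector.
x = [0.5, -1.0, 2.0]
w1 = [[0.1, 0.2, 0.0], [0.0, -0.3, 0.4], [0.5, 0.0, 0.1]]
b1 = [0.0, 0.1, -0.1]
w2 = [[0.2, 0.0, 0.1], [0.0, 0.3, 0.0], [-0.1, 0.0, 0.2]]
b2 = [0.0, 0.0, 0.0]

transformer_out = ffn_block(x, w1, b1, w2, b2)
user_features = [0.3, 0.7]                         # stands in for the demographic embedding
rating = predict_rating(transformer_out + user_features, [0.1] * 5, 0.0)
```

Concatenating `transformer_out` with `user_features` before the final layer mirrors the way the sequence representation and the demographic embedding are joined ahead of the MLP.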

Process TWO: Leveraging Predictions for Recommendations
In this second process, we leveraged the predictions output by Process One to build a Top-N recommendation list. After training, we applied KMeans clustering to the movie embeddings to group movies into clusters based on the learned features. This clustering assists the recommendation process by identifying movies that are similar to each other.
To recommend movies, we first predicted ratings for a user's movies and then selected the Top-N movies based on these predicted ratings. Clustering information was used to refine the recommendations, either ensuring diversity or focusing on a specific genre or group of similar movies. The linkage of KMeans clustering to our model can be explained as follows. After training the recommendation model, we extracted the embedding of each movie; these embeddings were the output of the movie embedding layer in our neural network model. The KMeans clustering step then proceeded as follows: (1) Choose the number of clusters, K, based on domain knowledge, heuristic methods (such as the elbow method), or experimentation. (2) Initialize the KMeans algorithm with K clusters and fit it on the movie embeddings. (3) Assign each movie to the nearest cluster centroid. To generate Top-N recommendations, the system predicted ratings for movies that had not yet been rated by a user, leveraged the clustering information to ensure diversity or focus on specific interests, and then selected the top-rated movies as recommendations.
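The steps above can be sketched as follows. The K-means routine is a plain Lloyd's-algorithm implementation written for illustration, and the round-robin selection across clusters is only one possible way to use cluster labels for diversity; the text does not state the exact selection rule, so `diversified_top_n`, the toy embeddings, and the predicted ratings are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Sequence[float]], k: int, iterations: int = 20, seed: int = 0) -> List[int]:
    """Lloyd's algorithm: assign each point to its nearest centroid, then update centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def diversified_top_n(predicted: Dict[str, float], cluster_of: Dict[str, int], n: int) -> List[str]:
    """Take the best remaining movie from each cluster in turn until n movies are chosen."""
    by_cluster: Dict[int, List[str]] = defaultdict(list)
    for movie, _ in sorted(predicted.items(), key=lambda kv: kv[1], reverse=True):
        by_cluster[cluster_of[movie]].append(movie)
    queues = sorted(by_cluster.values(), key=lambda q: predicted[q[0]], reverse=True)
    picks: List[str] = []
    while len(picks) < n and any(queues):
        for queue in queues:
            if queue and len(picks) < n:
                picks.append(queue.pop(0))
    return picks


# Toy 2-D "movie embeddings" forming two well-separated groups.
embeddings = {
    "A": (0.0, 0.0), "B": (0.1, 0.0), "C": (0.0, 0.1),
    "D": (5.0, 5.0), "E": (5.1, 5.0), "F": (5.0, 5.1),
}
movies = list(embeddings)
labels = kmeans([embeddings[m] for m in movies], k=2)
cluster_of = dict(zip(movies, labels))

predicted = {"A": 4.8, "B": 4.7, "C": 4.6, "D": 4.0, "E": 3.9, "F": 3.0}
recommendations = diversified_top_n(predicted, cluster_of, n=4)
```

A plain Top-4 by predicted rating would return A, B, C, and D, three of which come from one cluster; the round-robin variant alternates between the two clusters instead.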

Experimental Study
To verify the effectiveness of our proposed model, we used two well-established versions of the MovieLens dataset, 100 K and 1 M [47], in the experiments, as presented in Table 2. MovieLens is a popular benchmark dataset for RS evaluation that includes many user attributes. The 100 K dataset consisted of interactions between 944 users and 1682 items (movies). The 1 M dataset was larger, with interactions between 6040 users and 3883 items. Each dataset was structured into three distinct CSV files containing attributes such as user ID, movie ID, movie title, rating, timestamp, gender, age, occupation, zip code, and genres. Each dataset consists of ratings from anonymous users on a scale from 1 to 5. We gathered relevant data on user-item interactions, movie genres, and user demographics such as age, gender, and occupation as input features to speed up our system and obtain better recommendation results. After the data were preprocessed, each dataset was divided into training and testing sets: the training set contained 80 percent of the MovieLens data, and the remaining 20 percent formed the testing set. In the model, ratings were treated as the target values in a sequence to be fitted. Furthermore, we saved the training and testing datasets to CSV files for use in model training and evaluation. The sample data in Figure 2 show the output of the MovieLens data processing.

Environment Set-up of Transformer Model
In our model, we performed parameter optimization with Adam, using a learning rate of 0.001 and a default batch size of 128. We set the hyper-parameters as follows: number of transformer layers L = 2, number of attention heads h = 8, sequence length N = 4, inner size of 256 (i.e., the FFN layers), and MLP hidden sizes of [128, 128]. After each layer, we used a dropout layer with a dropout rate of 0.2 to reduce overfitting. We set a feed-forward layer followed by a Softmax layer to predict the probability of a movie. The training process was terminated after a maximum of 100 epochs. These procedures were implemented using TensorFlow 2.13.0 and Python 3.8.10. The hardware and software environment used for implementing the task included Windows 11 Enterprise 64-bit, a 12th Gen Intel(R) Core(TM) i7-12700 CPU @ 4.02 GHz, and 32.0 GB RAM.
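For reference, the reported settings can be gathered in a single configuration object, as in the sketch below. The class and field names are hypothetical and not taken from the authors' code; only the values come from the text above. The embedding dimension is left out because it is not reported.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TransformerConfig:
    """Hyper-parameters as reported in the text; field names are illustrative."""
    optimizer: str = "adam"
    learning_rate: float = 0.001
    batch_size: int = 128
    num_transformer_layers: int = 2          # L
    num_heads: int = 8                       # h
    sequence_length: int = 4                 # N
    ffn_inner_size: int = 256
    mlp_hidden_units: Tuple[int, ...] = (128, 128)
    dropout_rate: float = 0.2
    max_epochs: int = 100


config = TransformerConfig()
```

Freezing the dataclass keeps the settings immutable during a run, which makes it harder to change a value by accident between training and evaluation.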

Environment Set up of K-Means Algorithm
In this section, we determined the number of clusters k using the Elbow method. This method involves iteratively fitting K-Means models with varying values of k to evaluate their performance. We applied the MovieLens dataset to generate the top 10 recommendations by predicting the next target movie, specifically movie_9 in the 100 K dataset and movie_2926 in the 1 M dataset. After model training, these movies were identified among the distinct clusters of other movies and classified into separate groups using the K-Means algorithm, along with genre information. To identify the optimal number of clusters, we ran K-Means clustering on the dataset for a range of k values (e.g., from 1 to 50) and computed, for each k, the sum of squared distances from each point to its assigned center (inertia). This process requires running the algorithm multiple times in a loop with an increasing number of clusters and then calculating a clustering score as a function of the number of clusters. The optimal cluster values for different pairs of genres are 4, 5, 7, and 10. In our experiments, the elbow point was observed at k = 4, indicating that this number of clusters provides a suitable balance between capturing the underlying data structure and avoiding excessive model complexity.
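The Elbow procedure described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: a small Lloyd's-iteration K-Means stands in for a library implementation, the farthest-first initialization is our choice to keep the sketch deterministic, and `make_toy_embeddings` generates synthetic 2-D points in place of the real genre embeddings.

```python
import random
from typing import Dict, Iterable, List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points: List[Point], k: int, rng: random.Random) -> List[List[float]]:
    """Pick one random point, then repeatedly add the point farthest from all chosen centers."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(list(farthest))
    return centers


def kmeans_inertia(points: List[Point], k: int, n_iter: int = 100, seed: int = 0) -> float:
    """Run Lloyd's algorithm and return the inertia: the sum of squared
    distances from each point to its nearest center."""
    rng = random.Random(seed)
    centers = farthest_first_centers(points, k, rng)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest center.
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def elbow_curve(points: List[Point], k_values: Iterable[int]) -> Dict[int, float]:
    """Inertia for each candidate k; the elbow is read off this curve."""
    return {k: kmeans_inertia(points, k) for k in k_values}


def make_toy_embeddings(seed: int = 0) -> List[Point]:
    """Synthetic stand-in for genre embeddings: four well-separated 2-D blobs."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
    return [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
            for cx, cy in blob_centers for _ in range(25)]


if __name__ == "__main__":
    for k, inertia in elbow_curve(make_toy_embeddings(), range(1, 9)).items():
        print(k, round(inertia, 1))
```

On the toy data the inertia drops sharply up to k = 4 and flattens afterwards, which is the shape the Elbow method looks for. With real embeddings the curve is usually less clear-cut and is typically inspected visually.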

Evaluation Metric
To assess the performance of our sequential recommendation system, we adopted leave-one-out evaluation for the next-item recommendation task. For each user, we used the last movie of the behavior sequence as the test set and the remaining movies as the training set. We followed the common convention of pairing each ground-truth item in the test set with 100 random negative items that the user had not interacted with, giving them all the same context as the corresponding positive test item, and ranking the positive test item among them. Lastly, we truncated the ranked list at the threshold value for each user. We measured the overall quality using the hit rate (HitRate@k) and normalized discounted cumulative gain (NDCG@k). HitRate@k is the number of correct predictions in the Top-N recommendations, where k is the number of recommendations, i.e., the cutoff value at which the HitRate is calculated. In our experiments, all datasets had a rating range of 1-5, and we set the threshold at 3.5; items that did not meet the threshold were omitted. The second metric, NDCG@k, measures the quality of the Top-N recommendation list based on rel_i, the actual rating score of the item at rank position i. We report HitRate and NDCG for k = [5, 10]. For these metrics, a higher value corresponds to better performance. Moreover, we used the root mean squared error (RMSE) to measure the performance of the proposed model. RMSE measures the error between the ratings that users gave and the ratings predicted by the model; it is computed from y_i, the actual value, ŷ_i, the predicted value of the observation, and n, the total number of data points.
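The three metrics can be sketched as follows. The exact formulas are not reproduced in this section, so the code uses widely used textbook forms as an assumption: HitRate@k as the fraction of users whose held-out movie appears in the top k, DCG with a linear gain rel_i / log2(i + 1), and RMSE as the square root of the mean squared difference. All function names are illustrative.

```python
import math
from typing import Sequence


def hit_rate_at_k(ranked_lists: Sequence[Sequence[int]], targets: Sequence[int], k: int) -> float:
    """Fraction of users whose held-out item appears in their top-k list."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain; rank positions start at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG of the produced ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error over n paired observations."""
    n = len(actual)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n)


if __name__ == "__main__":
    # Two users; each list is the model's ranking of candidate movie ids.
    rankings = [[3, 1, 2], [5, 4, 6]]
    held_out = [1, 6]
    print(hit_rate_at_k(rankings, held_out, k=2))
    print(ndcg_at_k([0, 1, 0], k=3))
    print(rmse([3.0, 4.0], [3.0, 2.0]))
```

Under leave-one-out evaluation each user has a single relevant item, so NDCG@k reduces to 1 / log2(rank + 1) when the item is ranked within the top k and to 0 otherwise.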
To assess the diversity of the recommended movies, we used a metric based on the pairwise distances between the movies' embeddings. These embeddings are high-dimensional vectors that represent movies in a latent space, capturing various aspects of each movie's genre. The K-means algorithm was then used to cluster these embeddings, grouping similar movies together based on their features to make our recommendations. We calculated the Euclidean distance between every pair of movie embeddings in the recommended list. Finally, the diversity score was defined as the average of all these pairwise distances ‖E_i − E_j‖, where E_i and E_j are the embeddings of movies i and j, and n is the total number of recommended movies.
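A minimal sketch of this diversity score, assuming the average is taken over all unordered pairs of the n recommended movies (the function name is illustrative):

```python
import math
from itertools import combinations
from typing import Sequence


def intra_list_diversity(embeddings: Sequence[Sequence[float]]) -> float:
    """Mean Euclidean distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:  # fewer than two movies: no pairwise distance is defined
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    recommended = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
    print(intra_list_diversity(recommended))
```

A list of near-identical movies scores close to 0, while a list spread across the embedding space scores higher, which is why drawing recommendations from several clusters tends to raise the value.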

Benchmark Models
In this section, we selected BPR, NeuMF, Caser, GRU4Rec, and SASRec as benchmarks to evaluate the performance. We implemented them using TensorFlow and the RecBole library and considered latent dimensions {10, 20, 30, 40, 50} as the common hyperparameter in all models. The maximum number of training epochs was 100. In addition, we used the NeuMF model with hidden sizes h = [64, 32, 16]. We tuned the hyperparameters on the validation set and terminated training if the validation performance did not improve for 40 epochs. The benchmark models are detailed as follows:

• Bayesian Personalized Ranking (BPR): This model is an optimization technique applied to matrix factorization, explicitly tailored to handling implicit feedback; it enhances the factorization process by incorporating a pairwise ranking loss function.

• NeuMF: This is a collaborative filtering model that models user-item interactions with an MLP instead of the inner product used in matrix factorization when calculating the relationship between the user and item.

• Caser: This adopts convolutional neural networks (CNNs) in horizontal and vertical dimensions to effectively model high-order Markov chains (MCs) for sequential recommendations.

• GRU4Rec: This is a unidirectional recurrent neural network (RNN)-based framework that is used to capture sequential dependencies and make predictions.

• SASRec: This is a state-of-the-art sequential recommendation model that leverages self-attention blocks to predict the next item for recommendation. Additionally, it uses the dot product between the sequential latent features of the latest item and the embedding of the target item as its scoring mechanism.
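To make the pairwise ranking loss used by BPR concrete, the sketch below computes the loss for a single (user, positive item, negative item) triple as −ln σ(x_ui − x_uj), with dot-product scores from latent factors. It is a simplified illustration: the regularization term of the full BPR objective is omitted, and the function names are ours.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def bpr_pair_loss(user: Sequence[float], pos_item: Sequence[float], neg_item: Sequence[float]) -> float:
    """-ln(sigmoid(score_pos - score_neg)), computed in a numerically stable way."""
    diff = dot(user, pos_item) - dot(user, neg_item)
    # -ln(sigmoid(d)) = softplus(-d) = max(-d, 0) + ln(1 + exp(-|d|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    user = [1.0, 0.0]
    print(bpr_pair_loss(user, [1.0, 0.0], [0.0, 1.0]))  # positive ranked higher: small loss
    print(bpr_pair_loss(user, [0.0, 1.0], [1.0, 0.0]))  # positive ranked lower: larger loss
```

The loss shrinks as the observed item is scored further above the unobserved one, which is what pushes interacted movies up the ranking during training.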

Results Analysis
This subsection presents the results of our proposed system on the two MovieLens datasets, 100 K and 1 M. We evaluated our algorithm using the common Top-N evaluation metrics HitRate@K and NDCG@K, highlighting its effectiveness in providing accurate and relevant recommendations. First, we compared our transformer model with non-sequence models (BPR and NeuMF) and sequence models (GRU4Rec, Caser, and SASRec). A non-sequence model only suggests movies to the user without explicitly considering the sequential order of their preferences or viewing history, and it ignores temporal information. A sequence model considers the temporal order of a user's interactions with the recommender. The user's behavior sequence can effectively characterize the user's changing preferences to a certain extent, improve the recommendation performance on sparse datasets, and capture long-term dependencies.
In Table 3, our transformer model achieves the best HitRate@5, HitRate@10, NDCG@5, and NDCG@10 results on both datasets compared to the BPR, NeuMF, GRU4Rec, Caser, and SASRec models. The results for HitRate@5, HitRate@10, NDCG@5, and NDCG@10 were (0.5676, 0.6633, 0.3714, 0.3783) on 100 K and (0.7034, 0.7309, 0.5869, 0.6238) on 1 M. This indicates that the proposed transformer architecture can effectively extract useful information from movie sequences, automatically select information with long-term dependencies, and predict the rating of the next target movie more accurately than the other models. Furthermore, the results show the effectiveness of the multi-head attention mechanism in capturing relevant information in different representation subspaces for different tasks, although it is time-consuming. The mechanism attends only to important movies and suppresses movies that are irrelevant to the recommendation.
Figures 3 and 4 show our analysis of the latent dimensionality hyperparameter d, which ranges from 5 to 50, for HitRate and NDCG at k = 5 and k = 10. We study BPR, NeuMF, GRU4Rec, Caser, SASRec, and our model with various values of d; performance generally increased with d on both datasets. Moreover, the performance of our proposed model steadily improved as the embedding dimension increased from 20 to 50, indicating that the model learns more efficiently when a larger d is used. Overall, for sparse datasets, the larger the embedding dimension, the better the performance that can be achieved. This effect is relatively stronger on 1 M than on 100 K. A higher HitRate@10 value indicates a greater number of relevant results, making the model more effective in suggesting the next target movie.
The performance of our movie recommendation system was then evaluated with the transformer model integrated with K-means. Table 4 demonstrates its effectiveness in enhancing recommendation quality and diversity. On the MovieLens 100 K dataset, our model achieved an RMSE of 1.0756, MAE of 0.8741, precision of 0.5516, recall of 0.3260, F1-score of 0.4098, and item coverage of 0.1165. The intra-list diversity with K-means was 0.2447, an improvement over the 0.2173 obtained without K-means. On the larger 1 M dataset, performance improved: an RMSE of 0.9927, MAE of 0.8007, precision of 0.5838, recall of 0.4723, F1-score of 0.5222, and item coverage of 0.3216. The intra-list diversity for this dataset was 0.3007 with K-means, compared to 0.2878 without K-means. These outcomes illustrate the model's scalability and its ability to make more accurate predictions with larger datasets. The higher precision, recall, and F1-score on the 1 M dataset indicate a stronger ability to identify and recommend relevant movies to users. The marked increase in item coverage on the 1 M dataset shows that our method is effective in recommending a broader spectrum of movies, potentially enriching the user experience by exposing users to a wider variety of content. Although the diversity gain from K-means is smaller on the larger dataset, the model maintains a balance when recommending popular movies, ensuring a diverse set of recommendations. This balance demonstrates the system's capacity to address critical challenges in recommendation systems, such as accuracy, diversity, and coverage, making it a valuable tool for personalized movie recommendations in an era of information overload.
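As a quick consistency check on the reported numbers, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the other two values. The short sketch below does this for both datasets; the dictionary layout and names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given in the text.
reported = {
    "100 K": (0.5516, 0.3260, 0.4098),
    "1 M": (0.5838, 0.4723, 0.5222),
}

if __name__ == "__main__":
    for name, (p, r, f1) in reported.items():
        print(name, "recomputed:", round(f1_score(p, r), 4), "reported:", f1)
```

Both recomputed values agree with the reported F1-scores to four decimal places.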

Conclusions
This work offers a unique sequence movie recommendation system that employs deep learning techniques to enhance user experiences by providing tailored movie suggestions based on the captured sequential relations. Our study mainly focuses on user demographics and user-item interactions in a sequence to help overcome the issues that cold-start, data sparsity, and scalability cause in standard recommendation algorithms. We used user demographic and movie sequence embeddings as input and fed them into the transformer architecture, which handles the sequential data to predict the next target movie.
The transformer model in our system exploits movie sequence embeddings, passing these dense vectors through a multi-head attention mechanism. This technique helps increase the understanding of user behavior dynamics. It interacts with a multilayer perceptron (MLP) fed by user demographic data to discover intricate relationships, thereby refining the personalized recommendations for the next target movie. Additionally, after model training, we applied KMeans clustering to detect underlying patterns in movie qualities that may not be visible from the predicted ratings. This enables our system to widen the range of choices by suggesting movies from various clusters with comparable genre features. This technique considerably enhances the personalized Top-N movie recommendations, matching them with the user's essential preferences, including those that are not immediately apparent.

Figure 4. Performance evaluation in terms of NDCG and hit rate on the 1 M dataset from positions 5 to 50 (i.e., k = [5, 10]).

Figure 5 illustrates the training error with the initial setting. The model showed a consistent decrease in both training and validation losses as the number of epochs increased. This suggests that the model improves its predictions over time. The model trained on the larger 1 M dataset showed a lower initial loss and a smaller final loss than the model trained on the smaller 100 K dataset. This suggests that the larger dataset might provide more information, allowing the model to learn a better representation of the underlying data distribution and make accurate predictions on unseen data.

Figure 5. Training and validation loss performances of our proposed model on both datasets.

Table 1. Sequence interaction of users and movies.

Table 2. Statistics of the MovieLens dataset.

Table 3. Overall performance comparison with state-of-the-art baseline movie recommendation models.

Table 4. Overall performance of our model on training and testing datasets.