Sequential Movie Genre Prediction using Average Transition Probability with Clustering

In recent movie recommendation research, predicting a user's sequential behavior and suggesting the next movie to watch is one of the most important issues. However, capturing such sequential behavior is not easy because each user's short-term and long-term behavior must be taken into account. For this reason, many studies report that the performance of recommending a specific movie is not very high in sequential recommendation. In this paper, we propose a cluster-based method for grouping users with similar movie purchase patterns, together with an algorithm that predicts the movie genre, rather than the movie itself, while considering the users' short-term and long-term behaviors. The movie genre prediction does not recommend a specific movie; instead, it predicts the genre of the next movie to watch from each user's genre preference, which is derived from the genres of the movies the user has watched. This provides appropriate guidelines for recommending movies of that genre to users who tend to prefer a specific genre. In particular, users with similar genre preferences are organized into clusters to recommend genres, and in clusters that do not show a clear tendency, genre prediction is performed after trimming genres that are not necessary for recommendation in order to improve performance. We evaluate our method on well-known movie datasets and show qualitatively that it captures personalized dynamics and is able to make meaningful recommendations.


Introduction
One of the most important parts of a recommendation system is to model the interactions between users and items, as well as the relationships amongst the algorithm, and compare the obtained results using the previous algorithm and the proposed algorithm. In [8], the authors introduced a recommender system using movie genre similarity and preferred genres. However, these works considered only a static situation, not a sequentially changing recommendation system.
In this paper, we first consider the prediction of the movie genres included in preferred movies before recommending movies. The genre is one of the important features of a movie and gives guidelines on which movies each user prefers. For the sequential movie recommendation system, we extract the genres of the movies each user watched, study the user's genre preference, and investigate which movie genre the user will watch next.
Our main contributions are described as follows:
• First, different from most prior research on movie recommendation systems, we focus on the genres included in a movie, rather than the movie itself, as the item of sequential prediction. Although this cannot be used directly in sequential movie recommendation systems, it shows how well genre-based prediction works in learning user preferences.
• Second, to predict the long-term behavior of users' genre preferences, we use RNN-based learning models, which have recently shown the best performance. Further, we consider an Average Transition probability Vector (ATV) between genres, modeled as a Markov chain, to reflect the short-term behavior of the user's preference as in [2]. To see the effect of the ATV, we consider four kinds of training data obtained by combining it with the genre vectors.
• Third, we propose a clustering approach based on k-means clustering, which groups users who have similar preferences for movie genres. In order to improve the prediction performance, we also propose a method for trimming genres that degrade performance, using the results obtained from the RNN-based sequential learning models presented above.
• Finally, we evaluate our method on well-known movie datasets and show qualitatively that it captures personalized dynamics and is able to make meaningful recommendations of movie genres. The results show that clustering with trimming improves the prediction performance, whereas applying the ATV yields only a negligible improvement.
The remainder of this paper is organized as follows. Section 2 discusses related studies. In Section 3, our clustering and training methods are presented. In Section 4, the experiment results of our proposed methods are presented, and some limitations and future work are given in Section 5. In Section 6, we conclude the paper.

Related Works
In recent sequential recommendation systems, most works focus on how to predict the short-term and long-term preference dynamics of users. For the short-term dynamics, the Markov chain (MC) approach has been studied. Zimdars et al. [9] described a sequential recommender based on Markov chains. They studied how to capture sequential patterns to predict the next state with a standard predictor such as a decision tree. Rendle et al. [6] proposed a Factorizing Personalized Markov Chain (FPMC), which models user-specific transitions of the basket history with a Markov chain. FPMC propagates information among users, items and transitions that have similar preferences or patterns in order to extract the sequential pattern. Shani et al. [10] considered a recommendation system based on Markov decision processes (MDP). For this, they used the Maximum Likelihood Estimates (MLE) of the MC transition graphs and suggested several heuristic approaches such as clustering and skipping. Mobasher et al. [11] adopted pattern mining methods to extract sequential patterns for generating recommendations. He et al. [3] proposed Translation-based Recommendation (TransRec) for sequential data. The main approach is to consider items (movies) as translation vectors. Khorasani et al. [12] also used the MC to recommend courses to students. To do this, they estimated the transition probability of the MC from the record of courses that students take, based on the MLE, and enhanced the MLE with skip-gram modeling. Koren et al. [13] considered temporal dynamics and showed several results on the evolution of users and items over time, based on Netflix data.
For the long-term dynamics, most recommendation methods rely on Matrix Factorization (MF) or other similarity-based approaches. In the prior work [14] using MF, the authors treated recommendation as the problem of inferring the missing values of a partially observed user-item matrix. Srebro et al. [15] proposed the maximum margin MF, which uses low-norm instead of low-rank factorizations. Salakhutdinov et al. [16] considered a probabilistic MF (PMF) model that expresses the user preference matrix as the product of two lower-rank user and item matrices. The PMF approach was especially effective at making better predictions for sparse user rating data. He et al. [1] suggested an extended FPMC, called Fossil, to represent the information of sequential patterns by considering high-order Markov chains and similarity models. Factorization machine-based sequential recommendation systems usually utilize matrix or tensor factorization to factorize the observed user-item data into latent factors of users and items for recommendations [4]. Specifically, some works [18,19] have used the estimated latent representations as the input of a network to further calculate an interaction score between users and items, or between successive user actions.
Recently, deep learning technologies have been introduced into the sequential recommendation problem, such as RNNs [4,20,21], Long Short-Term Memory (LSTM) [22,23], and the Gated Recurrent Unit (GRU) [24]. These deep learning-based recommender systems have shown particularly high performance for sequential recommendation. In [4], the authors suggested new ranking loss functions suited to RNNs in the recommendation model. In [20], the authors designed a novel recommendation model named Recurrent Collaborative Filtering (RCF), which combines RNN and CF. In [21], the authors introduced an algorithm named Recurrent Translation-based Network (RTN). Their model reflects both the short-term and long-term preferences of a user. In [22], the authors used LSTM to extract the dependencies of both users and movies. Unlike prior recommendation models, they considered a method of updating the state with recent operations as input. In [23], the authors introduced the LSIC model (Leveraging long and Short-term Information in Content-aware movie recommendation via adversarial training), which combines global behaviors from MF with the RNN for top-N movie recommendation.
In [24] the authors considered a GRU-based RNN for session-based recommendations. Yuan et al. [25] suggested a Convolutional Neural Network (CNN) method that gives a sequence of user-item interactions. In their model, a CNN first puts user-item interactions data into a matrix, regarding such a matrix as an "image" in the time and latent spaces. Wu et al. [26] proposed a GNN to capture the sequential behavior of complex transitions over user-item interactions. Zhang et al. [27] adopted a self-attention mechanism to extract the item-item interaction from the user's historical interactions. Sachdeva et al. [28] considered a variational autoencoder to model a user's preference based on her historical sequential data, and combines latent variables with temporal dependencies for preference modeling. Similar to our works, Choi et al. [7] designed a movie recommendation algorithm based on genre correlations. For this, they assumed that movie genres are defined by some experts such as directors or producers to guarantee reliability. Then, they computed genre correlations and used them in a movie recommendation system. In [8], the authors also consider a movie genre similarity to provide related services in a mobile experimental environment.

Genre prediction Algorithm
In this section, we propose a movie genre prediction algorithm. To do this, we first classify the genres included in the movie data watched by each user as shown in Figure 2. Next, we cluster the users into similar groups based on the ratings of the movies. Then, we estimate an average transition probability from genre to genre for each cluster. Using this, we train several deep learning models. Since sparse genre data may cause poor performance in predicting the genres, we appropriately trim these genres after model training, and we finally predict a preferred genre for the group closest to the user. Based on the predicted genre, suitable movies that contain that genre can be recommended. We describe all the above steps in detail in the following subsections.

Data Preprocessing
For sequential movie genre prediction, we sort the movie data by user ID and timestamp to extract each user's movie sequence in chronological order (left part of Figure 3). In the pre-processing, we drop users with fewer than five viewed movies and keep users with five or more. For each remaining user, the five most recent movies are arranged in chronological order. Next, the genres of the movies that the user has watched are collected (middle part of Figure 3). Note that a single movie may contain several genres at the same time. We extract all genres that each movie contains and encode each movie in the sequence as an n-dimensional (n > 0) one-hot vector, where each component corresponds to one of the n genres (right part of Figure 3). We denote by G the set of genres in the paper.
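The pre-processing described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `build_sequences`, the toy records, and the three-genre vocabulary are hypothetical and do not come from the paper, and a movie with several genres simply gets several ones in its vector.

```python
from collections import defaultdict

def build_sequences(ratings, movie_genres, genres, seq_len=5):
    """Map each kept user to the genre vectors of the seq_len most recent movies."""
    by_user = defaultdict(list)
    for user, timestamp, movie in ratings:
        by_user[user].append((timestamp, movie))
    index = {g: k for k, g in enumerate(genres)}
    sequences = {}
    for user, events in by_user.items():
        if len(events) < seq_len:
            continue                      # drop users with too few viewed movies
        events.sort()                     # chronological order by timestamp
        vectors = []
        for _, movie in events[-seq_len:]:
            vec = [0] * len(genres)
            for g in movie_genres[movie]:
                vec[index[g]] = 1         # one component per genre of the movie
            vectors.append(vec)
        sequences[user] = vectors
    return sequences

# Hypothetical toy data: (user_id, timestamp, movie_id) records.
GENRES = ["Action", "Comedy", "Romance"]
MOVIE_GENRES = {10: ["Action"], 11: ["Comedy", "Romance"], 12: ["Romance"],
                13: ["Action", "Comedy"], 14: ["Comedy"], 15: ["Action", "Romance"]}
RATINGS = [(1, 60, 15), (1, 10, 10), (1, 20, 11), (1, 30, 12), (1, 40, 13),
           (1, 50, 14), (2, 10, 10), (2, 20, 11)]
sequences = build_sequences(RATINGS, MOVIE_GENRES, GENRES)
```

In this toy run, user 2 is dropped because only two movies were watched, while user 1 keeps the five most recent of six movies.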

Clustering
K-means clustering
To reflect user similarity, we adopt a clustering approach. Each user rates each of the five most recently watched movies on a scale from 0.5 to 5. Based on the rating sequence of each user, we obtain the average rating of each genre as shown in Figure 4. Let U be the set of users. Since one average-rating vector is generated per user, we obtain a |U| by n rating matrix. We apply k-means clustering to this matrix and let $C := \{C_1, ..., C_k\}$ be the resulting set of clusters $C_l$ for $1 \le l \le k$.
In our genre prediction system, we use transition probabilities from genre to genre. In many approaches to sequential recommendation, the MC is used to reflect the short-term sequential behavior of a user. The MC assumes that the next choice of item depends only on the current choice. The transition probability matrix is generated for each cluster separately. To do this, we first consider the sequence of selected movies for each user in a cluster. However, as described before, a movie may contain multiple genres, such as romance, action and comedy, simultaneously. We therefore consider all genres included in the current and next movies and count the corresponding genre pairs in an n by n matrix. Then, for each $C_l \in C$, we sum these counts over all movies chosen by the users in the cluster and normalize them to obtain the transition probability from genre to genre as shown in Figure 5.
Formally, let $M^l_t$ and $M^l_{t-1}$ be the sets of movies selected by all users $u \in C_l$ at time t and t − 1, respectively. Then, the transition probability of the first-order Markov chain for the movie selection in cluster l is given by:
$$P\left(M^l_t \mid M^l_{t-1}\right). \tag{1}$$
However, in the genre prediction, we focus on the transition from genre (included in a movie) to genre. For this, we let $G_t \subset G$ be the set of genres contained in the movies $M^l_t$ of all users $u \in C_l$ at time t. For two genres $i, j \in G$, we model the genre transition probability in the cluster $C_l$ as:
$$p^l_{i,j} := P\left(j \in G_t \mid i \in G_{t-1}\right). \tag{2}$$

Estimation of Transition Probabilities.
To make predictions using the transition probability in (2), it needs to be estimated. To do this, we consider the following ratio:
$$\hat{p}^l_{i,j} = \frac{\left|\{(G_t, G_{t-1}) : j \in G_t \wedge i \in G_{t-1}\}\right|}{\left|\{(G_t, G_{t-1}) : i \in G_{t-1}\}\right|}, \tag{4}$$
where the value of the denominator in (4) is the number of occurrences of genre i at time t − 1. For a selected movie that contains, for example, romance, action and comedy, we obtain the transition probabilities from romance to each genre, from action to each genre and from comedy to each genre, respectively. Next, we take the average of these three transition probability vectors and call it the Average Transition probability Vector (ATV) of the selected movie. The reason why we use the ATV is that the actual data contain no information about the transition from a specific genre to another genre; only the transition from a movie including these genres to another movie is given. Formally, let $G_m \subset G$ be the set of genres contained in a selected movie m. Then the ATV can be presented by:
$$\mathrm{ATV}_j := \frac{1}{|G_m|} \sum_{i \in G_m} \hat{p}^l_{i,j}$$
for all $j \in G$. We will use this ATV for training with each user's selected movie sequence.
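As an illustration of the estimation step and the ATV, the following sketch counts genre pairs over consecutive movies within one cluster and then averages the rows of the genres present in a movie. It rests on assumptions that are not confirmed by the text: the names `estimate_transitions` and `atv` are hypothetical, and the counts are divided by the number of times genre i appears in the earlier movie, so rows need not sum to one. Row-normalizing the counts instead is an equally plausible reading of "normalize".

```python
def estimate_transitions(sequences, n):
    """Estimate genre-to-genre transition ratios for the users of one cluster.

    sequences: list of per-user sequences of 0/1 genre vectors of length n.
    """
    pair_counts = [[0] * n for _ in range(n)]
    prev_counts = [0] * n
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):      # consecutive movies of one user
            for i in range(n):
                if not prev[i]:
                    continue
                prev_counts[i] += 1              # genre i occurred at time t-1
                for j in range(n):
                    if nxt[j]:
                        pair_counts[i][j] += 1   # genre i followed by genre j
    return [[pair_counts[i][j] / prev_counts[i] if prev_counts[i] else 0.0
             for j in range(n)] for i in range(n)]

def atv(genre_vec, probs):
    """Average the transition rows of the genres contained in one movie."""
    rows = [probs[i] for i, flag in enumerate(genre_vec) if flag]
    if not rows:
        return [0.0] * len(probs)
    return [sum(col) / len(rows) for col in zip(*rows)]

# Toy cluster with genres 0 = romance, 1 = action, 2 = comedy (hypothetical data).
cluster_sequences = [
    [[1, 1, 0], [0, 0, 1], [1, 0, 0]],
    [[1, 0, 0], [1, 0, 1]],
]
probs = estimate_transitions(cluster_sequences, 3)
movie_atv = atv([1, 1, 0], probs)
```

For a movie containing romance and action, `movie_atv` is the component-wise mean of the romance row and the action row of `probs`.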

Model Training

Training Data Types
We consider the following four types of training data during the model training: (1) the sum of the ATV and the movie genre embedding, (2) the multiplication of the ATV and the movie genre embedding, (3) the successive ATV and movie genre embedding, and (4) the movie genre only. First, the sum of the ATV and the movie genre embedding is simply the component-wise sum of the movie genre vector and the ATV as shown in Figure 6. Second, the multiplication of the ATV and the movie genre embedding is the component-wise product of these two vectors, which results in a new vector. Third, the successive ATV and movie genre embedding attaches the ATV to the end of the movie genre vector. Finally, the movie genre only uses the movie genre vector alone. We consider these four training data types in order to check how much the ATV, which represents the short-term dynamics, helps to improve the model performance. We show the results for these four training data types in the experiments later.
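The four training data types can be sketched for a single movie as follows. The function name `make_training_inputs` and the dictionary keys are hypothetical labels chosen for this illustration; only the four vector operations themselves follow the description above.

```python
def make_training_inputs(genre_vec, atv_vec):
    """Build the four input variants for one movie from its genre vector and ATV."""
    return {
        "sum": [g + a for g, a in zip(genre_vec, atv_vec)],       # component-wise sum
        "product": [g * a for g, a in zip(genre_vec, atv_vec)],   # component-wise product
        "concat": list(genre_vec) + list(atv_vec),                # ATV appended to the genre vector
        "genre_only": list(genre_vec),                            # no ATV information
    }

inputs = make_training_inputs([1, 0, 1], [0.5, 0.25, 0.0])
```

Note that the concatenated variant doubles the input dimension (2n instead of n), so the first layer of the model has to be sized accordingly.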

Training Models
In our approach, we use RNN-based models, namely RNN, LSTM and GRU, to capture the long-term dynamics of sequential movie genre data. We describe these models in detail as follows.
(1) RNN. The RNN is a deep learning model designed for sequential data processing. It is a recurrent model that applies the same function to every input, and the output for the current input depends on past computations. When an output is generated, it is copied and fed back into the recurrent network. Based on the current input and the state produced from the previous inputs, the RNN learns sequential data and makes a decision. Formally, let $x_t$ be the input vector and $y_t$ be the output vector at time t as shown in Figure 7. Then, the state of the hidden layer $h_t$ at time t is given by:
$$h_t = \tanh(U x_t + W h_{t-1} + b),$$
where U and W are model parameter matrices, b is a constant vector, and tanh(·) is the hyperbolic tangent function. The output vector $y_t$ is given by:
$$y_t = f(V h_t),$$
where V is a model parameter matrix and f is an activation function. The RNN is optimized to approximate the target function by capturing sequential patterns. However, if the input sequence is long, the effect of the elements at the beginning of the sequence gradually weakens as the time step progresses and disappears after a certain period of time. This is because the same weights are multiplied repeatedly in each cycle. This is called the long-term dependency problem, and because of it the RNN is mainly useful for short sequences of data.
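A single recurrent step and a forward pass over a sequence can be sketched in plain Python as follows. This is a minimal illustration rather than the training code used in the experiments: the helper names are hypothetical, the hidden state is assumed to start at zero, and the output activation f is assumed to be a sigmoid, which the text does not specify.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, U, W, b):
    """Hidden state update: h_t = tanh(U x_t + W h_{t-1} + b)."""
    pre = [u + w + c for u, w, c in zip(matvec(U, x_t), matvec(W, h_prev), b)]
    return [math.tanh(p) for p in pre]

def rnn_forward(xs, U, W, V, b):
    """Run the RNN over a sequence; returns all outputs y_t and the last state."""
    h = [0.0] * len(b)                       # assumed zero initial state
    outputs = []
    for x_t in xs:
        h = rnn_step(x_t, h, U, W, b)
        # y_t = f(V h_t), with f assumed to be a sigmoid for multi-genre output
        outputs.append([1.0 / (1.0 + math.exp(-s)) for s in matvec(V, h)])
    return outputs, h

# Toy example: 2 input genres, 2 hidden units, 3 output genres (hypothetical sizes).
U = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.0, 0.0], [0.0, 0.0]]
V = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
outputs, last_h = rnn_forward([[1, 0], [0, 1]], U, W, V, b)
```

With the recurrent matrix W set to zero, as in this toy setup, the last hidden state depends only on the last input, which makes the role of the recurrence easy to see when W is changed.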
(2) LSTM. To overcome the main disadvantage of the RNN, the LSTM [30] has been introduced as an improved model. The LSTM is a kind of RNN that is capable of selectively remembering sequences over a long period of time. The main difference from the RNN is that the LSTM introduces a "cell state" for each time t, which allows information to flow unaltered. The cell state is regarded as a long-term memory, since previous information is stored in it through the recurrent structure of the cells. The forget gate is used to update the cell state: an output of 0 at a position means that the corresponding information is forgotten, and an output of 1 means that it is kept in the cell. The input gate determines which information should enter the cell state. Finally, the output gate determines which information should be passed on to the next hidden state. In this way, the LSTM addresses the long-term dependency problem of the RNN. In general, the LSTM consists of the following four parts as shown in Figure 8:
(i) Forget Gate Layer. As a first part, the forget gate layer decides which information to filter from the cell state by using a sigmoid function. It takes $h_{t-1}$ and $x_t$ and outputs a number between 0 and 1 for each number in the cell state $c_{t-1}$. The number 1 implies "completely keep this" while 0 represents "completely drop this." The forget gate vector $f_t$ is given by:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f),$$
where σ is the sigmoid function, $[h_{t-1}, x_t]$ is the concatenation of the two vectors, and $W_f$ and $b_f$ are the weight matrix and bias vector.
(ii) Input Gate Layer. In the next step, the LSTM decides which new information to store in the cell state. For this, a sigmoid "input gate layer" decides which values to update. Next, a tanh layer generates a vector of new candidate values, $\tilde{c}_t$, that could be added to the state. The input gate vector $i_t$ is given by:
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i),$$
where $W_i$ and $b_i$ are the weight matrix and bias vector. The cell input activation vector $\tilde{c}_t$ is computed by:
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c),$$
where $W_c$ and $b_c$ are the weight matrix and bias vector and tanh(·) is the hyperbolic tangent function.
(iii) Cell State Update. These two layers are then combined to create an update to the state: the old cell state is scaled by the forget gate and the gated candidate values are added, where * denotes the element-wise product:
$$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t.$$
(iv) Output Gate Layer. Finally, in the output gate layer, the LSTM decides what information to output. This output is based on the cell state, but is a filtered version of it. First, a sigmoid layer decides which parts of the cell state to output. Then, the cell state is passed through tanh and multiplied by the output of the sigmoid gate, so that only the selected parts are output. The output gate vector $o_t$ is given by:
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o),$$
where $W_o$ and $b_o$ are the weight matrix and bias vector of the output gate layer. Here, $h_t$ is computed by:
$$h_t = o_t * \tanh(c_t).$$
(3) GRU. Cho et al. [29] first introduced a slight variation on the LSTM, named GRU. It merges the forget and input gates into a single update gate. Further, it also combines the cell state and the hidden state. As a result, the GRU is simpler than the LSTM model. The detailed structure of the GRU, shown in Figure 9, is as follows:
(i) Update Gate. The GRU [31] begins by computing the update gate $z_t$ for time step t by:
$$z_t = \sigma(W_z x_t + U_z h_{t-1}),$$
where $W_z$ and $U_z$ are weight matrices. When the input $x_t$ enters the network, it is multiplied by its own weight $W_z$. The previous hidden state $h_{t-1}$ is likewise multiplied by its own weight $U_z$. The two results are added and a sigmoid is applied as the activation function. The update gate determines how much of the past information (from previous time steps) needs to be passed along to the next step. Its most useful property is that the model can decide to copy all the information from the past, which eliminates the risk of the vanishing gradient problem.
(ii) Reset Gate. Next, the model applies a reset gate to decide how much of the past information to forget:
$$r_t = \sigma(W_r x_t + U_r h_{t-1}),$$
where $W_r$ and $U_r$ are weight matrices. The reset gate differs from the update gate in its weights and in how the gate is used. As in the update gate, the model takes $h_{t-1}$ and $x_t$, multiplies them by their corresponding weights, sums the results and applies the sigmoid function.
(iii) Current memory content. The current memory content uses the reset gate to store the relevant information from the past. It is computed as:
$$\tilde{h}_t = \tanh(W x_t + r_t * U h_{t-1}),$$
where W and U are weight matrices and the operator * denotes the Hadamard (element-wise) product. The product with $r_t$ determines what to remove from the previous time steps. In this step, tanh is used as the nonlinear activation function.
(iv) Final memory at current time step. As the last step, the network needs to calculate $h_t$, which is a vector that holds the information for the current unit and passes it down the network. For this, the update gate is needed. It determines what to collect from the current memory content $\tilde{h}_t$ and what from the previous step $h_{t-1}$ by weighting them with the update gate value:
$$h_t = z_t * h_{t-1} + (1 - z_t) * \tilde{h}_t.$$
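The gate computations of the LSTM and the GRU described above can be sketched as single-step functions in plain Python. This is only an illustration under stated assumptions: the helper names and the parameter dictionary layout are hypothetical, the LSTM weights act on the concatenation of the previous hidden state and the input, and the GRU uses separate input and recurrent weight matrices without bias terms.

```python
import math

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh_vec(v):
    return [math.tanh(a) for a in v]

def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]

def mul(a, b):
    """Hadamard (element-wise) product."""
    return [x * y for x, y in zip(a, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p holds W_f, W_i, W_c, W_o and b_f, b_i, b_c, b_o."""
    hx = list(h_prev) + list(x_t)                             # [h_{t-1}, x_t]
    f_t = sigmoid_vec(add(matvec(p["W_f"], hx), p["b_f"]))    # forget gate
    i_t = sigmoid_vec(add(matvec(p["W_i"], hx), p["b_i"]))    # input gate
    c_tilde = tanh_vec(add(matvec(p["W_c"], hx), p["b_c"]))   # candidate values
    c_t = add(mul(f_t, c_prev), mul(i_t, c_tilde))            # cell state update
    o_t = sigmoid_vec(add(matvec(p["W_o"], hx), p["b_o"]))    # output gate
    h_t = mul(o_t, tanh_vec(c_t))
    return h_t, c_t

def gru_step(x_t, h_prev, p):
    """One GRU step; p holds W_z, U_z, W_r, U_r, W, U."""
    z_t = sigmoid_vec(add(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev)))
    r_t = sigmoid_vec(add(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev)))
    h_tilde = tanh_vec(add(matvec(p["W"], x_t), mul(r_t, matvec(p["U"], h_prev))))
    keep_new = [1.0 - z for z in z_t]
    return add(mul(z_t, h_prev), mul(keep_new, h_tilde))

# Toy check with all-zero weights: 3 input genres, 2 hidden units (hypothetical sizes).
zeros_2x5 = [[0.0] * 5 for _ in range(2)]
zeros_2x3 = [[0.0] * 3 for _ in range(2)]
zeros_2x2 = [[0.0] * 2 for _ in range(2)]
lstm_params = {"W_f": zeros_2x5, "W_i": zeros_2x5, "W_c": zeros_2x5, "W_o": zeros_2x5,
               "b_f": [0.0, 0.0], "b_i": [0.0, 0.0], "b_c": [0.0, 0.0], "b_o": [0.0, 0.0]}
gru_params = {"W_z": zeros_2x3, "U_z": zeros_2x2, "W_r": zeros_2x3,
              "U_r": zeros_2x2, "W": zeros_2x3, "U": zeros_2x2}
h_lstm, c_lstm = lstm_step([1, 0, 1], [0.0, 0.0], [2.0, -2.0], lstm_params)
h_gru = gru_step([1, 0, 1], [1.0, -2.0], gru_params)
```

With zero weights every gate outputs 0.5, so the LSTM halves the previous cell state and the GRU halves the previous hidden state, which is a convenient sanity check before plugging in trained parameters.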

Sub-genre trimming
Finally, using the evaluation results of the trained models, we perform a sub-genre trimming process based on a pre-defined threshold on the evaluation metric scores of each cluster. To do this, we first select the clusters that do not satisfy the criterion on the evaluation metrics. In the example of Figure 10, we set η = 0.5 and use two performance metrics, precision and recall. The minimum values of the evaluation metrics are $P^2_{min} = 0.6$ and $P^7_{min} = 0.55$ for cluster 2 and cluster 7, respectively. Hence, these clusters are not regarded as trimming clusters. However, we have $P^3_{min} = 0.45 < 0.5$ for cluster 3, so it is regarded as a trimming cluster.

Algorithm 1 Sub-genre Trimming
Input: Set of movie-genre matrices $M := \{M_i\}_{i=1}^{k}$, one for each cluster $C_i$, with evaluated values $P^i_e$; threshold parameters η and $\theta_i$.
Output: Set of sub-genre trimmed matrices $\tilde{M} := \{\tilde{M}_1, ..., \tilde{M}_k\}$.
for 1 ≤ i ≤ k do
    Set $|C_i|$ to the length (number of movies) of cluster $C_i$ and set $P^i_{min} := \min_{e \in E} P^i_e$;
    if $P^i_{min}$ < η then
        for 1 ≤ j ≤ n do
            Count the occurrences of genre j in column j of the movie-genre matrix $M_i$ and set $g_j = \sum_{u \in C_i} c_{uj}$;
            if $g_j < 100 \times \theta_i$ then replace all values of column j by zero;
        end for
    end if
end for

More precisely, we let $P^i_e$ be the evaluated value of a performance metric $e \in E$ for cluster i, where E is a set of performance metrics such as E = {Precision, Recall, Accuracy}. Next, we let $P^i_{min} := \min_{e \in E} P^i_e$ be the minimum evaluated value over all $e \in E$. Then, we check the value $P^i_{min}$ for each cluster, and if it is less than a pre-defined threshold η > 0, i.e., $P^i_{min} < \eta$, we choose the cluster as a target cluster for the sub-genre trimming. For example, if the evaluation metrics are precision and recall and the threshold η is 0.5, cluster 3 does not satisfy the criterion as shown in Figure 10. Hence, it is regarded as a target cluster for the sub-genre trimming. After selecting the target clusters, we find the sub-genres whose counts are less than a fraction $\theta_i$ of the total length (number of movies) of each target cluster i. To do this, we let $M_i = [c_{uj}]_{u \in C_i, 1 \le j \le n}$ be the movie-genre matrix of cluster i and let $\tilde{M}_i$ be its trimmed version. Using the matrix $M_i$, we find the genres whose total count is less than $100 \times \theta_i$, i.e., $\sum_{u \in C_i} c_{uj} < 100 \times \theta_i$. To minimize data loss, we replace the values by zero rather than deleting them. The reason for this step is that it increases the evaluation accuracy of clusters that do not have an explicit preference. In the example of Figure 12, the length of the cluster is 100 and θ = 0.1, so the threshold is 100 × 0.1 = 10. We choose the genres that do not have more than 10 entries in the cluster, such as documentary and war. After these procedures, we finally obtain the set of sub-genre trimmed matrices $\tilde{M} := \{\tilde{M}_1, ..., \tilde{M}_k\}$. We use these matrices to obtain the performance results.
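The trimming procedure can be sketched as follows. The function name `trim_subgenres` and the dictionary-based data layout are hypothetical, and the count threshold is taken as the cluster length times θ, which is an assumption chosen to match the worked example above (length 100 and θ = 0.1 give 10).

```python
def trim_subgenres(matrices, metric_scores, eta=0.5, theta=0.1):
    """Zero out rare genre columns in clusters whose worst metric is below eta.

    matrices: {cluster_id: list of 0/1 genre rows, one row per movie}
    metric_scores: {cluster_id: {metric_name: value}}
    """
    trimmed = {}
    for cluster, rows in matrices.items():
        rows = [list(r) for r in rows]            # copy so the input stays intact
        p_min = min(metric_scores[cluster].values())
        if rows and p_min < eta:                  # target cluster for trimming
            threshold = len(rows) * theta         # assumed: cluster length times theta
            for j in range(len(rows[0])):
                if sum(r[j] for r in rows) < threshold:
                    for r in rows:
                        r[j] = 0                  # replace by zero rather than delete
        trimmed[cluster] = rows
    return trimmed

# Hypothetical toy clusters: cluster 3 is below the threshold, cluster 2 is not.
rows3 = [[1, 1, 0]] * 5 + [[1, 0, 0]] * 4 + [[1, 0, 1]]
matrices = {2: [[0, 0, 1], [1, 0, 0]], 3: rows3}
scores = {2: {"precision": 0.6, "recall": 0.6},
          3: {"precision": 0.45, "recall": 0.7}}
trimmed = trim_subgenres(matrices, scores, eta=0.5, theta=0.2)
```

Keeping the column and filling it with zeros preserves the input dimension, so the same model architecture can be reused for trimmed and untrimmed clusters.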

Experiment Results
In this section, we present our experiment results. We first describe the well-known movie dataset and the performance metrics used for the evaluation.

Data
For the simulation, we use the MovieLens dataset (ml-25m), which contains 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users [32]. For the sequential recommendation, we sort the data by 'userId' and 'timestamp' (to extract each user's movie sequence in chronological order) as shown in Figure 3. To configure the dataset for the experiment, we drop users with fewer than 5 viewed movies and keep users with 5 or more. For each remaining user, five movies are arranged in chronological order. To train the model, we convert each element of the data sequence to a 19-dimensional one-hot vector, where each component corresponds to one of the 19 genres: {Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, IMAX, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western}.

Performance Metrics
As performance metrics, we consider (i) Precision, (ii) Recall, (iii) Accuracy and (iv) F1-score. To formally explain these metrics, we denote by True Positives (TP) the number of correctly predicted positive values, where the actual value is yes and the predicted value is also yes. True Negatives (TN) is the number of correctly predicted negative values, where the actual value is no and the predicted value is also no. False Positives (FP) is the number of cases where the actual value is no and the predicted value is yes. False Negatives (FN) is the number of cases where the actual value is yes but the predicted value is no. Then, the four metrics are described as follows:
(i) Precision: Precision is the ratio of correctly predicted positive answers to the total predicted positive answers.
$$\mathrm{Precision} := \frac{TP}{TP + FP}.$$
(ii) Recall: Recall is the ratio of correctly predicted positive answers to all answers in the actual positive class.
$$\mathrm{Recall} := \frac{TP}{TP + FN}.$$
(iii) Accuracy: Accuracy is the ratio of correctly predicted answers to the total answers.
$$\mathrm{Accuracy} := \frac{TP + TN}{TP + TN + FP + FN}.$$
Accuracy is a good measure when the numbers of false positives and false negatives in the dataset are almost the same.
(iv) F1-Score: This metric is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.
$$\mathrm{F1} := \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
The F1-score is usually more useful than accuracy when the numbers of false positives and false negatives in the dataset are quite different.
Using the previously described data and performance metrics, we obtain various experimental results of the movie genre prediction in the following subsection.
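The four metrics can be computed from binary genre predictions as in the following sketch. It assumes that the counts are pooled over all users and all genre positions, which is one possible aggregation and is not stated in the text; the function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN over all positions of 0/1 genre vectors."""
    tp = fp = tn = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def evaluate(tp, fp, tn, fn):
    """Return precision, recall, accuracy and F1 with zero-division guards."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Hypothetical toy labels for two users and four genres.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 1, 0, 0], [0, 1, 0, 0]]
counts = confusion_counts(y_true, y_pred)
scores = evaluate(*counts)
```

In this toy case accuracy (0.75) is higher than precision, recall and F1 (all 2/3), because most positions are true negatives, which is typical when each movie has only a few of the available genres.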

Results
In this subsection, we examine how much the prediction performance is affected by (1) clustering, (2) sub-genre trimming, and (3) the ATV. To see the effect of clustering, we compare the results before and after clustering. We consider seven clusters obtained by applying k-means in the clustering step, and the results show the mean performance over all clusters as well as the best and worst performance among them. To select the trimming clusters, we set η = 0.5 and $\theta_i$ = 0.1 for every cluster i.
As a first result, we show the performance of each model before and after clustering in Figure 11. Without clustering (Figure 11(a), Figure 11(d) and Figure 11(g)), the performance is measured over all users without distinguishing them by any criterion. Because the data are not grouped, it is difficult for the models (RNN, LSTM and GRU) to capture the structure of the data and to extract useful information. Therefore, all three models show relatively poor performance. To improve performance, we apply clustering to group users with similar preferences. After clustering (Figure 11(b), Figure 11(e) and Figure 11(h)), the performance improves considerably, and this is the largest performance gain among all stages of our experiment. In this case, users with similar preferences are grouped together, so that in a group whose preferences are well expressed, the gap between preferred and non-preferred genres is very large. In other words, the amount of data from genres with clear preferences is overwhelmingly large. This makes it easier for the model to recommend movies that the group would like. However, even after clustering, there were occasionally one or two clusters in which the preference was not clearly evident (Figure 11(c), Figure 11(f) and Figure ??).
In Tables 1-3, we report the results of the four performance metrics (Recall, Precision, Accuracy and F1-score) for the three training models, respectively. In this experiment, we use [Genre*ATV] as the representative training data type. Here, BC means before clustering and AC means after clustering. AC (best) and AC (worst) indicate the best and worst results among the clusters, and AC (mean) is the mean over all clusters. We see that the performance of AC improves on BC for all four metrics except in the worst case. Further, accuracy has the highest value among the metrics, since the numbers of false positives and false negatives in the dataset are not very different.

Effects of Sub-genre Trimming
In order to maximize the advantages of clustering, we trim the sub-genres that are not preferred in a cluster. To evaluate this, we set the threshold η to 0.5, and a cluster whose score falls below 0.5 for any one of Recall, Precision, Accuracy and F1-score is subject to trimming. We consider the following three cases: before clustering, after clustering, and after trimming. We use two training data types, [Genre, ATV] and [Genre*ATV], for each model. As shown in Figure 12, most of the metrics have larger values after trimming than in the other cases, except for precision. This is because precision considers the ratio of correctly predicted positive observations to the total predicted positive observations. Further, accuracy has the highest value for all three models. In Tables 4-6, we report the results of the four performance metrics (Recall, Precision, Accuracy and F1-score) for the three training models, respectively. In this experiment, we use [Genre*ATV] as the representative training data type. Here, BT means before trimming and AT means after trimming. AT (best) and AT (worst) indicate the best and worst results among the clusters, and AT (mean) is the mean over all clusters. We see that the performance of AT is improved for all four metrics.

Effects of Average Transition Probability
Finally, we compare the prediction performance for the four training data types described above to examine the effect of applying the ATV. In the results, we consider the average evaluation values over all clusters. As shown in Figure 13, both before and after clustering there is no significant difference in performance between applying and not applying the transition probability. Before clustering, the users as a whole have no distinct characteristics, so this setting is not favorable for generating a transition probability that reflects a user's preference. After clustering, users with similar preferences are grouped together, so the ATV would be expected to have a positive effect; contrary to expectations, however, the effect is insignificant. Rather, except for the RNN, the performance is slightly higher when the transition probability is excluded. Since the RNN puts more weight on recent information due to the characteristics of the model, we conjecture that it represents the transition probability more effectively than the other models. In contrast, the effects of clustering and trimming remain quite large.

Discussion
In this work, we examined a method for genre prediction as a preliminary step for sequential movie recommendation. However, we did not specifically deal with how to make sequential movie recommendations based on it. In order to solve problems such as the cold start caused by the limitation of movie data, [7] also proposed a recommendation system based on correlation information between movie genres. However, that study does not suggest a recommendation method for movies with sequential dynamics. For a sequential movie recommendation system, it is necessary to study how to use the genre information of a movie to achieve good prediction performance. In addition, it is necessary to design how to select and recommend movies that include the recommended genres, based on the results of the sequential movie genre prediction shown in our study. As a learning method for the short-term dynamics, we used the ATV derived from a Markov chain. A likely reason why it did not have much effect on the performance is that the RNN-based deep learning models we used already learn some of the short-term dynamics. Therefore, we also need to examine whether the short-term dynamics are better estimated by a higher-order MC, which uses more information from the past than the first-order MC, which estimates the next step only from the result of the previous step. We leave these issues as our future work.

Conclusion
In this paper, we proposed a sequential movie genre prediction algorithm based on the MC for the short-term behavior and the RNN for the long-term behavior of user preference. The movie genre prediction does not recommend a specific movie; instead, it recommends the genre of the next movie to watch in consideration of each user's genre preference, which is derived from the genres of the movies the user has watched. For this, users with similar genre preferences are organized into clusters to recommend genres, and in clusters that do not show a clear tendency, genre prediction is performed after trimming genres that are not necessary for recommendation in order to improve performance. We performed various experiments with our method on well-known movie datasets, and the results showed that clustering and sub-genre trimming improve the prediction performance, whereas the ATV yields only a negligible improvement.