A Two-Stage Neural Network-Based Cold Start Item Recommender

Abstract: Nowadays, recommendation systems have been successfully adopted in various online services such as e-commerce, news, and social media. Recommenders provide users with a convenient and efficient way to find items of interest and increase service providers' revenue. However, many recommenders suffer from the cold start (CS) problem, where only a small number of ratings are available for some new items. To overcome these difficulties, this research proposes a two-stage neural network-based CS item recommendation system. The proposed system includes two major components: the denoising autoencoder (DAE)-based CS item rating (DACR) generator and the neural network-based collaborative filtering (NNCF) predictor. In the DACR generator, a textual description of an item is used as auxiliary content information to represent the item. Then, the DAE is applied to extract content features from the high-dimensional textual vectors. With the compact content features, a CS item's ratings can be efficiently derived based on the ratings of similar non-CS items. Second, the NNCF predictor is developed to predict the ratings in the sparse user-item matrix. In the predictor, both sparse binary user and item vectors are projected to dense latent vectors in the embedding layer. Next, the latent vectors are fed into multilayer perceptron (MLP) layers for user-item matrix learning. Finally, appropriate item suggestions can be accurately obtained. Extensive experiments show that the DAE can significantly reduce the computational time of item similarity evaluation while preserving the original features' characteristics. Besides, the experiments show that the proposed NNCF predictor outperforms several popular recommendation algorithms. We also demonstrate that the proposed CS item recommender can achieve up to 8% MAE improvement compared to adding no CS item ratings.


Introduction
Nowadays, recommendation systems play a critical role in promoting sales and services in many online applications. For instance, 80 percent of movies watched on Netflix come from recommendations, and 60 percent of video clicks on YouTube come from home page recommendations [1,2]. Typically, recommendation systems can be classified into content-based (CB), collaborative filtering (CF), and hybrid approaches. Among them, CF is the most popular approach since it does not need to analyze the items' content. Instead, it relies on the relationship between users and items, typically encoded in a rating preference matrix. Although the CF approach has been successfully applied in many domains, it suffers from the cold start (CS) problem [3]. The CS problem arises when new items have received few or no ratings, so they cannot be correctly linked with similar items [4]. When little or no preference information is available, recommendation accuracy drops significantly. To solve this problem, many researchers adopt various auxiliary data, such as text descriptions, images, or videos, to derive ratings of CS items [5][6][7]. However, most of these auxiliary data are high-dimensional, so evaluating item similarity takes much longer.
In general, matrix factorization (MF) is one of the most popular methods to implement the collaborative filtering (CF) concept [8,9]. One strength of MF is that it can incorporate implicit feedback, which is not given directly but can be derived by analyzing user behavior. MF algorithms work by decomposing the user-item matrix into the product of two lower-dimensional rectangular matrices. Much research effort has been devoted to enhancing MF, such as integrating it with neighbor-based models [8], combining it with topic models of item content [10], and extending it to factorization machines for generic feature modeling [11]. Despite the effectiveness of MF for collaborative filtering, it is well known that its performance can be hindered by the simple choice of the interaction function: the inner product [12].
This research integrates neural networks and a collaborative filtering method for CS item recommendation to solve the above difficulties. The proposed recommendation system includes two components: the denoising autoencoder (DAE)-based CS item rating (DACR) generator and the neural network-based collaborative filtering (NNCF) predictor. The DACR generator derives CS item ratings from similar non-CS items using auxiliary textual information. In the generator, the DAE, a neural network-based dimension reduction method, is applied to extract content features from item vectors. With the compact content feature vectors, the rating of a CS item can be derived efficiently. The NNCF predictor is designed to deal with the sparse preference prediction problem. In the NNCF predictor, one-hot encoding is applied to convert the representations of users and items into binary sparse vectors. These long sparse vectors are then projected to dense latent vectors in the embedding layer. Next, the latent vectors are fed into multilayer perceptron (MLP) layers for user-item matrix learning. When the target user is specified, the trained NNCF predictor returns the ratings of all items.
The major contributions of this study are summarized below. First, we combine the strengths of neural networks and a collaborative filtering approach to solve the CS item recommendation problem. To the best of our knowledge, this combination has rarely been studied before. Second, we apply the DAE, a neural network-based dimension reduction method, to extract content features from item vectors in the proposed DACR generator. The experiments show that the proposed DACR generator can overcome the sparsity and redundancy of high-dimensional vectors and greatly reduce the computational time of item similarity evaluation. Third, we perform experiments on a real-world dataset and demonstrate the effectiveness of the proposed recommendation system. The remainder of this paper is organized as follows. Section 2 reviews the relevant research. Section 3 introduces the proposed CS item recommendation system, which integrates deep neural networks and a collaborative filtering approach. Section 4 describes an implementation case to show the feasibility and performance of the proposed system. Section 5 presents conclusions and suggestions for future work.

Cold Start Problems
When the dataset is sparse, it is difficult for recommender systems to provide high-quality recommendations. A method to alleviate the new user cold start problem for recommender systems applying collaborative filtering was presented by [13]. They proposed a model combining similarity techniques and prediction mechanisms for retrieving recommendations. A novel approach for alleviating the cold start problem by imputing missing values into the input matrix was proposed by [14]. Their system combined local learning and attribute selection to optimize the recommendation process. They evaluated the proposed framework on one synthetic and two real datasets, using four different matrix factorization algorithms. A novel solution for cross-site cold start product recommendation, which recommends products from e-commerce websites to users of social networking sites in cold start situations, was proposed by [15]. They used users linked across social networking sites and e-commerce websites as a bridge to map users' social networking features to another feature representation for product recommendation. A hybrid interactive context-aware recommender system applied to the tourism domain was proposed by [1]. The approach combined case-based reasoning and an artificial neural network to overcome the cold start problem for a new user with few prior ratings. The proposed method can suggest a tour to a user with limited knowledge about his preferences and considers how the user's preferences change during the recommendation process. A hybrid recommendation model for dealing with the cold start problem was proposed by [3], in which item features were learned from the retrieved item descriptions with a deep learning architecture, the SDAE, and then fed into the timeSVD++ model. Experiments were performed to evaluate the collaborative filtering recommendation model by calculating the RMSE on a movie dataset.
A novel crowd-enabled framework called CrowdStart, which utilizes the wisdom of crowds via crowdsourcing, was proposed by [16]. The intuition behind the CrowdStart framework is based on conventional expert systems: the knowledge of domain experts helps solve complex problems that are difficult to solve with machine-only algorithms. The experimental results show that crowd workers provide relevant, diverse, reliable, and explainable crowd-based neighbors for a new item, and these crowd-based neighbors are helpful for new item recommendations. A niche approach that applies interrelationship mining to item-based collaborative filtering (IBCF) was proposed by [17]. The approach utilizes interrelationship mining to extract new binary relations between each pair of item attributes and constructs interrelated attributes to enrich the available information on a new item. A joint personalized Markov chains (JPMC) model to address the cold start issues in implicit feedback recommendation systems was proposed by [18]. That research first utilizes user embedding to mine network neighbors; a two-level model based on Markov chains at both the user level and the user group level is then proposed to model user preferences dynamically. Useful user selection criteria based on items' attributes and users' rating history, combined in an optimization framework for selecting users, were designed by [19]. By exploiting the feedback ratings, users' previous ratings, and items' attributes, their research then generates accurate rating predictions for the other, unselected users. A user similarity detection engine (USDE) for addressing newcomers' lack of initial social links was proposed by [20]. The paper utilizes users' smart devices to enable the USDE to extract real-world social interactions between users automatically. The USDE uses a user clustering algorithm to identify similar users based on their profiles and then provides more personalized recommendations.

Deep Learning-Based Recommendation Systems
With the development of artificial intelligence, much research has shown that deep learning-based methods achieve good performance in recommendation systems. Autoencoders have been applied in CF recommenders in the last few years. A deep learning model, the stacked denoising autoencoder, integrated with probabilistic matrix factorization, was adopted by [21]. To satisfy the need for relational deep learning, they proposed a probabilistic formulation of the stacked denoising autoencoder and then extended it to a relational stacked denoising autoencoder model. An autoencoder framework called AutoRec for collaborative filtering was proposed by [22]. A collaborative denoising autoencoder (CDAE) method for Top-N recommendation was presented by [23]. A deep autoencoder model trained end-to-end without any layer-wise pre-training for the rating prediction task was proposed by [24]. Unlike these works, in which autoencoders are integrated into the rating prediction process, the autoencoder in our study extracts compact content features from high-dimensional vectors.
A convolutional neural network (CNN) for the hashtag recommendation problem was proposed by [25]. They designed a novel attention-based CNN architecture incorporating a trigger word mechanism, including a local attention channel and a global channel. Experimental results showed that the proposed method could achieve significantly better performance than state-of-the-art methods. A dual-net deep network model for recommending images to users was designed by [26]. The network consists of two sub-networks, which map an image and user preferences into the same latent semantic space. Ref. [27] aimed to alleviate platform editors' workload by automating the manual article selection process and recommending a subset of articles that fits the human editor's taste and interest. They proposed a dynamic attention deep model for the editor article recommendation task. The model used character-level text modeling and convolutional neural networks to learn the representation of each article effectively.
A neural architecture called PACE to bridge collaborative filtering and semi-supervised learning for point of interest (POI) recommendation was developed by [28]. PACE is a deep neural architecture that jointly learns the embeddings of users and POIs to predict user preferences over POIs and the various contexts associated with users and POIs. A hybrid method called location-aware personalized news recommendation with explicit semantic analysis (LP-ESA), which recommends news using both the user's interests and geographical contexts, was proposed by [29]. They further proposed a novel method called LP-DSA, which exploits recommendation-oriented deep neural networks to extract dense feature representations. Experimental results showed that LP-DSA further improves news recommendation. The inner product was replaced with a neural architecture by [10]. They presented a neural network architecture to model the latent features of users and items to tackle the implicit collaborative filtering problem.
A hybrid neural recommendation model that learns deep representations for users and items from both ratings and reviews was proposed by [30]. Three major components are proposed: a rating-based encoder, a review-based encoder, and the prediction module. Besides, the research introduces a novel review-level attention mechanism that incorporates the rating-based representation as a query vector to select valuable reviews. A novel multicriteria collaborative filtering model based on deep learning was proposed by [31]. The model obtains the users' and items' features and uses them as input to the criteria ratings deep neural network, which predicts the criteria ratings. Those criteria ratings are then input to the overall rating deep neural network for rating prediction. A novel hybrid probabilistic matrix factorization model, which models users' preferences from their auxiliary information and differentiates the effects of the core terms extracted from items' comments, was proposed by [32]. Two deep learning-based subcomponents are designed for this task, and a global objective function that optimizes the model parameters under a unified framework is proposed. A deep hybrid recommendation model that integrates matrix factorization with a convolutional neural network (CNN) was proposed by [33]. Furthermore, that research offers an adversarial training framework to learn the hybrid recommendation model, where a generator model is built to learn the distribution over pairwise ranking pairs. A performance evaluation of a recommending interface (PERI) framework to automatically adjust an optimal recommending interface according to the characteristics of users and their goals was proposed by [34]. In the framework, a deep neural network is used to predict the efficiency of a particular recommendation presented in a selected position and with a chosen degree of intensity. A session-based graph convolutional neural network (GCNN)-based product recommendation model that incorporates similarity between multiple users to produce an optimized, accurate, and intelligent recommendation system was proposed by [35]. The experiments showed that the complexity and computational time were decreased by estimating the similarity among nodes and sampling the nodes before training.

Methodology
Let U = {u_1, ..., u_N} be the set of N users and V = {v_1, ..., v_M} be the set of M items. The users' feedback for the items can be represented by an N × M preference matrix R, where r_uv is the preference value for item v by user u. In this study, r_uv is explicitly provided by the user in the form of an integer value (e.g., 1-5). Let U(v) = {u ∈ U | r_uv ≠ null} denote the set of users that expressed a preference for item v. An item v is defined as a cold start item (CS item) if |U(v)| ≤ ρ, where |·| is the cardinality of a set and ρ is a given threshold value. Typically, the text description of an item provides beneficial auxiliary information about the item's characteristics. If we can find non-CS items whose text descriptions are similar to those of a CS item, the CS item's ratings can be derived from the ratings of those similar non-CS items. Therefore, this study's first goal is to develop a CS item rating generator based on the ratings of similar non-CS items. The derived ratings of CS items are then added to the original preference matrix R to form an updated preference matrix R′ ∈ R^(N×M). Based on R′, the second goal of this study is to develop a robust recommendation model that can deal with sparse preferences and accurately predict the user's preferences.
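As a concrete illustration of the CS item definition above, the following sketch flags items whose expressed-preference count |U(v)| is at most ρ. The rating-dictionary layout, the `find_cs_items` name, and the toy data are illustrative assumptions, not part of the paper.

```python
# Hypothetical sketch: identifying cold start (CS) items from a preference
# matrix, following the definition |U(v)| <= rho.

def find_cs_items(ratings, rho):
    """ratings: dict mapping (user, item) -> rating (None = no preference)."""
    counts = {}
    for (user, item), r in ratings.items():
        if r is not None:                 # only count expressed preferences
            counts[item] = counts.get(item, 0) + 1
    # an item with rho or fewer ratings (including none) is a CS item
    all_items = {item for (_, item) in ratings}
    return {v for v in all_items if counts.get(v, 0) <= rho}

ratings = {(1, "a"): 5, (2, "a"): 4, (3, "a"): 3, (1, "b"): 2}
print(find_cs_items(ratings, rho=1))  # {'b'}: only one user rated item b
```

With ρ = 1, item "a" (three ratings) is a non-CS item, while item "b" (one rating) is flagged as a CS item.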
To fulfill the above goals, this research proposes a two-stage CS item recommender. The major components of the proposed recommender are the denoising autoencoder-based CS item rating (DACR) generator and the neural network-based collaborative filtering (NNCF) predictor. In the DACR generator, textual descriptions of items are collected and used to generate the items' content features. Through the preprocessing tasks of tokenization, stop-words removal, and stemming, a set of meaningful terms is derived from the textual descriptions of all items. Each item is then represented in vector format, where each entry in the vector represents the occurrence frequency of a term. Next, the DAE, a neural network-based dimension reduction method, is applied to extract compact content features from the high-dimensional vectors. Based on the compact content feature vectors, a CS item's ratings can be derived from the ratings of similar non-CS items in a more efficient way. In the second stage, the updated user-item rating matrix R′ is used to train the NNCF predictor. First, the unique identifications of users and items are converted to vector format through one-hot encoding. The long sparse vectors are then projected to dense latent vectors in the embedding layer. Next, the latent vectors are fed into multilayer perceptron (MLP) layers for user-item matrix learning. The objective of the NNCF predictor is to minimize the loss between the predicted ratings and the real ratings. When the user ID is specified, the trained NNCF predictor returns the ratings of all items. After sorting the ratings, the Top-N item suggestions are returned to the user. Figure 1 illustrates the framework of the proposed cold start item recommendation system.


The DACR Generator
The primary tasks in the DACR generator include text preprocessing, content feature extraction, and CS item rating generation.

Content Information and Text Preprocessing
It is straightforward to take the textual description of an item to describe the item's characteristics. For example, the textual description (movie plot) "The classic Shakespearean play about a murderously scheming king staged in an alternative fascist England setting" can be useful content information for representing the movie "Richard III." Typically, the textual description of an item contains many words offering little useful information. Therefore, text preprocessing, including tokenization, stemming, and stop-words removal, is applied to all texts. Tokenization is the procedure of splitting a text into words, phrases, or other meaningful parts. Stemming is the process of reducing a word to its stem by removing suffixes and prefixes. Stop words are words commonly encountered in texts regardless of topic (such as conjunctions, prepositions, articles, etc.). After the preprocessing, a set of meaningful terms T = {t_1, t_2, ..., t_d} (also called a bag of words) is generated. Based on T, an item v_i can be represented as v_i = <v_i,1, v_i,2, ..., v_i,d>, where v_i,j is the occurrence frequency of term t_j for item v_i.
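The preprocessing pipeline above (tokenization, stop-words removal, stemming, and term counting) can be sketched as follows. The stop-word list and the suffix-stripping stemmer are toy stand-ins introduced for illustration; a real system would use a full stop-word list and a proper stemmer such as Porter's.

```python
import re
from collections import Counter

# Toy stop-word list for illustration only
STOP_WORDS = {"the", "a", "an", "in", "about", "and", "of", "to"}

def stem(word):
    # crude suffix stripping, a stand-in for a real stemmer
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                     # stemming

plot = "The classic Shakespearean play about a murderously scheming king"
terms = preprocess(plot)
tf = Counter(terms)  # term -> occurrence frequency, i.e., one item vector entry
print(terms)
```

The resulting term frequencies in `tf` correspond to the entries v_i,j of the item vector.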

Content Feature Extracting Using DAE
Typically, item v_i is a sparse vector because the number of dimensions in the vector is large. This makes similarity evaluation between two items inefficient. In this study, the denoising autoencoder (DAE) neural network [36] is used to reduce an item's vector space while maintaining its essential characteristics. The autoencoder can be divided into an encoding part and a decoding part. The encoding part encodes the input data into a low-dimensional representation over a few layers. In contrast, the decoding part maps the low-dimensional vector back to its original dimensionality.
Mathematically, the autoencoder takes an input vector x ∈ [0, 1]^d and maps it to a hidden representation y ∈ [0, 1]^d′ through a deterministic mapping y = f_θ(x) = s(Wx + b), parameterized by θ = {W, b}, where W is a d′ × d weight matrix, b is a bias vector, and s is an activation function. The resulting latent representation y is then mapped back to a reconstructed vector z ∈ [0, 1]^d, where z = g_θ′(y) = s(W′y + b′) with θ′ = {W′, b′}. The weight matrix W′ of the reverse mapping may optionally be constrained by W′ = W^T, in which case the autoencoder is said to have tied weights. Each training sample x^(i) is thus mapped to a corresponding y^(i) and reconstruction z^(i). The parameters of this model are optimized by minimizing the following average reconstruction error:

θ*, θ′* = argmin_{θ,θ′} (1/n) Σ_{i=1}^{n} L(x^(i), z^(i)),

where L is a loss function such as the traditional squared error L(x, z) = ||x − z||². This optimization can be carried out by various methods, such as stochastic gradient descent. An alternative loss, suggested by the interpretation of x and z as either bit vectors or vectors of bit probabilities, is the reconstruction cross-entropy:

L_H(x, z) = −Σ_{k=1}^{d} [x_k log z_k + (1 − x_k) log(1 − z_k)].

To improve the effectiveness of traditional autoencoders, a modified autoencoder called the denoising autoencoder (DAE) was proposed [36]. Besides the encoding and decoding parts, the denoising autoencoder has a corruption part. The first step of the DAE applies a stochastic mapping x̃ ~ q_D(x̃ | x) to randomly corrupt the input data x. The random corruption forces part of the input entries to 0, while the other values remain. These randomly destroyed values train the denoising autoencoder to restore the damaged data. Finally, the error between the restored data and the original data is calculated. Figure 2 visually shows the process of the denoising autoencoder. Note that x is the input data and is converted to x̃ by partial corruption. As in the traditional autoencoder, the corrupted data x̃ is reduced in dimension and mapped to the hidden layer y = f_θ(x̃) = s(Wx̃ + b). Next, z = g_θ′(y) = s(W′y + b′) reconstructs y back to the original dimension. Finally, the optimal parameters of the denoising autoencoder are found by training to minimize the average reconstruction error.
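A minimal sketch of one DAE forward pass, assuming tied weights (W′ = W^T), a sigmoid activation, and a squared-error reconstruction loss; the dimensions, weights, and corruption rate are toy values, not the settings used in the paper.

```python
import math
import random

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def dae_forward(x, W, b, b_prime, corruption=0.3, rng=random.Random(0)):
    # corruption step: stochastically force a fraction of inputs to 0
    x_tilde = [0.0 if rng.random() < corruption else xi for xi in x]
    # encode: y = s(W x~ + b)
    y = sigmoid([h + bi for h, bi in zip(matvec(W, x_tilde), b)])
    # decode with tied weights: z = s(W^T y + b')
    z = sigmoid([h + bi for h, bi in zip(matvec(transpose(W), y), b_prime)])
    # squared reconstruction error against the UNCORRUPTED input x
    loss = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return z, loss

# toy 4-dimensional input compressed to 2 hidden units
x = [1.0, 0.0, 1.0, 1.0]
W = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.4]]
z, loss = dae_forward(x, W, b=[0.0, 0.0], b_prime=[0.0] * 4)
print(round(loss, 3))
```

Note that the loss compares the reconstruction z with the clean input x, not the corrupted x̃, which is what forces the network to learn to restore the destroyed entries.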

CS Item Rating Generation
Based on the trained DAE, item v_i is converted to y_i = <y_i,1, y_i,2, ..., y_i,d′>, where y_i,j is the value of compact content feature j for item v_i. To generate ratings for a CS item, the similarity between the CS item and the non-CS items is evaluated first. In this study, Pearson's correlation coefficient is used to assess item similarity. Let y_i and y_j be the compact content feature vectors for non-CS item i and CS item j, respectively. The similarity between items i and j is defined as:

sim(i, j) = Σ_k (y_i,k − ȳ_i)(y_j,k − ȳ_j) / [ sqrt(Σ_k (y_i,k − ȳ_i)²) · sqrt(Σ_k (y_j,k − ȳ_j)²) ],   (3)

where ȳ_i and ȳ_j are the mean values of vectors y_i and y_j. Next, we derive the ratings of the CS item from its α most similar non-CS items. The predicted rating for user u on CS item j is formulated as:

r̂_uj = Σ_{i ∈ S_α(u,j)} sim(i, j) · r_ui / Σ_{i ∈ S_α(u,j)} |sim(i, j)|,   (4)

where r_ui is the real rating for user u on non-CS item i, and S_α(u, j) denotes the set of the α non-CS items rated by user u that are most similar to CS item j.
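The similarity and rating-generation steps above can be sketched as follows. The Pearson similarity is standard; the similarity-weighted average used for the predicted rating is an assumption based on common neighborhood-based CF, and the function names and toy feature vectors are illustrative.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length feature vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da, db = [x - ma for x in a], [x - mb for x in b]
    denom = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(x * x for x in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0

def predict_cs_rating(user_ratings, cs_features, noncs_features, alpha=30):
    """user_ratings: non-CS item -> rating by user u; *_features: DAE vectors."""
    # alpha most similar non-CS items (among those rated by the user)
    sims = sorted(
        ((pearson(noncs_features[i], cs_features), i) for i in user_ratings),
        reverse=True,
    )[:alpha]
    num = sum(s * user_ratings[i] for s, i in sims)
    den = sum(abs(s) for s, i in sims)
    return num / den if den else None  # None: no usable similar items

noncs = {"m1": [0.9, 0.1, 0.8], "m2": [0.1, 0.9, 0.2]}
ratings = {"m1": 5, "m2": 2}
print(predict_cs_rating(ratings, cs_features=[0.8, 0.2, 0.9],
                        noncs_features=noncs, alpha=1))
```

Returning `None` when no similar non-CS item is found mirrors the "-" entries described in the experiments.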

The NNCF Predictor
In this study, a neural network-based collaborative filtering (NNCF) predictor is proposed to predict the ratings in the updated preference matrix R′ ∈ R^(N×M), where N is the number of users and M is the number of items. As shown in Figure 3, the core architecture of the NNCF predictor includes an input layer, an embedding layer, multilayer perceptron (MLP) layers, and an output layer. The input layer is composed of two vectors u_i^o and v_j^o that are represented in one-hot encoding format after being converted from the unique identifications of user u_i and item v_j, respectively. Each user vector u_i^o and item vector v_j^o, rendered as a binary sparse vector, is further projected to a dense vector in the embedding layer. The transformed vector is called a latent vector with dimension K. The user latent vector and the item latent vector are fed into the MLP layers. Each layer in the MLP layers can be customized to discover specific latent structures of user-item interactions. The final output layer is the predicted rating r̂′_ij. The prediction function of the NNCF predictor can be written as:

r̂′_ij = f(P^T u_i^o, Q^T v_j^o | P, Q, Θ_f),

where P ∈ R^(N×K) and Q ∈ R^(M×K) denote the latent factor matrices for users and items, respectively, and Θ_f denotes the model parameters of the function f.
In the NNCF predictor framework, the MLP layers consist of at least three layers of nodes: the input layer, one or more hidden layers, and the output layer. Except for the input nodes, each node uses a nonlinear activation function, and each layer is fully connected to the next. The MLP utilizes a supervised learning technique called backpropagation for training and tries to minimize the loss with respect to the actual ratings. Let the user latent vector p_i be P^T u_i^o and the item latent vector q_j be Q^T v_j^o. The MLP model under the NNCF framework can be defined as follows:

z_1 = [p_i; q_j],
z_x = f_x(w_x^T z_{x−1} + b_x), x = 2, ..., X,
r̂′_ij = σ(h^T z_X),

where w_x, b_x, and f_x denote the weight matrix, bias vector, and activation function of the x-th layer, respectively; σ and h represent the activation function and edge weights of the output layer, respectively. In this study, training is performed by minimizing the pointwise loss between r̂′_ij and r′_ij over the training data. After obtaining the model's optimal parameters, the ratings for user u_i and item v_j can be predicted accurately.
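The NNCF forward pass described above can be sketched as follows, assuming concatenated user/item latent vectors, ReLU hidden layers, and a linear output; note that multiplying a one-hot vector by P or Q reduces to a row lookup, which is how the embedding is coded here. All sizes and weights are toy values, not the trained model.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def dense(x, W, b, act=None):
    """One fully connected layer: act(Wx + b)."""
    out = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return act(out) if act else out

def nncf_predict(user_id, item_id, P, Q, layers, h):
    # embedding lookup (one-hot product with P/Q is just a row lookup),
    # then concatenation of the two latent vectors
    x = P[user_id] + Q[item_id]
    for W, b in layers:
        x = dense(x, W, b, relu)                      # hidden MLP layers
    return sum(wi * xi for wi, xi in zip(h, x))       # output edge weights h

rng = random.Random(1)
K = 4                                                 # latent dimension (toy)
P = [[rng.uniform(-0.1, 0.1) for _ in range(K)] for _ in range(3)]  # 3 users
Q = [[rng.uniform(-0.1, 0.1) for _ in range(K)] for _ in range(5)]  # 5 items
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(2 * K)] for _ in range(8)]
layers = [(W1, [0.0] * 8)]
h = [rng.uniform(-0.5, 0.5) for _ in range(8)]
print(nncf_predict(0, 2, P, Q, layers, h))
```

In training, the weights P, Q, the layer parameters, and h would all be fitted by backpropagation against the observed ratings.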

Datasets and Data Collection
To demonstrate the feasibility and efficiency of the proposed CS item recommendation system, a real-world dataset created for the Netflix Prize is adopted. The dataset contains 100,498,277 ratings contributed by 480,189 anonymous users for 17,770 movies. The density of the dataset is 1.1778%. Each record in the dataset includes a movie ID, a user ID, the user's rating for the movie, and the rating date. The average number of ratings per user is 209. The textual descriptions (movie plots) for the movies are scraped from OMDb (Open Movie Database). Since not every movie in the Netflix dataset can be found in OMDb, movies with missing textual descriptions are removed. Besides, users who have rated fewer than 100 times are also deleted. Finally, a dataset of 49,771,100 ratings provided by 87,121 users for 12,747 movies is used in the following experiments.
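The filtering step above (dropping movies without a plot and users with fewer than 100 ratings) can be sketched as follows; the record format and function name are illustrative assumptions.

```python
from collections import Counter

def filter_dataset(records, plots, min_user_ratings=100):
    """records: list of (user_id, movie_id, rating); plots: movie_id -> text."""
    # keep only movies whose textual description was found (e.g., in OMDb)
    records = [r for r in records if r[1] in plots]
    # then drop users with too few remaining ratings
    counts = Counter(user for user, _, _ in records)
    return [r for r in records if counts[r[0]] >= min_user_ratings]

records = [(1, "m1", 5), (1, "m2", 4), (2, "m1", 3), (2, "m3", 2)]
plots = {"m1": "plot text", "m2": "plot text"}  # m3 has no plot
print(filter_dataset(records, plots, min_user_ratings=2))
# [(1, 'm1', 5), (1, 'm2', 4)]
```

In the toy data, movie m3 is dropped for lacking a plot, which leaves user 2 with only one rating, so user 2 is dropped as well.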

Recommendation Illustration
Typically, textual descriptions contain many trivial and meaningless words. The preprocessing tasks, including tokenization, stop-words removal, and stemming, are performed to generate a set of valid terms from the textual descriptions. After text preprocessing, 16,444 terms are obtained. Based on the set of terms, every movie is represented as <v_i,1, v_i,2, ..., v_i,16444>, where v_i,j is the occurrence frequency of term j in movie i. The term vectors with 16,444 dimensions cause a sparsity problem and result in a long item similarity evaluation time. Therefore, the DAE is applied to extract compact content features from the high-dimensional vectors. To improve the performance of the DAE, one hidden layer is added between layers x and y, and one hidden layer is added between layers y and z. Table 1 shows the settings of the topological structure and training parameters of the DAE. Besides, the dimensionality of the reduced content features (the number of nodes in y) is suggested as 164 after a set of experiments. The effect of the DAE settings will be further discussed in Section 4.3. For illustration purposes, we first define a movie receiving no more than four ratings (i.e., the given threshold value ρ = 4) as a CS movie. In this case, 20 items, such as movie IDs 549, 617, 684, 990, and 1007, are considered CS items, while the rest of the movies are considered non-CS movies. Pearson's correlation coefficient in Equation (3) is applied to evaluate the similarity between each pair of a CS movie and a non-CS movie. Based on the ratings of the α most similar non-CS items, the CS items' ratings for each user can be generated using Equation (4). Table 2 shows the predicted ratings of the five example CS movies for all users when α is set to 30. For example, the predicted ratings for user 7 on movie IDs 549, 617, 684, 990, and 1007 are 4, 5, 4, 3, and 4, respectively. Note that "-" in the table indicates "no predicted rating" since no similar non-CS movies can be found when applying Equation (4). The effect of the α value on the recommendation result will be further discussed in Section 4.3.
Next, the NNCF predictor is built based on the updated preference matrix R′. Since one-hot encoding is applied in the NNCF predictor, the input dimensions of the user vector and the movie vector are 87,122 and 12,747, respectively. Entries in both the user vector and the movie vector are projected to latent vectors with 32 dimensions in the embedding layer. A two-hidden-layer structure is applied in the MLP, where the ReLU activation function is used. The settings of the topological structure and parameters for the NNCF predictor are shown in Table 3. The well-trained NNCF predictor is then used to generate predicted ratings for all movies when a target user is requested. The movies with higher ratings are then returned to the user. Table 4 shows the top 50 ranked movies suggested for User ID 6.
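The final Top-N step (sorting the predicted ratings and returning the highest-ranked movies) can be sketched as follows; the predicted ratings shown are illustrative values rather than actual NNCF output, and excluding already-rated movies is a common assumption.

```python
def top_n(predicted, seen, n=50):
    """predicted: movie_id -> predicted rating for the target user."""
    # exclude movies the user has already rated, then rank by rating
    candidates = [(r, m) for m, r in predicted.items() if m not in seen]
    return [m for r, m in sorted(candidates, reverse=True)[:n]]

predicted = {"m1": 4.7, "m2": 3.1, "m3": 4.9, "m4": 4.2}
print(top_n(predicted, seen={"m1"}, n=2))  # ['m3', 'm4']
```

With n = 50, this produces a suggestion list like the one shown in Table 4.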

Parameter Analysis
In this section, a set of experiments is conducted and analyzed to show how parameter settings affect the performance of the proposed recommendation system.

Dimension Reduction Using the DAE
In this study, the DAE is used to extract the essential features from the textual vectors. Two critical factors in the DAE, the number of nodes in the hidden layers and the number of nodes in the latent representation layer (i.e., y), are studied further, while the other parameters are kept the same as those in Table 1. First, the number of nodes in the hidden layers is varied according to a predefined compression ratio, defined as

compression ratio = number of nodes in the input layer / number of nodes in the desired layer. (10)

For example, if the compression ratio is 10, the number of nodes in the hidden layer is 1644 (= 16,444/10). In this experiment, compression ratios of 2, 2.5, 3.3, 5, 10, 12.5, 16.7, 25, 50, and 100 are tested. Figure 4a shows the relationship between the validation loss and the epoch for different compression ratios. The lowest validation loss for most compression ratios occurs between epochs 11 and 16, and the validation loss reaches its minimum at compression ratios 10 (epoch 12) and 16.7 (epoch 13). Figure 4b shows the training time of the DAE for the different compression ratios; the training time surges when the compression ratio is small. Therefore, considering both computational efficiency and model accuracy, the number of nodes in the hidden layer is set to 987 (= 16,444/16.7).
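Equation (10) determines each candidate layer size. A minimal helper reproduces the sizes used in this experiment; truncating to a whole number of nodes is our assumption.

```python
def nodes_for_ratio(input_nodes: int, compression_ratio: float) -> int:
    """Equation (10): desired-layer size = input-layer size / compression ratio.
    Truncation to a whole node count is an assumption on our part."""
    return int(input_nodes / compression_ratio)

# Sizes discussed above for the 16,444-term input vectors:
hidden = nodes_for_ratio(16444, 10)   # 1644 hidden-layer nodes at ratio 10
latent = nodes_for_ratio(16444, 100)  # 164 latent-feature nodes at ratio 100
```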
Second, the number of nodes in the latent representation layer is also varied according to a set of compression ratios. Figure 5a shows the relationship between the validation loss and the epoch for compression ratios of 100, 125, 166.6, 250, 500, and 1000. In the first several epochs, the validation loss is large for the high-compression-ratio cases; after that, the loss decreases slowly for all compression ratios. However, the computational time for evaluating item similarity in Equation (4) is much longer if a small compression ratio is applied. Figure 5b illustrates the computational time for item similarity evaluation at several compression ratios, where a compression ratio of 1 indicates that no DAE feature reduction is applied. With no feature reduction, item similarity evaluation takes 11.49 seconds (i.e., the vector dimension is 16,444), whereas it takes only 2.87 seconds when the vector dimension is reduced to 164 (i.e., a compression ratio of 100). Therefore, considering both computational efficiency and model accuracy, the number of nodes in the latent representation layer is set to 164. This experiment also shows that applying the DAE for feature reduction in the proposed system significantly shortens the item similarity evaluation time while keeping the essential characteristics of the original features.
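The DAE structure described above (x → hidden → y → hidden → z) can be sketched at toy scale. This is a minimal sketch, not the paper's implementation: the tanh activations, plain gradient descent, and 10% masking-noise rate are our assumptions (the actual settings are in Table 1), and the layer sizes are shrunk from 16,444/987/164 for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for the paper's sizes (16,444 -> 987 -> 164 -> 987 -> 16,444).
D_IN, D_HID, D_CODE, N = 40, 16, 4, 64
X = rng.random((N, D_IN))                 # toy term-frequency vectors

# Encoder x -> h1 -> y and decoder y -> h2 -> z (biases omitted for brevity).
W1 = rng.normal(0, 0.1, (D_IN, D_HID))
W2 = rng.normal(0, 0.1, (D_HID, D_CODE))
W3 = rng.normal(0, 0.1, (D_CODE, D_HID))
W4 = rng.normal(0, 0.1, (D_HID, D_IN))

def forward(Xn):
    H1 = np.tanh(Xn @ W1)
    Y = np.tanh(H1 @ W2)                  # compact content features (layer y)
    H2 = np.tanh(Y @ W3)
    return H1, Y, H2, H2 @ W4             # last term: reconstruction z

lr, losses = 0.05, []
for _ in range(300):
    Xn = X * (rng.random(X.shape) > 0.1)  # masking noise: the "denoising" part
    H1, Y, H2, Z = forward(Xn)
    E = Z - X                             # reconstruct the *clean* input
    losses.append(float((E ** 2).mean()))
    # Plain backpropagation through the tanh layers.
    dH2 = (E @ W4.T) * (1 - H2 ** 2)
    dY = (dH2 @ W3.T) * (1 - Y ** 2)
    dH1 = (dY @ W2.T) * (1 - H1 ** 2)
    grads = (Xn.T @ dH1, H1.T @ dY, Y.T @ dH2, H2.T @ E)
    for W, G in zip((W1, W2, W3, W4), grads):
        W -= lr * G / N
```

After training, only the encoder half is needed: `forward(X)[1]` yields the compact features used for item similarity evaluation.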
To show the benefit of the proposed NNCF predictor, several popular recommendation algorithms are compared. For instance, the BaselineOnly algorithm [37] predicts the baseline estimate for a given user and item. In addition, selecting an appropriate number of nodes in the embedding layer is critical for achieving optimal recommendation results.

Performance of the Proposed CS Item Recommender
As mentioned in Section 3, an item v is called a cold start (CS) item if the number of ratings for v is no greater than a given threshold value ρ. For simplicity, most previous studies simply ignore the CS items or even remove them from the dataset. Unlike previous works, this study derives the CS item ratings from similar non-CS items using auxiliary textual information through the proposed DACR generator.
To show the benefits of the proposed CS item recommender, the CS items' ratings under different threshold values are derived and tested. For ease of understanding, DACRρ indicates that the DACR generator is applied to the CS items that receive no more than ρ ratings. For example, DACR4 means that the DACR generator generates the ratings for CS items that receive no more than four ratings in our dataset. Table 7 summarizes the six DACR models and the number of CS items each model deals with. Note that DACR0 indicates that no CS item rating is generated by the proposed DACR generator. In addition to the number of CS items, the generated CS item ratings may be affected by α, the number of most similar non-CS movies in Equation (4). Thus, α is varied over 10, 20, 30, 40, and 50 in the following discussion.
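The threshold-based CS definition above is straightforward to express in code. A small sketch, with names of our own choosing, that selects the CS item set for a given ρ from a rating log:

```python
from collections import Counter

def cs_items(rating_counts, rho):
    """Items whose number of received ratings is no greater than rho
    (the CS definition from Section 3). With rho = 0, no item is CS,
    matching the DACR0 setting."""
    return sorted(item for item, n in rating_counts.items() if n <= rho)

# Toy rating log of (user, item) events:
log = [(1, "A"), (2, "A"), (3, "A"), (4, "A"), (5, "A"),
       (1, "B"), (2, "B"), (1, "C")]
counts = Counter(item for _, item in log)
print(cs_items(counts, rho=4))  # ['B', 'C'] -- "A" has five ratings, so it is non-CS
```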

Figure 7 illustrates the MAE for different DACR + NNCF combinations. For example, Figure 7a shows the MAE when the six DACR models are applied with NNCF1. When no CS item rating is added (i.e., DACR0), the MAE of the proposed recommender is 0.6577. However, when the DACR4 model is applied, the MAE decreases to 0.6483, a 1.45% improvement (= (0.6577 − 0.6483)/0.6483) for α = 10. Moreover, when the DACR12 model is applied, the MAE decreases to 0.6062, an 8.49% improvement (= (0.6577 − 0.6062)/0.6062) for α = 50. Figure 7b-d shows similar trends and patterns. Table 8 summarizes the average MAE improvement over no CS item rating added when NNCF1 to NNCF4 are applied. Based on Figure 7 and Table 8, it is clear that when more CS items are added (from DACR4 to DACR12), lower MAE values are obtained. Besides, as α increases, the MAEs for all DACR models also decrease. However, a larger α makes the computation time much longer when evaluating item similarity.
In addition to the Netflix dataset, two more popular datasets are used to show the performance of the proposed CS item recommender: Amazon All Beauty and Amazon CDs & Vinyl [40]. Table 9 shows the features of the two datasets, both of which have very sparse density. Figure 8a,b shows the MAE under different DACR settings with the NNCF4 model for the All Beauty and CDs & Vinyl datasets, respectively. Note that DACR0 means no DACR generator is applied, while DACR1 means that the DACR generator generates the ratings for CS items that receive no more than one rating in the dataset. Figure 8a shows that the All Beauty dataset reveals trends similar to the Netflix dataset: as α increases, the MAE decreases. Figure 8b, however, shows that the MAE increases when α = 50, indicating that fifty similar non-CS movies (neighbors) might be too many for rating generation. Selecting an appropriate number of neighbors when applying the DACR generator is critical for better performance.
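The improvement figures quoted above follow the formula shown inline, i.e., the MAE reduction relative to the model *with* CS ratings. A one-line helper reproduces the arithmetic:

```python
def mae_improvement(mae_without, mae_with):
    """Relative MAE improvement as computed in the discussion of Figure 7:
    (MAE without CS ratings - MAE with CS ratings) / MAE with CS ratings."""
    return (mae_without - mae_with) / mae_with

# The two NNCF1 cases quoted above (values from Figure 7a):
imp_dacr4 = mae_improvement(0.6577, 0.6483)   # ~0.0145, i.e., the 1.45% figure
imp_dacr12 = mae_improvement(0.6577, 0.6062)  # ~0.0850, i.e., the ~8.5% quoted as 8.49%
```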

Conclusions
Recommendation systems now play an essential role in many online applications. Companies such as Amazon, Google, and Netflix have applied the technique extensively to their services by estimating their customers' preferences. Although many recommendation methods have been proposed recently, most previous research suffers from the cold start (CS) problem [41], in which only a small number of ratings are available for some items. To address this difficulty, this research develops a two-stage CS item recommendation system with two major components: the DACR generator and the NNCF predictor. In the DACR generator, the textual descriptions of items are adopted as auxiliary information for generating the items' content features. Next, a neural network-based dimension reduction method, the denoising autoencoder (DAE), is applied to extract content features from the vectors. The DAE can compress the vectors effectively while maintaining the characteristics of the original vectors.
Moreover, it significantly reduces the computational time for item similarity evaluations. Thus, the CS items' ratings are efficiently derived based on the ratings of similar non-CS items. In the second stage, the NNCF predictor is used to predict the ratings in the user-item preference matrix updated with the CS items' ratings. In the NNCF predictor, the long sparse vectors are projected to dense latent vectors in the embedding layer. Next, the latent vectors are fed into MLP layers for user-item interaction learning. When a user ID is specified, the trained NNCF predictor returns the predicted ratings of all items.
A set of experiments shows that the DAE can significantly reduce the computational time for item similarity evaluations while keeping the characteristics of the original features. Besides, the experiments indicate that the proposed NNCF predictor outperforms several popular baseline algorithms. Finally, we demonstrate that the proposed CS item recommender can achieve up to an 8% MAE improvement compared to adding no CS item ratings.
Although the proposed system is effective in solving the CS item recommendation problem, several directions remain for improvement. First, this study adopted the textual descriptions of items as auxiliary information to derive CS items' ratings; it is worthwhile to apply other content information, such as images and videos, to derive items' content features. Second, although the vector dimension reduction performed by the DAE is significant, different types of DAE can be tested. Third, overfitting might appear in the proposed NNCF recommender; further study can try strategies such as dropout and L1/L2 regularization to avoid this difficulty. Finally, it might be interesting to apply the proposed system to other applications, such as online music and news.

Figure 1. The framework of the proposed cold start item recommendation system.


Figure 2. The schematic structure of the DAE.


Figure 3. The structure of the NNCF predictor.

Figure 4. Experiments for the number of nodes in the hidden layers of the DAE: (a) validation loss; (b) training time.

Figure 5. Experiments for the number of nodes in the latent representation layer of the DAE.

4.3.2. Rating Prediction Using the NNCF Predictor


Figure 6. Effect of the number of nodes in the embedding layer of the NNCF predictors.

Figure 8. The performance of the proposed recommender for the two Amazon datasets under different DACR settings.


Table 1. The structure and parameter settings for the DAE.

Table 2. The predicted ratings for five example CS movies.

Table 3. The structure and parameter settings for the NNCF predictor.

Table 4. Top 50 recommended movies for User ID 6.

Table 6. The performance of the NNCF predictors.

Table 7. The summary of the six DACR models.


Table 8. The average MAE improvement compared to no CS item rating added.

Table 9. The main features of the two Amazon datasets.
