Customer Satisfaction of Recommender System: Examining Accuracy and Diversity in Several Types of Recommendation Approaches

Abstract: Information technology and the popularity of mobile devices allow for various types of customer data, such as purchase history and behavior patterns, to be collected. As customer data accumulate, the demand for recommender systems that provide customized services to customers is growing. Global e-commerce companies offer recommender systems to gain a sustainable competitive advantage. Research on recommender systems has consistently suggested that customer satisfaction will be highest when the recommendation algorithm is accurate and recommends a diversity of items. However, few studies have investigated the impact of accuracy and diversity on customer satisfaction. In this research, we seek to identify the factors determining customer satisfaction when using the recommender system. To this end, we develop several recommender systems and measure their ability to deliver accurate and diverse recommendations and their ability to generate customer satisfaction with diverse data sets. The results show that accuracy and diversity positively affect customer satisfaction when applying a deep learning-based recommender system. By contrast, only accuracy positively affects customer satisfaction when applying traditional recommender systems. These results imply that developers or managers of recommender systems need to identify factors that further improve customer satisfaction with the recommender system and promote the sustainable development of e-commerce.


Introduction
The e-commerce market continues to grow with the development of information technology and the popularization of mobile devices. However, with new items being released regularly, customers are increasingly spending a significant amount of time and effort selecting items that they want [1]. Therefore, personalized recommender systems are rapidly becoming important, and global companies such as Amazon [2], Netflix [3], and Google [4] are offering various services using recommender systems to maintain a sustainable competitive advantage in e-commerce. Providing products or services that suit customer interests can help reduce customers' efforts to explore offerings and increase customer satisfaction as well as item sales [5]. In particular, a recommender system that provides recommendations using customer purchase history data can help customers choose among various available alternatives [6]. However, when a personalized recommender system does not meet customer expectations, customers may reject its recommendations and even show contempt for personalized services.
Previous studies have focused primarily on enhancing recommender algorithm performance using customer purchasing history or preferences [5-7]. The performance of the recommender algorithm was primarily measured using accuracy and diversity metrics [8-11]. Accuracy indicates how closely the customer's predicted preferences match their actual preferences, and diversity indicates how well customers were recommended items that they had not previously purchased [10,12,13]. Studies of recommendation accuracy have mainly focused on how well recommender algorithms improve predictive accuracy for customers. Thus, general recommender system research aims to increase the predictive accuracy of the model [5,7,14-19]. The study of the diversity of recommendations focuses on how well recommender systems recommend various products that the customer had not previously purchased while maintaining a certain level of accuracy [20-23]. Generally, if the recommender system provides items suitable for customer preference, customer satisfaction should increase. However, if the system recommends the same item every time, customer satisfaction will decrease even if the recommender system's accuracy is high [13,24]. Other studies suggest that pursuing diversity while maintaining a certain level of recommender system accuracy can increase customer satisfaction [25,26]. In other words, there seems to be an accuracy-diversity dilemma for recommender systems [8,27-29]. Thus, although research on recommender systems has focused on enhancing the model's performance, customer satisfaction with the recommender system is just as important as improving system performance. Nonetheless, few studies have considered the relation between the performance of the recommender system and customer satisfaction. We believe it is important to address this issue because the recommender system is also an important factor in gaining a sustainable competitive advantage for the e-commerce platform.
This study proposes a novel research methodology to identify factors that affect customer satisfaction when using recommender systems on an e-commerce platform. A few studies have determined that the accuracy and diversity of recommendations are positively related to customer satisfaction [8,30-35]. However, in these previous studies, it is not clear how accuracy and diversity affect customer satisfaction. To explore this question, we developed several recommender systems and measured the accuracy and diversity of recommendations and customer satisfaction through a series of experiments with diverse real-world datasets. In addition, we adopted the expectancy disconfirmation theory (EDT) approach, which is widely used in online e-commerce to identify customer satisfaction [36,37]. Whereas many previous studies have calculated customer satisfaction with recommendations through surveys, this study calculates customer satisfaction from simulation experiments using extensive data from e-commerce websites. We show that the proposed customer satisfaction calculation approach can be applied to other domains, including the phenomenon of the entire market. This study seeks to make theoretical contributions by simultaneously considering the customer attitude aspect and its relationship to the recommendation performance aspect. It also identifies how the e-commerce platform facilitates the customer decision-making process from a practitioner aspect.
This study collected a dataset from GroupLens and Amazon, including User ID, Item ID, and Rating. We then constructed accuracy, diversity, and customer satisfaction metrics and used regression models to identify the impact of the accuracy and diversity of recommendations on customer satisfaction. Finally, we studied the prediction power of our proposed factors affecting customer satisfaction using a dataset containing 1,000,209 interactions and 2,023,070 interactions from GroupLens and Amazon, respectively. The results of our experiments indicate that recommendation accuracy significantly influences customer satisfaction. Recommendation accuracy can positively affect customer satisfaction when applying the most popular recommender system algorithms, such as ItemKNN, SVD, and NCF. Additionally, the diversity of recommendations positively affects customer satisfaction only when applying deep learning-based recommender systems such as NCF. These results confirm that accuracy and diversity positively affect customer satisfaction when applying a deep learning-based recommender system. By contrast, only accuracy positively affects customer satisfaction when applying traditional recommender systems. The framework of this study is shown in Figure 1. The remainder of this study is structured as follows: Section 2 describes recommender systems in e-commerce, overviews of the recommender system method, and EDT with customer satisfaction. Section 3 presents the developed research hypotheses. Sections 4 and 5 describe two publicly available datasets, evaluation criteria, and experimental results, respectively. Finally, Section 6 summarizes the research and describes future studies.

Recommender Systems in E-Commerce
Personalized recommender systems in e-commerce research have been regarded as significant issues in approximately the last 20 years [13,38]. Following the success of Amazon, Netflix, Spotify, and others, most e-commerce companies have tried to provide a certain level of personalized recommendation service. Otherwise, e-commerce companies would not last for a long time [39]. Customers are becoming familiar with receiving recommendations via smartphones, and it will not be easy to achieve sales continuity if customers are recommended only products that suit their preferences. Many products are being produced worldwide and introduced to the market, and consumer needs are more diverse than in the past; consequently, customers seek a differentiated personalization experience when purchasing products. Recently, technologies such as machine learning and deep learning have been developed, and customers' data can be analyzed in various ways. Therefore, e-commerce is focusing on a more advanced personalization recommender system for sustainable development [40].
Netflix [3] proposed a personalization recommendation algorithm based on a deep neural network to build a video recommender system. Because of this personalization recommender system, Netflix has become a leader in movies and dramas. Spotify [41] has topped the music streaming market by offering personalization services. Spotify's services are Discover Weekly, which suggests new music every Monday, and Fresh Finds, which introduces songs by relatively lesser-known artists, and so on. Google [4] recommends news in real-time based on users' regions and interests. It also provides AI assistant services by learning users' life patterns. Amazon [42] started to provide personalization services by applying AI technology to its AI speakers and Amazon websites. Furthermore, Amazon has released some AI technologies as a service. Samsung provides automatic personalized services by analyzing users' living habits and usage environments through their smartphones. Alibaba [43] and Naver [44] have applied AI-based personalization services to search content.
Recently, the term hyper-personalization has emerged as an advance beyond personalization, because services that satisfy customers in e-commerce are becoming increasingly important. However, it is not easy to find empirical studies that examine the relation between personalization services in e-commerce and customer satisfaction. Therefore, this study aims to identify factors that affect customer satisfaction when e-commerce platforms provide personalized recommender systems.

Methodologies in Recommender Systems
Recommender systems help users filter useless information to reduce information overload and provide personalized recommendations. E-commerce platforms have achieved great success in assessing customers' preferred products and improving their business profit. To enhance personalization capabilities, recommender systems are widely applied in many multimedia platforms targeting media products to specific customers. Since the early e-commerce platforms, the most representative analysis technique in recommender systems has been collaborative filtering (CF), which is reported to provide good performance despite its simple structure and ease of use [5,7,39,45]. The CF algorithm predicts customers' preferences by calculating similarities among customers or items [15,38,39].
CF algorithms are mainly divided into two categories: memory-based and model-based [6]. Memory-based CF can be divided into user-based and item-based CF. User-based CF calculates the similarity between customers by comparing their ratings on the same item [38]. It then computes the predicted rating for an item by the active customer as a weighted average of the item's ratings by customers similar to the active customer, where the weights are the similarities of these customers with the active customer. Item-based CF computes predictions using the similarity between items rather than the similarity between customers [13,15]. Model-based CF uses a user-item rating matrix to train a model with machine learning or data mining techniques to improve the CF algorithm's performance [6,46]. The trained model can then be used to provide recommendation lists for individual users. These techniques can quickly recommend a series of items because they use a precomputed model, and they have been proven to produce recommendation results similar to the neighborhood-based recommender system [13]. Algorithms that are often used in model-based CF include SVD (singular value decomposition), Bayesian networks, and neural networks [6,13,38]. However, an issue known as "cold start" accompanies CF, whereby the recommendations for new customers suffer from unpredictability because of a lack of historical data on their past purchases. Another issue known as "first start", in which recommendations cannot be made until a customer's preferences are reflected, is also widely prevalent [6,13]. In addition, as the volume of data increases, there is a scalability problem that reduces the CF algorithm's computational speed. Recently, many researchers have started to apply deep learning to recommender systems to maximize each method's advantages, supplement the disadvantages of CF algorithms, and effectively utilize various kinds of information [44,47].
A deep neural network (DNN) refers to a network with two or more hidden layers between the input and output layers [48]. This method uses sophisticated mathematical modeling to solve complex problems. Compared to traditional machine learning algorithms, DNNs have been reported to have the advantage of being able to identify the latent structure of data [48]. Covington, et al. [49] proposed a recommendation algorithm based on DNN to build a video recommender system and showed that the proposed recommendation algorithm predicted 60% of video clicks on YouTube. Cheng, et al. [50] proposed an app recommender system for Google Play based on a DNN, and Okura, et al. [51] proposed a news recommender system based on a recurrent neural network (RNN) and achieved good performance when applying it to Yahoo News. Since such DNN-based recommender systems show an outstanding performance improvement over traditional recommender systems based on content-based filtering (CB), CF, and their hybrid methodologies, various attempts have been made to apply the DNN model to diverse recommendation problems [47]. The neural collaborative filtering (NCF) algorithm is one of the most typical models combining DNN and CF. NCF is trained by estimating the relationship between the user's latent vector and the latent vector of the item through the multilayer perceptron-based matrix factorization technique [52]. Therefore, in this study, we applied the most popular approaches, CF, SVD, and NCF, to develop a recommender system to identify which factors can affect customer satisfaction.

EDT and Customer Satisfaction
This study identifies factors that affect customer satisfaction when recommender systems are used on an e-commerce platform. To calculate customer satisfaction, we employ the EDT approach, which has been widely used in previous studies. EDT, which is used in various fields, is an extension model based on expectation-confirmation theory and the technology acceptance model (TAM) [53]. The EDT model is used in various studies to determine its impact on customer satisfaction and continuance intention in the latest technologies and online environments [53,54]. Continuation intention is influenced by customer satisfaction, determined by the difference between perceived quality and expectation levels. Consequently, customer satisfaction has a positive effect on continuance intention and word of mouth. According to EDT theory, the satisfaction that customers feel after purchasing products and services results from the following five stages [55]. First, customers shape their expectations for products and services through their experience. Second, they recognize the performance after using products and services. Third, they compare the performance with their expectations. If the performance is higher (or lower) than their expectations, a positive (or negative) disconfirmation will occur. Fourth, customers judge their satisfaction level based on these initial expectations and the resulting degree of disconfirmation. In other words, customers who have experienced positive disconfirmation are satisfied, while customers with negative disconfirmation are dissatisfied. Finally, the satisfied customer will then form the intention to repurchase or reuse the product or service, but dissatisfied customers will stop using it.
For example, Bhattacherjee [53] used expectation-confirmation theory to identify factors that influence customers' reuse intentions for online banking. McKinney, et al. [56] used EDT theory to measure web customer satisfaction in the information search stage of online shopping. Lin [57] proposed that EDT theory in e-commerce is an appropriate model for customer behavior because customer repurchase decisions are influenced by customer behavior. Based on EDT theory, Nevo and Chan [58] studied the effects of customer expectations and the desire for knowledge management systems on system satisfaction. Doong and Lai [59] used EDT theory to identify factors that influence the reuse intent of an e-negotiation system. From these studies, we can infer that EDT theory is suitable for a wide range of applications in which a comparison of customers' expectations of a product or system with the perceived performance plays an essential role in decision making. Applying a recommender system in e-commerce is directly related to sales and profit, so it is essential to develop or introduce a recommender system that fits customers' expectations. Whether a recommender system should continue to be applied to an e-commerce site can be determined by the disconfirmation between the customers' experience with the recommender system and their prior expectations.

Hypothesis 1: Accuracy of Recommendation
Customer satisfaction is important for maintaining a sustainable competitive advantage in e-commerce [60]. A customer who is satisfied with the recommendation service provided by an internet shopping mall tends to repurchase items on the e-commerce platform and recommend the recommendation service to his/her family, friends, and colleagues.
Algorithms for recommender systems were developed on the assumption that the satisfaction of customers increases as the accuracy of recommender systems increases [7,38,61,62]. Some researchers have shown that more accurate recommendations increase customer satisfaction [8,30,31]. Liang, et al. [63] empirically verified that user satisfaction with the recommender system can be increased depending on how accurate the recommendation provided is. In other words, more accurate recommendations increase the likelihood that customers will find items that suit their preferences, which in theory increases customer satisfaction. Therefore, reflecting the relationship between the accuracy of the recommender system and customer satisfaction, Hypothesis 1 (H1) posits that the accuracy of recommendations positively affects customer satisfaction.

Hypothesis 2: Diversity of Recommendation
Providing new items or services to customers in e-commerce is related to the diversity of recommendations. The diversity of the recommendations is assessed by evaluating the ability of a recommender system to provide a diverse list of recommendations that the customer did not know [61]. It is known that if the accuracy of the recommender system is high, the customer satisfaction level is also high [63]. However, the satisfaction with or reliability of the recommender system will decrease if the customer receives the same recommended item repeatedly. Some studies have claimed that accuracy is not the only consideration when measuring the quality of the recommendation [32-35]. Other studies argue that a more diverse list of recommendations increases the probability that a customer will choose the recommended item [12,32,64,65]. Thus, it is also important for recommender systems to provide a recommendation list consisting of diverse items as well as accurate items. In other words, increasing the diversity of the recommendations by decreasing the similarity of the items in the recommended list significantly improves customer satisfaction [24,35,66]. Thus, Hypothesis 2 (H2) posits that the diversity of recommendations positively affects customer satisfaction.

Dataset Collection and Pre-Processing
We used MovieLens 1M (https://grouplens.org/datasets/movielens/1m/, accessed on 1 October 2020) and Amazon Product (http://jmcauley.ucsd.edu/data/amazon/, accessed on 1 October 2020), two publicly accessible datasets, for our experiments. The descriptive statistics of the two datasets are summarized in Table 1. The MovieLens dataset contains 1,000,209 ratings from 6,040 users on 3,706 items with a sparsity of 95.53%. Ratings are on a discrete scale of 1-5, and each user has rated at least 20 movies. The Amazon dataset contains 2,023,070 ratings from 1,210,271 users on 249,274 items with a sparsity of 99.99%. This original dataset is extensive but very sparse. For example, over 73% of users have rated only one item, making it difficult to evaluate algorithms. Therefore, the Amazon dataset was filtered in the same way as the MovieLens dataset, retaining only users with 20 or more ratings. This results in a subset of the dataset that contains 2,826 users and 42,042 items.
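The sparsity figures and the 20-rating filter described above can be reproduced with a few lines of code. The following is a minimal sketch, not the authors' preprocessing script; the function names `sparsity` and `filter_active_users` and the `(user_id, item_id, rating)` tuple layout are assumptions made for illustration.

```python
from collections import Counter


def sparsity(n_ratings, n_users, n_items):
    """Fraction of empty cells in the user-item rating matrix."""
    return 1.0 - n_ratings / (n_users * n_items)


def filter_active_users(ratings, min_ratings=20):
    """Keep only the rows of users with at least `min_ratings` ratings.

    `ratings` is a list of (user_id, item_id, rating) tuples (assumed layout).
    """
    counts = Counter(user for user, _, _ in ratings)
    return [row for row in ratings if counts[row[0]] >= min_ratings]


# MovieLens 1M counts as reported in the text.
movielens_sparsity = sparsity(1_000_209, 6_040, 3_706)
print(f"{movielens_sparsity:.2%}")  # 95.53%

# Toy data: user "a" has 2 ratings, user "b" has 1.
toy = [("a", 1, 5), ("a", 2, 3), ("b", 1, 4)]
kept = filter_active_users(toy, min_ratings=2)
```

Applying the same sparsity function to the unfiltered Amazon counts gives a value above 99.99%, which is why filtering inactive users matters before evaluation.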

Evaluation Criteria of Accuracy, Diversity, and Customer Satisfaction
To measure the accuracy and diversity of recommendations as well as customer satisfaction, we adopted simple random sampling (SRS), which has been widely used in the literature [13,38]. For each user, we used 80% of the ratings as the training dataset and the remaining 20% to make predictions. The evaluation metrics depend on the recommendation approach. Accuracy metrics show how closely the customer's predicted preferences match their actual preferences, and diversity metrics show how well customers were recommended items that they had not previously purchased or expected. The metrics measuring accuracy are divided into statistical and decision-supporting accuracy metrics [67]. The former are employed for predictive algorithms, and the latter are employed for classification algorithms. In this study, to evaluate the performance of the recommender system, we employed the mean absolute error (MAE) and F1 score, metrics that have been widely used in the literature [61,67,68]. The MAE is a statistical accuracy metric that evaluates the quality of prediction by comparing the difference between predicted and actual ratings on test users, as shown in Equation (1). A lower MAE value indicates a more accurate recommendation prediction.
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{r}_{u,i} - r_{u,i}\right| \qquad (1)$$
where $n$ is the total number of recommendation items, $\hat{r}_{u,i}$ is the predicted rating, and $r_{u,i}$ is the actual rating by the user $u$ for the item $i$.
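As a small illustration of the MAE described above, the sketch below averages the absolute differences between predicted and actual ratings. The function name and the toy ratings are illustrative assumptions, not values from the experiments.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Toy example: absolute errors are 1.0, 0.5, and 0.0.
mae = mean_absolute_error([4.0, 3.5, 2.0], [5.0, 3.0, 2.0])
print(mae)  # 0.5
```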
To understand whether users are interested in the recommendation list, we employ the precision, recall, and F1 score metrics, which are widely used in Top-K recommendation to evaluate recommendation lists of varying length [33,61]. The F1 score is the harmonic mean of precision and recall. A higher F1 score means a higher prediction ability of the recommendation system. The precision, recall, and F1 score for Top-K recommendations are defined in Equations (2)-(4).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4)$$
where TP is true positive (item relevant and recommended), FP is false positive (item irrelevant and recommended), and FN is false negative (item relevant and not recommended). The available ratings are binarized to differentiate relevant and irrelevant items.
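The Top-K metrics can be sketched for a single user as follows. This is a minimal illustration under the assumption that relevance has already been binarized into a set of relevant items; the function name and the toy item identifiers are hypothetical.

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall, and F1 for one user's Top-K list."""
    recommended, relevant = set(recommended), set(relevant)
    tp = len(recommended & relevant)   # relevant and recommended
    fp = len(recommended - relevant)   # irrelevant but recommended
    fn = len(relevant - recommended)   # relevant but not recommended
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: TP = 2 ("a", "c"), FP = 2 ("b", "d"), FN = 1 ("e").
precision, recall, f1 = precision_recall_f1(["a", "b", "c", "d"], {"a", "c", "e"})
```

In an evaluation loop, these per-user values would be averaged over all test users for each list length K.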
Most recent studies have suggested measuring the diversity of recommended items as well to avoid a situation where many customers are referred to the same items [8,12,20]. There are several metrics for measuring the diversity of recommendations. In this study, we measured diversity using Shannon entropy (SE), which is widely used in several studies [69,70]. The SE is defined as follows:
$$\mathrm{SE} = -\sum_{i=1}^{n} p_i \log p_i$$
where $p_i$ is the percentage of the recommendation items containing the $i$th item and $n$ is the total number of items.
Many customers post star ratings on e-commerce platforms for items that they have purchased. Star ratings are essential for predicting initial expectation levels for recommended items because the recommender system predicts the likelihood of customer purchases based on star ratings. Additionally, star ratings are important in measuring the performance following the purchase because high and low ratings indicate positive and negative views of items, respectively [71]. Therefore, we can define disconfirmation as the average of the differences between users' actual ratings and predicted ratings. Disconfirmation is defined as follows:
$$\mathrm{Disconfirmation} = \frac{1}{m}\sum_{i=1}^{m}\left(r_{u,i} - \hat{r}_{u,i}\right)$$
where $m$ is the total number of recommendation items, $\hat{r}_{u,i}$ is the predicted rating, and $r_{u,i}$ is the actual rating by the user $u$ for the item $i$. We calculated customer satisfaction for each test user and reported the average score.
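A short sketch can make the two measures above concrete. It rests on assumptions the text does not settle: the natural logarithm is used for Shannon entropy, each item's share is taken over all recommended slots, and disconfirmation is computed as the signed mean of actual minus predicted ratings. Function names and toy data are illustrative only.

```python
import math
from collections import Counter


def shannon_entropy(recommendation_lists):
    """Entropy of item shares across all recommendation lists (natural log, assumed)."""
    counts = Counter(item for rec in recommendation_lists for item in rec)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def disconfirmation(actual, predicted):
    """Mean signed gap: positive when actual ratings exceed predicted ones."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)


# Toy example: item "a" fills 2 of 4 slots, "b" and "c" one slot each.
entropy = shannon_entropy([["a", "b"], ["a", "c"]])

# Toy example: gaps are +1.0, -0.5, and 0.0.
gap = disconfirmation([5.0, 3.0, 4.0], [4.0, 3.5, 4.0])
```

Entropy is zero when every slot holds the same item and grows as recommendations spread more evenly over the catalog.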

Build Several Types of Recommender System
To test the research hypotheses, we developed ItemKNN, SVD, and NCF algorithms, which are among the most popular recommender system algorithms [10,24]. The simulation experiments were implemented using the Surprise and Keras libraries. All experiments were carried out on a system with an i9-9900KF CPU @ 3.60 GHz and 64 GB of RAM. The three types of recommender system methods are described as follows:

ItemKNN
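This method is the standard item-based CF that is based on neighborhood models in recommender systems [10,14,68]. We followed the setting of the existing literature to adapt it to an explicit dataset [2,72]. Item-based CF relies on a similarity measure between items, where $\mathrm{sim}(i,j)$ denotes the similarity of item $i$ and item $j$. Many studies have measured similarity based on the Pearson correlation coefficient [13,73]. The similarity between item $i$ and item $j$ is calculated as follows:
$$\mathrm{sim}(i,j) = \frac{\sum_{u}\left(r_{u,i} - \bar{r}_i\right)\left(r_{u,j} - \bar{r}_j\right)}{\sqrt{\sum_{u}\left(r_{u,i} - \bar{r}_i\right)^2}\,\sqrt{\sum_{u}\left(r_{u,j} - \bar{r}_j\right)^2}}$$
where $r_{u,i}$ represents the rating of user $u$ for item $i$, $\bar{r}_i$ is the average rating of the $i$th item, and the sums run over the users who rated both items. In this method, the goal is to predict the unobserved rating $\hat{r}_{u,i}$ of user $u$ for item $i$. To predict item $i$ for user $u$, the method sums the ratings given by the user for items similar to $i$, with each rating weighted by the corresponding similarity $\mathrm{sim}(i,j)$ between items $i$ and $j$ [73]. The predicted rating is taken as a weighted average of the ratings for neighboring items, defined as follows:
$$\hat{r}_{u,i} = \frac{\sum_{j \in N}\mathrm{sim}(i,j)\, r_{u,j}}{\sum_{j \in N}\left|\mathrm{sim}(i,j)\right|}$$
where $N$ denotes the set of neighboring items of $i$ that user $u$ has rated.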
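The item-based procedure described above can be sketched in plain Python. This is an illustrative toy, not the Surprise implementation used in the experiments: the nested-dictionary layout, the function names, the neighborhood size `k`, and the choice to keep only positively correlated neighbors are all assumptions.

```python
import math


def item_mean(ratings, item):
    """Average rating of an item over all users who rated it."""
    values = [r[item] for r in ratings.values() if item in r]
    return sum(values) / len(values)


def pearson_item_similarity(ratings, i, j):
    """Pearson correlation between items i and j over users who rated both."""
    mean_i, mean_j = item_mean(ratings, i), item_mean(ratings, j)
    common = [u for u, r in ratings.items() if i in r and j in r]
    num = sum((ratings[u][i] - mean_i) * (ratings[u][j] - mean_j) for u in common)
    den_i = math.sqrt(sum((ratings[u][i] - mean_i) ** 2 for u in common))
    den_j = math.sqrt(sum((ratings[u][j] - mean_j) ** 2 for u in common))
    if den_i == 0 or den_j == 0:
        return 0.0
    return num / (den_i * den_j)


def predict_rating(ratings, user, target, k=2):
    """Similarity-weighted average of the user's ratings on the k nearest items."""
    sims = [(pearson_item_similarity(ratings, target, j), r)
            for j, r in ratings[user].items() if j != target]
    # Assumption: only positively correlated items act as neighbors.
    neighbors = sorted((s for s in sims if s[0] > 0), reverse=True)[:k]
    weight = sum(abs(s) for s, _ in neighbors)
    if weight == 0:
        return None  # no usable neighbors for this user-item pair
    return sum(s * r for s, r in neighbors) / weight


# Toy ratings: user -> {item: rating}; u4 has not rated item "A".
toy = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"A": 1, "B": 2, "C": 5},
    "u4": {"B": 4, "C": 1},
}
sim_ab = pearson_item_similarity(toy, "A", "B")
sim_ac = pearson_item_similarity(toy, "A", "C")
predicted = predict_rating(toy, "u4", "A")
```

In this toy data, item B correlates positively with A and item C negatively, so the prediction for u4 is driven by that user's rating of B.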

SVD
Recently, the matrix factorization model has gained popularity because of its high accuracy and scalability [10,13,24]. This study focuses on methods that are induced by the SVD of the user-item interaction matrix. SVD is the most popular approach for estimating the interaction component in the matrix factorization technique, which reduces the number of features in a dataset by projecting a high-dimensional space onto a low-dimensional one [24,38]. Accordingly, each item $i$ is associated with a latent vector in $V$, and each user $u$ is associated with a latent vector in $U$. Typically, this method is applied to explicit feedback datasets while avoiding overfitting through a regularized model [74,75]. The SVD model is defined as follows:
$$\min_{U,V}\ \left\| M \odot \left(Y - U V^{\top}\right) \right\|_F^2 + \lambda\left(\|U\|_F^2 + \|V\|_F^2\right)$$
where $U$ and $V$ are the latent factor matrices of users and items, respectively, and $\lambda$ is used for regularizing the model. $Y$ is the available ratings set, and $M$ is the binary mask.
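The regularized factorization described above can be illustrated with a small stochastic gradient descent loop over the observed ratings only, which plays the role of the binary mask. This is a simplified sketch, not the Surprise `SVD` model used in the experiments (which also learns bias terms); the function names, hyperparameter values, and toy ratings are assumptions.

```python
import random


def train_mf(ratings, n_users, n_items, n_factors=2,
             lr=0.02, reg=0.02, epochs=500, seed=0):
    """Fit user factors U and item factors V by SGD on observed ratings.

    `ratings` is a list of (user_index, item_index, rating) tuples.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(pu * qi for pu, qi in zip(U[u], V[i]))
            for f in range(n_factors):
                pu, qi = U[u][f], V[i][f]
                # Gradient step on squared error plus L2 penalty.
                U[u][f] += lr * (err * qi - reg * pu)
                V[i][f] += lr * (err * pu - reg * qi)
    return U, V


def mean_squared_error(ratings, U, V):
    """Mean squared reconstruction error over the observed ratings."""
    total = 0.0
    for u, i, r in ratings:
        total += (r - sum(pu * qi for pu, qi in zip(U[u], V[i]))) ** 2
    return total / len(ratings)


# Toy observed ratings for 4 users and 3 items.
observed = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
            (2, 1, 2), (2, 2, 5), (3, 0, 5), (3, 2, 1)]
U0, V0 = train_mf(observed, 4, 3, epochs=0)   # untrained factors
U, V = train_mf(observed, 4, 3)
initial_error = mean_squared_error(observed, U0, V0)
final_error = mean_squared_error(observed, U, V)
```

The number of latent factors here corresponds to the factor size varied in the experiments reported later.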

NCF
In general, the traditional latent factor model uses a simple dot product of latent vectors to estimate the relationship between a user and an item. Therefore, its ability to produce good results is limited [47,76]. To overcome the limitations of the existing technique, this method is trained by estimating the relationship between the latent vector of the user and the latent vector of the item through the multilayer perceptron [47,77]. The user embedding and item embedding are fed into a multilayer neural structure to map latent vectors to prediction scores. Finally, the dimension of the last hidden layer $N$ determines the capability of the model. The output layer is the predicted rating, and the model training is performed by minimizing the loss between the predicted rating and its actual rating [47,52]. The training model followed the parameter settings of existing studies [52,77]. The NCF predictive model is defined as follows:
$$\hat{y}_{ui} = f\left(P^{\top} v_u^{U},\ Q^{\top} v_i^{I} \mid P, Q, \Theta_f\right), \qquad \phi_l(z_{l-1}) = a_l\left(W_l^{\top} z_{l-1} + b_l\right)$$
where $v_u^{U}$ and $v_i^{I}$ denote that the input layer consists of two feature vectors. $P$ and $Q$ denote the latent factors for the user and item, respectively, and $\Theta_f$ denotes the parameter of the model. $W$ and $b$ represent weight matrices and bias vectors, respectively, and $a_l$ is the activation function of the $l$th hidden layer.
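To make the structure described above more tangible, the following sketch runs a forward pass of a tiny multilayer perceptron over concatenated user and item embeddings. It is an untrained toy in plain Python, not the Keras model used in the experiments; the class name `TinyNCF`, the layer sizes, the ReLU activation, and the linear output are assumptions.

```python
import random


def relu(x):
    return max(0.0, x)


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: activation(W^T z + b)."""
    outputs = []
    for column, bias in zip(zip(*weights), biases):
        total = sum(w * z for w, z in zip(column, inputs)) + bias
        outputs.append(activation(total) if activation else total)
    return outputs


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyNCF:
    """Embeddings -> concatenation -> hidden layers -> predicted rating."""

    def __init__(self, n_users, n_items, n_factors=4, hidden=(8, 4), seed=0):
        rng = random.Random(seed)
        self.user_emb = init_matrix(n_users, n_factors, rng)
        self.item_emb = init_matrix(n_items, n_factors, rng)
        self.layers = []
        size = 2 * n_factors  # width after concatenating both embeddings
        for units in hidden:
            self.layers.append((init_matrix(size, units, rng), [0.0] * units))
            size = units
        self.out = (init_matrix(size, 1, rng), [0.0])

    def predict(self, user, item):
        z = self.user_emb[user] + self.item_emb[item]  # list concatenation
        for weights, biases in self.layers:
            z = dense(z, weights, biases, relu)
        return dense(z, *self.out)[0]


model = TinyNCF(n_users=3, n_items=5)
score = model.predict(user=0, item=1)
```

Training would adjust the embeddings, weights, and biases to minimize the loss between `predict` outputs and observed ratings; that step is omitted here.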

Impact of Predictive Factor Size
In this section, we study the impact of factor size on the predictive performance of the recommender system with the MovieLens dataset. To determine the optimal number of factors, we performed several experiments in which the number of factors varied from 5 to 100. For the SVD and NCF algorithms, the number of factors is equal to the number of latent factors. For the ItemKNN algorithm, we performed experiments with several neighborhood sizes and reported the best performance. Figure 2 shows the results of the experiments. The results show that the predictive performance of the ItemKNN algorithm increased as the neighborhood size increased. The performance of the SVD algorithm does not change much as the number of factors increases. In the NCF algorithm, after a certain number of factors, the improvement gains diminished, and the quality of prediction worsened. For each algorithm, the quality of prediction was best when the number of factors was 50, 50, and 10, respectively. Thus, we performed several other experiments to determine the optimal number of item recommendations when the number of factors was optimized.

Impact of Number of Recommendation List
To determine the optimal accuracy and diversity, we conducted experiments with recommendation list sizes varying from 5 to 100 at the optimized number of factors. The results are shown in Figures 3-5. For all recommender system algorithms, the figures show that accuracy (F1 score) and diversity (Shannon entropy) improve as the recommendation list size increases. For the ItemKNN, SVD, and NCF algorithms, accuracy was highest when the recommendation list size was 100, 90, and 100, respectively, whereas diversity continued to increase with the recommendation list's increasing length. The diversity was highest when the recommendation list size was 100 for all algorithms. In other words, the total number of unique items increased as the length of the recommendation list increased. These results show that both accuracy and diversity are optimized for the ItemKNN and NCF algorithms when the recommendation list size is 100. Furthermore, the SVD algorithm's accuracy and diversity are optimized when the recommendation list size is 90. Therefore, we tested the hypotheses at the optimized number of factors and the optimized number of recommendations.

Experimental Results
The mean and standard deviation for accuracy, diversity, and customer satisfaction on the MovieLens dataset are listed in Table 2. The mean values for accuracy and diversity were between 0.5146 and 0.6927 and between 1.0560 and 1.1628, respectively. Furthermore, the mean value of customer satisfaction was between 0.4820 and 0.6204. Accuracy is highest for the NCF algorithm (0.6927) and lowest for the ItemKNN algorithm (0.5146). Diversity is highest for the NCF algorithm (1.1628) and lowest for the SVD algorithm (1.0560). Customer satisfaction is highest for the NCF algorithm (0.6204) and lowest for the ItemKNN algorithm (0.4820). To test the research hypotheses proposed above, we performed multiple regression analyses (MRA) on the simulation output data, using customer satisfaction as the dependent variable and the accuracy and diversity of recommendations as independent variables. Table 3 summarizes the results of the MRA for hypotheses H1 and H2 on the MovieLens dataset. In Table 3, for both the ItemKNN and SVD algorithms, accuracy is a significant factor of customer satisfaction (p < 0.001). The effect of diversity of recommendation is not significant for ItemKNN and negatively affects customer satisfaction (p < 0.001) for the SVD algorithm. The regression models explain 14.3% and 1.9% of the variance in customer satisfaction, respectively. For the NCF algorithm, the significant factors of customer satisfaction are both accuracy (p < 0.001) and diversity (p < 0.05). The regression model explains 25.7% of the variance in customer satisfaction. For the ItemKNN and SVD algorithms, the results show that accuracy positively and significantly affects customer satisfaction, supporting Hypothesis 1. For the NCF algorithm, both accuracy and diversity positively and significantly affect customer satisfaction, supporting Hypothesis 1 and Hypothesis 2.
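The regression step can be sketched as an ordinary least squares fit of customer satisfaction on accuracy and diversity. The example below solves the normal equations in plain Python; the data are synthetic values generated from a known linear rule purely for illustration and are not results from the study. A statistical package would additionally report p-values and R-squared.

```python
def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_ols(accuracy, diversity, satisfaction):
    """Return [intercept, beta_accuracy, beta_diversity] from the normal equations."""
    rows = [[1.0, a, d] for a, d in zip(accuracy, diversity)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, satisfaction)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Synthetic per-user values (made up): satisfaction = 0.1 + 0.6*acc + 0.2*div.
acc = [0.50, 0.60, 0.70, 0.55, 0.65]
div = [1.00, 1.20, 1.10, 1.30, 1.05]
sat = [0.1 + 0.6 * a + 0.2 * d for a, d in zip(acc, div)]
intercept, beta_acc, beta_div = fit_ols(acc, div, sat)
```

Positive, significant coefficients on accuracy and diversity would correspond to support for Hypothesis 1 and Hypothesis 2, respectively.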
Additionally, a one-way analysis of variance (ANOVA) was conducted to determine whether accuracy, diversity, and customer satisfaction differed significantly among the recommender systems on the MovieLens dataset. The Scheffé post hoc test was used for multiple comparisons of group means. The results presented in Table 4 indicate significant differences among the recommender systems in accuracy (F = 2.002, Sig. = 0.048), diversity (F = 13.873, Sig. = 0.000), and customer satisfaction (F = 4.428, Sig. = 0.003).
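As an illustration of the one-way ANOVA described above, the sketch below computes the F statistic as the ratio of between-group to within-group mean squares, where each group would hold the values of one metric (for example, accuracy) for one recommender system. The function name `one_way_anova` is hypothetical, and the sketch stops at the F statistic and its degrees of freedom; obtaining the significance level additionally requires the F distribution, which is left to a statistics library.

```python
from typing import Sequence, Tuple


def one_way_anova(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA.

    Each element of `groups` holds the observations of one group,
    e.g. the per-run accuracy values of one recommender algorithm.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F value means that the group means differ by more than the within-group scatter would suggest. The function assumes at least two groups and some within-group variation; otherwise the ratio is undefined.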

Impact of Predictive Factor Size
As in the MovieLens dataset experiment, to determine the optimal number of factors, we performed several experiments in which the number of factors was varied from 1 to 100. Figure 6 shows the results for the Amazon dataset. The predictive performance of the ItemKNN algorithm increased and then plateaued as the neighborhood size increased. The performance of the SVD algorithm decreased slightly as the number of factors increased. For the NCF algorithm, the improvement gains diminished beyond a certain number of factors, and the quality of prediction then worsened. The quality of prediction was highest when the number of factors was 60, 5, and 5 for the ItemKNN, SVD, and NCF algorithms, respectively.

Impact of Number of Recommendation List
To determine the optimal accuracy and diversity, we ran experiments with recommendation list sizes ranging from 5 to 100, using the optimized number of factors for each algorithm. The results are shown in Figures 7-9. For all recommender system algorithms, the figures show that accuracy (F1 score) and diversity (Shannon entropy) generally improve as the size of the recommendation list increases. For each algorithm, the accuracy was highest when the recommendation list size was 70, 40, and 40, whereas diversity continued to increase with the size of the recommendation list and was highest at list sizes of 90, 80, and 90. In other words, the total number of unique items increased as the size of the recommendation list increased. These results show that both accuracy and diversity are optimized when the recommendation list size is 70, 40, and 50, respectively.
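The two metrics named above can be sketched as follows. This is a minimal illustration under assumptions that this section does not confirm: F1 is computed per user from a recommendation list and a set of relevant items, and Shannon entropy is computed over the share of recommendation slots each item receives across all users' lists, using the natural logarithm. The function names `f1_score` and `shannon_entropy` are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Set


def f1_score(recommended: Sequence[str], relevant: Set[str]) -> float:
    """Harmonic mean of precision and recall for one recommendation list.

    Assumes `recommended` contains no duplicate items.
    """
    hits = len(set(recommended) & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(recommended)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def shannon_entropy(recommendation_lists: Iterable[Sequence[str]]) -> float:
    """Entropy of the item distribution across all recommendation lists.

    Higher values mean recommendations are spread over more items
    and more evenly (assumed definition, natural logarithm).
    """
    counts = Counter(item for recs in recommendation_lists for item in recs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

Under this definition, longer lists tend to cover more distinct items, so entropy tends to grow with list size, which is consistent with the observation above that the number of unique items increased with the size of the recommendation list.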

Experimental Results
The means and standard deviations of accuracy, diversity, and customer satisfaction for the Amazon dataset are listed in Table 5. The mean values of accuracy and diversity ranged from 0.6797 to 0.7797 and from 0.6826 to 0.7162, respectively, and the mean value of customer satisfaction ranged from 0.6550 to 0.6911. Accuracy was highest for the ItemKNN algorithm (0.7797) and lowest for the SVD algorithm (0.6797). Diversity was highest for the NCF algorithm (0.7162) and lowest for the ItemKNN algorithm (0.6826). Customer satisfaction was highest for the NCF algorithm (0.6911) and lowest for the SVD algorithm (0.6550). As in the experiment above, we performed MRA on the simulation output data for the Amazon dataset, using customer satisfaction as the dependent variable and the accuracy and diversity of recommendations as the independent variables. Table 6 summarizes the MRA results for hypotheses H1 and H2 on the Amazon dataset. For both the ItemKNN and SVD algorithms, accuracy is a significant factor of customer satisfaction (p < 0.001), whereas the effect of recommendation diversity is not significant. The regression models explain 35.9% and 29.2% of the variance in customer satisfaction, respectively. For the NCF algorithm, both accuracy (p < 0.05) and diversity (p < 0.05) are significant factors of customer satisfaction, and the regression model explains 16.7% of the variance in customer satisfaction. Thus, for the ItemKNN and SVD algorithms, accuracy positively and significantly affects customer satisfaction, supporting Hypothesis 1. For the NCF algorithm, both accuracy and diversity positively and significantly affect customer satisfaction, supporting Hypothesis 1 and Hypothesis 2.
Additionally, a one-way analysis of variance (ANOVA) was conducted to determine whether accuracy, diversity, and customer satisfaction differed significantly among the recommender systems on the Amazon dataset. The Scheffé post hoc test was used for multiple comparisons of group means. The results presented in Table 7 indicate significant differences among the recommender systems in accuracy (F = 0.170, Sig. = 0.001), diversity (F = 1.265, Sig. = 0.014), and customer satisfaction (F = 6.170, Sig. = 0.000).
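The Scheffé post hoc comparison mentioned above can be sketched for pairwise differences between group means as follows. This is an illustration of the standard textbook form of the test, not a reproduction of the analysis behind Table 7: the function name `scheffe_pairwise` is hypothetical, and the critical value of the F distribution with (k - 1, N - k) degrees of freedom is assumed to be supplied by the caller from an F table or a statistics library.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def scheffe_pairwise(
    groups: Dict[str, Sequence[float]],
    f_critical: float,
) -> List[Tuple[str, str, float, bool]]:
    """Scheffé test for every pair of group means.

    `groups` maps a group label (e.g. an algorithm name) to its observations.
    `f_critical` is the critical F value for (k - 1, N - k) degrees of freedom
    at the chosen significance level, supplied by the caller.
    Returns tuples (group_a, group_b, scheffe_f, is_significant).
    """
    names = list(groups)
    k = len(names)
    n_total = sum(len(groups[n]) for n in names)
    means = {n: sum(groups[n]) / len(groups[n]) for n in names}

    # Pooled within-group mean square, as in the one-way ANOVA.
    ss_within = sum((x - means[n]) ** 2 for n in names for x in groups[n])
    ms_within = ss_within / (n_total - k)

    results = []
    for a, b in combinations(names, 2):
        pooled = ms_within * (1 / len(groups[a]) + 1 / len(groups[b]))
        # Contrast F divided by (k - 1) is compared with the critical F value.
        scheffe_f = (means[a] - means[b]) ** 2 / pooled / (k - 1)
        results.append((a, b, scheffe_f, scheffe_f > f_critical))
    return results
```

Dividing by k - 1 is what makes the Scheffé procedure conservative: a pair is flagged only if its difference would remain significant across all possible contrasts among the k groups.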

Results and Discussion
The purpose of this study is to examine the effect of recommendation accuracy and diversity on customer satisfaction when recommending products or services to customers in the e-commerce industry. Many global e-commerce companies, such as Amazon, Google, and Netflix, offer personalized recommendation services to maintain a sustainable competitive advantage. However, there is a trade-off between the accuracy and the diversity of recommendations, and debate continues about which factor has a significant impact on customer satisfaction. Thus, we applied the most popular recommender system approaches and investigated which factors affect customer satisfaction through a series of experiments with publicly available datasets widely used to evaluate recommender system performance. Finally, to test the hypotheses, MRA was conducted using customer satisfaction as the dependent variable and the accuracy and diversity of recommendations as the independent variables.
The findings of this study are as follows. First, we employed EDT for the first time to measure customer satisfaction with the most popular recommender system algorithms. Existing EDT studies were limited to the individual level, and data collection was limited and mainly based on questionnaires. To measure customer satisfaction, we performed several experiments on two datasets that capture market-wide phenomena. Second, we identified the factors that affect customer satisfaction. For traditional recommender system algorithms such as ItemKNN and SVD, accurate recommendations positively affected customer satisfaction, and this result held for both datasets. For the deep learning-based recommender system, the effects of both recommendation accuracy and recommendation diversity on customer satisfaction were found to be significant. These results can be interpreted as follows. Traditional recommendation algorithms such as ItemKNN and SVD obtain a list of recommended items from neighbors similar to the target user, and because the same few influential users tend to appear as neighbors of most users, it is often difficult to recommend a variety of products [5,7]. By contrast, since NCF is a deep learning method, it can be assumed that it recommends a wider variety of products from a wider variety of neighbors through much more computation.

Theoretical Contributions and Practical Implications
This study makes theoretical contributions concerning both recommendation performance and customer attitudes in the evaluation of personalized recommendation services. First, recommender systems have been studied extensively since the late 1990s, and most previous studies on personalized recommendation services have focused on improving accuracy [5,7,14-17]. However, when a service recommends the same product every time, customer satisfaction will decrease even if the recommender system's accuracy is high [13,24]. Other studies suggest that pursuing diversity while maintaining a certain level of accuracy can increase customer satisfaction [25,26]. In other words, personalized recommendation services face an accuracy-diversity dilemma [8,27-29]. Research on personalized recommendation services has therefore focused on enhancing recommendation performance, yet customer satisfaction with these services is just as important as system performance. Nonetheless, few studies have considered the relation between recommendation performance and customer satisfaction. Recommendation performance and customer satisfaction are likely to form complex causal relationships, and more sophisticated research methodologies are needed to account for them. This study uses real market-level e-commerce datasets to describe the complex causal relationships among these variables through simulation experiments. Furthermore, it expands the scope of research on personalized recommendation services by applying the concept of customer satisfaction, which has rarely been examined in previous studies of such services. Second, previous research measuring customer satisfaction was limited to the individual level, and data collection was limited and mainly based on questionnaires.
However, with the spread of information technologies, including the Internet, market-level data are now being collected in various fields. To utilize data that capture market-wide phenomena, it is necessary to apply theories at the market level. Therefore, we adopted the EDT approach, which is widely used in e-commerce research, to identify the accuracy of recommendations, diversity of recommendations, and customer satisfaction at the market level. This study thus contributes to expanding customer satisfaction research by utilizing diverse market-level datasets.
Finally, the experimental results of this study provide the following implications for decision-makers and practitioners in the e-commerce field. First, existing recommender systems have recommended products based on customers' purchase history with the aim of increasing the system's accuracy, on the assumption that customers are satisfied when products or services are correctly recommended. However, if customers are recommended similar products or services every time, they will be less satisfied with the recommender system. By statistically verifying that the accuracy and diversity of recommended items affect customer satisfaction, this study suggests that there is room for rethinking existing business strategies. Most existing recommender systems on e-commerce platforms use traditional algorithms such as ItemKNN and SVD; for these systems, our results suggest that sales volume can be increased by providing items that meet customer preferences, because recommendation accuracy can increase customer satisfaction. By contrast, for deep learning-based recommender systems such as NCF, the results suggest that sales volume could be increased by providing a variety of items that meet customer preferences, because pursuing diversity while maintaining accuracy can increase customer satisfaction. Second, as the e-commerce market has grown recently, the results of this study have implications for both new e-commerce sites and existing large e-commerce sites. Companies should closely examine the factors related to customer satisfaction identified in this study and look for further factors. The results of this study can serve as a basic reference for e-commerce sites seeking to reduce unnecessary costs and losses in data collection and recommender system development and to set the direction of super-personalized services.

Limitations and Future Research
Nevertheless, this study has several limitations. First, our experiments were conducted using only a movie dataset and a product dataset. Generalizing the results of this study requires further experiments using datasets from various domains. Second, we conducted experiments with traditional algorithms and a deep learning algorithm, and the experimental results show that the deep learning algorithm performs better than the traditional algorithms. Therefore, further study is needed on whether this study's results hold when other deep learning algorithms, such as convolutional neural networks (CNN) and recurrent neural networks (RNN), are used. Finally, this study identified the accuracy and diversity of recommendations as factors affecting customer satisfaction. In an e-commerce company, other evaluation metrics, such as serendipity and novelty, can also be essential factors in customer satisfaction. Therefore, future studies are needed to confirm the relationship between customer satisfaction and other evaluation metrics through a series of experiments with real datasets.