Online User Review Analysis for Product Evaluation and Improvement

Abstract: As product update cycles accelerate, traditional user research methods struggle to support decision-making in product design and improvement because of limited survey scope, insufficient samples, and time-consuming processes. This paper proposes a novel approach that acquires useful online reviews from E-commerce platforms, builds a product evaluation indicator system, and puts forward improvement strategies through opinion mining and sentiment analysis of online reviews. The effectiveness of the method is validated on a large number of user reviews for smartphones: with the evaluation indicator system, we can predict the bad review rate of a product with only 9.9% error, and improvement strategies are proposed by applying the whole approach in the case study. The approach can be applied to product evaluation and improvement, especially for products that require iterative design and are sold online with plenty of user reviews.


Introduction
To improve enterprises' competitiveness and satisfy increasing consumer demands, products are being updated faster than ever [1,2]. A novel mode of product innovation, Iterative Design of Product (IDP) [3], has emerged. IDP divides a product's update design process into multiple short-term iterative phases and updates the product through multiple rounds of improvement based on user feedback, so that it meets user needs more quickly [4,5]. As a result, the period of product design and manufacture is much shorter than before, and the analysis of user feedback becomes a vital step.
Traditional user research and product planning take a long time, and due to restrictions on time and effort, traditional methods have limitations, e.g., limited survey scope, insufficient user feedback, and time-consuming processes. Even after long-term research and planning, unpredictable problems can arise during development, and once the product goes on the market, the market reaction is difficult to predict. Therefore, it is critical to mine and analyze consumer feedback and demands quickly and effectively for IDP [6].
With the rise of E-commerce and social networks, both the types and sizes of data associated with human society are increasing at an unprecedented rate. These data are increasingly influential in research and industrial fields, such as health care, financial services, and business advice [7][8][9]. The numerous online reviews related to products are becoming an effective data stream for user research. Accordingly, approaches that rely on small-scale data to discover the rules of unknown areas tend to be replaced with big data analysis [10]. Compared with traditional user research methods, big-data-driven methods are more efficient and cost-effective and have a wider survey scope. Moreover, online reviews are spontaneously generated by users without interference, and thus represent user needs more reliably and precisely [9,11]. With online reviews, designers no longer need to react passively to user feedback but can proactively explore the potential value of user data.

Theoretical Framework
Recent studies have found that manufacturers can make use of online review data to make favorable decisions and gain competitive advantages [12,13]. Plenty of studies apply natural language processing or machine learning techniques to mine user requirements from online reviews, covering tasks such as review cleaning, information extraction, and sentiment analysis [14].
Review cleaning removes noisy reviews, including advertisements, meaningless reviews, and malicious reviews. Although big data analysis has many advantages over traditional user research, the usefulness analysis of data is becoming more and more important due to the continuous increase in the volume and generation speed of data [15]. By eliminating useless data, the remaining data become more valuable [16]. Banerjee et al. [17] believe that the characteristics of reviewers (e.g., enthusiasm, participation, experience, reputation, competence, and community) have a direct impact on the usefulness of reviews. Karimi et al. [18] believe that visual cues are useful for the identification of online reviews. Forman et al. [19] study the usefulness of reviews in the e-commerce environment, taking the professional knowledge and attractiveness of reviewers as the two criteria for evaluating review usefulness.
Obtaining valuable information from useful reviews is another valuable research direction. These studies focus on automatic review comprehension, aiming to reduce labor costs. Dave et al. [20] conduct a systematic study on opinion extraction and semantic classification, with experimental results that outperform machine learning methods. Based on Chinese reviews, Zhang et al. [21] detect the shortcomings of different brands of cosmetics through a feature extraction algorithm, helping manufacturers improve product quality and competitiveness. Novgolodov et al. [22] extract product descriptions from user reviews using deep multi-task learning.
Sentiment analysis of online reviews is a method to discover users' attitudes and opinions on certain aspects, and it is widely applied to public opinion monitoring, marketing, customer services, etc. Jing et al. [23] establish a two-stage verification model and compare the data analysis results of the two stages, showing that the sentiment analysis method is more effective than traditional user research. In addition, Kang et al. [24] use sentiment analysis to analyze six features of mobile applications, such as speed and stability, achieving good results. Yang et al. [25] propose a novel deep-learning-based approach for Chinese E-commerce product reviews.
These studies mostly address online review analysis from a technical perspective, aiming to improve accuracy and obtain better results for different situations. However, few studies systematically consider how to apply these techniques, i.e., how the results can contribute to manufacturers' product design and improvement processes. As enterprise resources are limited, the mass of information and user requirements derived from online reviews cannot be implemented completely. Therefore, it is important to analyze the improvement priority of different product aspects. Moreover, enterprises require specific suggestions for product design and improvement; information extraction and sentiment analysis help, but are not enough on their own.

Aims and Research Questions
In this study, we propose a novel approach for product evaluation and improvement based on text mining and sentiment analysis of online reviews. The approach addresses strategy-oriented decision-making for product design and improvement based on online user reviews. Concretely, the approach analyzes the usefulness of online reviews by extracting product attributes and user emotion, evaluates which product attributes should be improved first through an evaluation indicator system, and provides improvement strategies.
To validate the effectiveness of the method, we take the smartphone as an example for the case study, as the smartphone is a product with many variants and a fast iteration rate. The specific research questions (RQs) for the case study are:
• RQ1: What are the results of product attribute acquisition and useful review analysis?
• RQ2: Does the evaluation indicator system work for product evaluation, and how accurate is its priority evaluation?
• RQ3: What are the results of the product improvement strategy analysis, i.e., which valuable improvement ideas are refined?
Although the case study takes the smartphone as an example, the method can be applied to other products by adjusting the product attributes accordingly. Thus, the research contributes generally to manufacturers' product design and improvement by providing systematic strategy suggestions.

Materials and Methods
In this section, we introduce the main steps of the proposed method to systematically analyze online user reviews and provide improvement ideas for products. The overall framework of the method is shown in Figure 1, which mainly contains three steps:

1. Useful review acquisition. We apply a web crawler system to collect online reviews and preprocess them with Latent Dirichlet Allocation (LDA) [26] to obtain product attributes and user emotion. Reviews that contain both attribute words and emotion words are considered useful reviews.

2. Product evaluation indicator system establishment. With the product attributes and user emotion, we establish multi-dimensional indicators (e.g., user satisfaction and attention) to evaluate the priority of product attributes for improvement.

3. Improvement strategy analysis. By selecting negative reviews for the target product attributes and applying text mining, we can find the sources of dissatisfaction and propose improvement strategies.

Useful Review Acquisition
The first step is to acquire product attributes from reviews and conduct useful review analysis. A product attribute is extracted and summarized from reviews to reflect a series of related attribute words. For example, the product attribute of appearance may contain attribute words, such as colors, shapes, materials, sizes, etc.

Product Attribute Acquisition
To obtain product attributes, current practice tends to manually select attribute words with higher statistical word frequencies according to existing knowledge. These product attributes are considered to be the ones users care about. However, in some reviews an attribute word appears more than once, which distorts the word-frequency statistics. Manual screening can compensate for this, but its results may be subjective.
To address these problems, the LDA topic extraction method is applied. LDA summarizes the topics of a document; here, a topic represents a product attribute expressed in a review, so LDA is used to extract product attributes after the reviews are tokenized. The main idea of LDA is to regard each review as a mixed probability distribution over all topics, where each topic is itself a probability distribution over words. Based on this idea, the LDA model is obtained by adding Dirichlet priors. We use the LDA module from scikit-learn, a Python package. However, the LDA output contains noise: some topics only contain adjectives, adverbs, and verbs, and these are deleted manually. The remaining topics are user-concerned and represent product attributes. Each topic consists of a series of words describing one product attribute, and these words are defined as attribute words. By setting different numbers of product attributes (K), the similarities among the resulting product attributes are calculated. The smaller the similarity value, the fewer repeated words among the product attributes, and the more appropriate the corresponding K [27]. With this method, we can determine an appropriate number of topics to represent product attributes.
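The procedure above can be sketched with scikit-learn's LDA module. The tiny English corpus, the K range, and the use of cosine similarity between topic-word distributions are illustrative assumptions; the paper's actual data are tokenized Chinese reviews and K is searched up to larger values.

```python
# Sketch of LDA-based attribute extraction with K selection via inter-topic
# similarity (lower mean similarity -> fewer repeated words -> better K).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "battery lasts long charging fast",
    "screen bright display sharp",
    "battery drains quickly charging slow",
    "screen cracked display dim",
    "camera photo clear lens sharp",
    "camera blurry photo noisy",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

def mean_topic_similarity(k, X, seed=0):
    """Average pairwise cosine similarity between topic-word distributions."""
    lda = LatentDirichletAllocation(n_components=k, random_state=seed).fit(X)
    topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    unit = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    sim = unit @ unit.T
    iu = np.triu_indices(k, k=1)
    return sim[iu].mean()

# Pick the K with the lowest mean inter-topic similarity (K = 18 in the paper).
best_k = min(range(2, 5), key=lambda k: mean_topic_similarity(k, X))
```

Each topic's highest-weight words would then be inspected (and noisy topics removed manually) to obtain the attribute words, as described above.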

Useful Review Analysis
Whether a review is useful is determined by whether it contains both attribute words and emotion words expressing the user's emotional inclination towards the product attributes. The attribute words are obtained in the previous step of product attribute acquisition, while the emotion words are obtained from HowNet [28], a common-sense knowledge base describing concepts and the relationships between concepts in Chinese and English, established over decades of effort by Mr. Dong Zhendong, a famous expert in machine translation. HowNet has been used by many scholars in the study of Chinese text, and its emotion word thesaurus is now one of the most famous thesauri for Chinese emotion words. In this paper, the attribute words obtained above and the emotion words in HowNet are used as the basis for evaluating the usefulness of reviews.
The review usefulness analysis is formulated in Equation (1):

r_i = r_i^a · r_i^e, (1)

where r_i is the evaluation result for the i-th review, r_i^a indicates whether the i-th review contains attribute words, and r_i^e indicates whether it contains emotion words. All three variables are binary, with 1 meaning 'yes' and 0 meaning 'no'. If the i-th review contains both attribute words and emotion words, r_i equals 1, meaning the review is useful.
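The usefulness rule can be written as a few lines of Python. The two word sets below are illustrative placeholders; in the paper, attribute words come from the LDA topics and emotion words from HowNet.

```python
# Minimal sketch of the usefulness rule: a review is useful (r_i = 1) only if
# it contains at least one attribute word AND at least one emotion word.
ATTRIBUTE_WORDS = {"battery", "screen", "camera"}       # placeholder set
EMOTION_WORDS = {"great", "terrible", "love", "disappointing"}  # placeholder set

def is_useful(review_tokens):
    r_a = int(any(t in ATTRIBUTE_WORDS for t in review_tokens))  # r_i^a
    r_e = int(any(t in EMOTION_WORDS for t in review_tokens))    # r_i^e
    return r_a * r_e                                             # r_i

is_useful(["the", "battery", "is", "great"])   # -> 1 (useful)
is_useful(["arrived", "yesterday"])            # -> 0 (not useful)
```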

Product Evaluation Indicator System
Based on the attribute words of the product attributes, each useful review is annotated with one or several product attributes if it contains the corresponding attribute words, since a single review can mention multiple product attributes.
To figure out how users evaluate the product, we apply the sentiment analysis tool TextBlob [29] to calculate a sentiment score for each review expressing user emotion; the result represents individual user satisfaction. The sentiment score of the i-th review for the j-th product attribute is marked as s_ij and ranges over [−1, 1], where −1 means extreme dissatisfaction and 1 means extreme satisfaction.
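As a self-contained stand-in for TextBlob (which may not be installed everywhere), the toy lexicon scorer below mimics the same interface: a polarity s_ij in [−1, 1]. The lexicon entries are illustrative assumptions, not TextBlob's actual lexicon.

```python
# Toy polarity scorer with the same contract as TextBlob's sentiment.polarity:
# average the polarities of matched lexicon words, defaulting to 0.0 (neutral).
POLARITY = {"great": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}

def sentiment_score(tokens):
    hits = [POLARITY[t] for t in tokens if t in POLARITY]
    return sum(hits) / len(hits) if hits else 0.0

# With TextBlob installed, the equivalent call would be:
#   from textblob import TextBlob
#   s = TextBlob("the battery is great").sentiment.polarity
```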
We construct three indicators for product evaluation: user satisfaction (S), user attention (A), and priority (P). As the three indicators describe concrete product attributes, each product attribute has its own three indicators. To distinguish the overall indicators from the concrete indicators of each product attribute, we add subscripts to the abbreviations (e.g., for the first product attribute, user satisfaction is marked as S_1). The details are as follows:

1. To understand users' satisfaction (S_j) with the j-th product attribute, the average emotion value is calculated as Equation (2):

S_j = (1/N_j) Σ_{i=1}^{N_j} s_ij, (2)

where N_j is the number of reviews in the j-th product attribute group. A high S_j means that users are generally satisfied with the j-th product attribute.

2. The second indicator is users' attention to individual product attributes. If a review comments on a product attribute, the reviewer is considered to be concerned about that attribute. To measure user attention (A_j) on the j-th product attribute, we calculate the proportion of reviews in the corresponding product attribute group, as in Equation (3):

A_j = N_j / N, (3)

where N is the total number of useful reviews. A high A_j means that the j-th product attribute is hotly discussed in user reviews and gains high user attention.

3. For each product attribute, the manufacturer aims to make more users more satisfied, which corresponds to maximizing S_j · A_j. However, because enterprise resources are limited, the optimal decision is the one that maximizes the enterprise's benefit, so it is important to first select several high-priority product attributes for improvement. The priority (P_j) is calculated as the evaluation promotion space in Equation (4):

P_j = A_j · (1 − S_j), (4)

so that lower satisfaction and higher user attention yield a larger evaluation promotion space.
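The three indicators can be computed from the grouped sentiment scores in a few lines. The data are illustrative, and the priority formula P_j = A_j · (1 − S_j) is our reading of the "evaluation promotion space" described above (lower satisfaction and higher attention give higher priority), not a quotation of the paper's exact Equation (4).

```python
# Sketch: per-attribute indicators S_j (mean sentiment), A_j (review share),
# and P_j (assumed promotion-space form A_j * (1 - S_j)).
def indicators(scores_by_attr, total_reviews):
    out = {}
    for attr, scores in scores_by_attr.items():
        s = sum(scores) / len(scores)   # S_j: average emotion value
        a = len(scores) / total_reviews # A_j: proportion of useful reviews
        p = a * (1 - s)                 # P_j: assumed promotion-space form
        out[attr] = {"S": s, "A": a, "P": p}
    return out

res = indicators({"battery": [0.8, 0.6], "screen": [-0.2, 0.0, 0.4]},
                 total_reviews=5)
# battery: S = 0.7, A = 0.4, P = 0.4 * 0.3 = 0.12
```

Attributes would then be ranked by P to select the improvement directions.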

Improvement Strategy Analysis
Through the evaluation indicator system, we can roughly determine the directions (product attributes) for product improvement. However, this is not enough, since concrete improvement strategies are still lacking. To mine the details for improving a product attribute, we refine the opinions in negative reviews on the related attributes with Baidu's AipNlp, a leading natural language processing (NLP) toolkit for Chinese, so as to analyze users' dissatisfaction. The specific steps of negative review mining are as follows:

1. Obtain the related reviews in the target product attribute groups.
2. Select the negative reviews among them.
3. Apply text mining methods to extract users' opinions and then manually adjust the results.
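The three mining steps can be sketched as below. The keyword-count "text mining" is only a placeholder for the opinion extraction done with Baidu's AipNlp in the paper, and the sample reviews and threshold are illustrative assumptions.

```python
# Sketch of negative review mining: (1) gather reviews for a target attribute,
# (2) keep those with negative sentiment, (3) extract opinions (placeholder:
# most frequent tokens; the paper uses Baidu's AipNlp for this step).
from collections import Counter

def negative_opinions(reviews, attr_words, threshold=0.0):
    # reviews: list of (tokens, sentiment_score) pairs
    negatives = [toks for toks, s in reviews
                 if s < threshold and any(t in attr_words for t in toks)]
    counts = Counter(t for toks in negatives for t in toks)
    return counts.most_common(3)

reviews = [
    (["system", "crash", "slow"], -0.8),
    (["system", "smooth"], 0.6),
    (["system", "slow", "frozen"], -0.5),
]
top = negative_opinions(reviews, {"system"})
# "slow" surfaces as a recurring complaint among the negative reviews
```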

Case Study and Result Discussion
In this section, we conduct a case study on the smartphone with online user reviews applying the proposed approach and discuss the results for the three main research questions. For the structure of the section, we firstly introduce the review datasets used and then discuss the research questions.

Data Collection and Preprocessing
In China, Jingdong (JD.com) and Taobao (taobao.com) are the most well-known E-commerce platforms. As Jingdong adopts a B2C (Business to Customer) model, there is less internal competition than on Taobao, a C2C (Consumer to Consumer) platform. Thus, Jingdong has fewer fake reviews, and its data is more reliable for analysis.
We crawl a total of 1,257,482 online reviews of the 60 best-selling smartphones. To ensure the accuracy of the review data, we preprocess the crawled reviews by deleting duplicated reviews, reviews containing advertisements, and reviews consisting only of punctuation marks, finally obtaining 1,189,357 reviews.
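The preprocessing step can be sketched as a simple filter pipeline. The advertisement keyword list is an illustrative assumption; the paper does not specify how ads are detected.

```python
# Sketch of review preprocessing: drop exact duplicates, reviews containing
# (assumed) ad keywords, and reviews consisting only of punctuation.
import string

AD_KEYWORDS = {"http", "discount", "coupon"}  # illustrative assumption

def clean(reviews):
    seen, kept = set(), []
    for r in reviews:
        text = r.strip()
        if text in seen:
            continue                                            # duplicate
        if any(k in text.lower() for k in AD_KEYWORDS):
            continue                                            # advertisement
        if not text or all(c in string.punctuation + " " for c in text):
            continue                                            # punctuation only
        seen.add(text)
        kept.append(text)
    return kept

cleaned = clean(["Great phone!", "Great phone!", "!!!", "coupon inside http://x"])
# -> ["Great phone!"]
```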

Results for Useful Review Acquisition (RQ1)
For RQ1, we first analyze the product attributes of the smartphone. To determine the K parameter of LDA, i.e., the appropriate number of topics, we increase K one by one and find that (1) when K is less than 18, the similarity decreases with K; and (2) when K is greater than 18, the similarity increases with K. Therefore, when K equals 18, the similarity among the product attributes is lowest. We set K to 18 and obtain 18 topics from the LDA output, each representing one product attribute. However, three of them only contain adjectives, adverbs, and verbs; these three topics cannot represent product attributes and are removed manually.
For topic interpretation, we invited two professional smartphone designers from Huawei to discuss the product attributes of smartphones and to interpret the attribute words of each topic as one product attribute. Combined with the classification standards for smartphone attributes on the official websites of well-known enterprises, such as Apple [30] and Huawei [31], the remaining 15 topics are finally interpreted as product attributes. The explanations of the 15 product attributes are listed in Table 1, and the number of attribute words corresponding to each product attribute is shown in Table 2.
After obtaining the product attributes and attribute words for the smartphone, we perform the useful review analysis and finally obtain 808,426 useful reviews based on Equation (1). For example, in Table 1, data connection is explained as the network performance of the smartphone; appearance as the shape, color, and material of the smartphone; operating system as iOS or Android, plus some applications and functions in the system; and audio & video as audio and video playback quality.

Product Evaluation Indicator System Establishment and Validation (RQ2)
As a result, we establish an evaluation indicator system for the smartphone with 15 (product attributes) × 3 (indicators) parameters; the three indicators are S (user satisfaction), A (user attention), and P (priority).
Answering RQ2 means examining the effectiveness of the evaluation indicator system, i.e., the accuracy of its evaluation of the product. On Jingdong, users can give an overall evaluation of a product with good/medium/bad reviews. The bad review rate is an important indicator of the urgency of improvement: a high bad review rate means there is much room for improvement.
The bad review rate can thus be regarded as the ground truth for the evaluation indicator system. Based on principal component analysis (PCA) [32], we apply multiple linear regression between the evaluation indicators and the bad review rate. If the model can accurately predict the bad review rate from the evaluation indicators, the effectiveness of the evaluation system is verified.

Principal Component Analysis
We choose the comprehensive indicator P for validation, as validating P also demonstrates the value of S and A. First, we apply the Pearson correlation coefficient [33] to measure the correlation between the P values of the 15 product attributes and the bad review rate across the 60 smartphones. If the coefficient is greater than 0.3, the two sets of data are considered correlated, and if it is greater than 0.5, the correlation is strong. The results of the Pearson correlation coefficient are shown in Table 3: P for 7 product attributes (battery, service, price, durability, screen, data connection, and operating system) correlates with the bad review rate.
Then, we apply PCA to the 7 product attributes to reduce the complexity of the linear regression analysis. The result of PCA is shown in Table 4. The variance contributions of the first and second principal components (PCs) are high, and their cumulative variance contribution reaches 76.889%. From the eigenvector analysis in Table 5, the first PC mainly represents operating system, data connection, service, battery, screen, and durability, while the second PC mainly represents price. The two PCs align with the 'performance' and 'cost' pursued by customers; therefore, we select them for linear regression.
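The dimension-reduction step can be sketched with scikit-learn's PCA. The data below are synthetic stand-ins for the 60 × 7 matrix of per-phone attribute priorities; the correlation structure is artificially induced so that one component dominates, as in the paper.

```python
# Sketch: PCA on 60 phones x 7 attribute priorities, keeping the two leading
# components (which cover ~76.9% of variance in the paper's Table 4).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
P = rng.normal(size=(60, 7))     # synthetic priority matrix (assumption)
P[:, 1:6] += P[:, [0]]           # induce correlation so PC1 dominates

pca = PCA(n_components=2).fit(P)
components = pca.transform(P)    # the two PCs used for regression
cumvar = pca.explained_variance_ratio_.sum()
```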

Multiple Linear Regression Analysis
After the dimension reduction of the 7 product attributes, we apply multiple linear regression between the 2 PCs and the bad review rate. Among the 60 smartphone samples of the first and second PCs with the bad review rate, we randomly choose 2/3 of the samples for training the linear model and the remaining 1/3 for testing. The fitted multiple linear regression equation takes the form

y = b_0 + b_1·x_1 + b_2·x_2,

where y is the predicted bad review rate, and x_1 and x_2 are the first and second PCs. Generally, 0.5 < R² < 0.7 indicates a moderate effect size, and R² > 0.7 a strong one [34]. We believe that the obtained R² of 0.537 indicates that the model fits the data well, and the p-value of the significance test for the overall regression effect demonstrates the statistical significance of the multiple linear regression. For model testing, we calculate the predicted bad review rate for the remaining 20 smartphone samples and compare the predictions with the ground truth, as listed in Table 6. The difference between the predicted bad review rate and the true value is small, with a mean absolute percentage error (MAPE) of 9.9%. According to the interpretation of MAPE values by Montaño et al. [35], a model with MAPE below 10% is highly accurate. The results show that the multiple linear regression model based on the evaluation indicators predicts the bad review rate well, which validates the effectiveness of the evaluation indicator system.
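The train/test validation can be sketched as below. The data are synthetic (the paper's PC scores and bad review rates are not public), so the resulting R² and MAPE here are not the paper's 0.537 and 9.9%; only the procedure is illustrated.

```python
# Sketch: fit a linear model from the two PCs to the bad review rate on 2/3
# of the 60 samples, then compute MAPE on the held-out 1/3 (20 samples).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))                       # the two PCs (synthetic)
y = np.abs(0.05 + 0.02 * X[:, 0] - 0.01 * X[:, 1]  # synthetic bad review rate
           + rng.normal(0, 0.002, 60))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)
mape = np.mean(np.abs((pred - y_te) / y_te)) * 100  # percent
```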

Improvement Strategy Analysis for Smartphones (RQ3)
In this part, we firstly analyze the evaluation indicators for smartphones to figure out those product attributes that need the most improvement and then propose concrete improvement ideas.

The Analysis of Evaluation Indicators for Overall Smartphones
To understand the general evaluation of the smartphones on the market, we analyze the overall review data of the 60 smartphones and calculate S, A, and P by Equations (2)-(4). Table 7 shows the results of the overall evaluation indicators.
By analyzing the results, we find that users are most satisfied with appearance, price, and processor, and least satisfied with data connection, packaging list, and audio & video. For user attention, operating system and appearance are the most concerning attributes, while packaging list, size & weight, and data connection are the least. This may be because appearance and operating system are the two product attributes users are most exposed to, with the greatest impact on user experience. For priority, the most urgent product attribute for improvement is the operating system, followed by appearance and service.

The Analysis of Evaluation Indicators for Individual Smartphones
The evaluation indicator system can also be applied to a single smartphone by filtering the reviews for the target smartphone. Here, we choose the iPhone X for the case study. As a product commemorating the tenth anniversary of Apple's iPhone, the iPhone X set a direction for the development of smartphones with its full display screen, innovative interaction mode, and unique Face ID, which sparked heated discussion among users.
The results of the evaluation indicators for the iPhone X are shown in Table 8. iPhone X users are most satisfied with size & weight, price, and appearance, and least satisfied with audio & video and packaging list. Users are most concerned about operating system and service, and least concerned about storage and packaging list. The most urgent product attribute for improvement is the operating system, followed by appearance and service, the same as the overall results. Taking the iPhone X as an example, the three product attributes with the highest priority are operating system, service, and appearance, so we apply the three negative review mining steps to these attributes; the results are shown in Table 9. Combining the mining results with the corresponding reviews, users' dissatisfaction with the operating system lies in: the system easily collapses, instability, bugs, slow boot, freezing, crashes, overheating, slow response, insensitivity, few functions, slow App download speed, poor software compatibility, etc.
The main dissatisfaction with service concerns poor customer service attitude, unprofessionalism, slow user event processing, slow delivery, poor courier service, poor after-sale service, cumbersome procedures, etc. The main problem is customer service, which accounts for more than half of the negative reviews.
The dissatisfaction with appearance mainly lies in: the notch screen is uncomfortable, an ugly or unattractive appearance, small size, dull colors, lack of personality, few color choices, mediocre workmanship, defects, etc.
After identifying users' dissatisfaction, we put forward specific improvement strategies based on expert opinions. The specific improvement strategies for the iPhone X are as follows:

1. Enhance the user experience of the operating system: improve the stability, fluency, and compatibility of various App versions; optimize the speed and responsiveness of the operating system and accelerate boot speed; implement features to optimize the temperature control system; enhance the gaming experience to improve the efficiency and fluency of games; expand the domestic server capacity of the App Store to speed up App updating and downloading.

2. Enhance customer service professionalism: improve customer service attitude, accelerate user event processing, simplify the customer service process, and increase customer service staffing. Improve the delivery services: speed up product delivery and transport, and improve couriers' delivery service attitude.

3. Enhance the personality of the smartphone: enrich the product's colors to provide a wealth of choices; raise the standard of smartphone styling design in line with current consumer aesthetics, and reduce or eliminate the notch screen; provide large-size models to give consumers a variety of choices; improve product workmanship by improving the production quality of OEM factories and reducing defects.

Discussion
Overall, this study combines big data, text mining, and sentiment analysis techniques to acquire useful information from online user reviews for product evaluation and improvement. The main contribution is a systematic approach that guides manufacturers in analyzing online user reviews, determining high-priority product attributes to improve, and proposing concrete improvement strategies. The study builds on various findings in other studies, such as the validation of the value of online user reviews for product development [36], product feature extraction and sentiment analysis of online reviews [37], and review helpfulness ranking [38]. These studies provide adequate technical and theoretical support for establishing the product evaluation indicator system. The product evaluation indicator system is validated with a linear regression model predicting true user satisfaction (the bad review rate), yielding an R² of 0.537 and a MAPE of 9.9%. Compared with Zhao et al. [39], who likewise used online reviews to predict user satisfaction (rating) and obtained an R² of 0.385, we believe our product evaluation indicator system can accurately reflect users' evaluation of a product. Unlike other studies aiming to extract improvement ideas from online reviews [40,41], our approach considers manufacturers' resource limitations and therefore analyzes the improvement priority of different product attributes. The improvement ideas acquired are thus the most urgent and important for product improvement under our algorithm and have more practical meaning.

Conclusions and Future Work
Traditional product design uses questionnaires and interviews to obtain user feedback. However, these survey methods take longer, cover a smaller survey scope, and incur higher research costs, resulting in a longer design cycle. The advent of the big data era provides a new way to mine user opinions for product improvement: designers can better understand user needs and make decisions through big data analysis and sentiment analysis. This paper proposes a product evaluation indicator system to support product improvement, combining big data analysis with sentiment analysis. The proposed method can make a comprehensive and validated evaluation of product indicators based on big data, gather more user feedback within a shorter period, track users' preferences more efficiently, and find improvement strategies for products.
Nevertheless, there are a few limitations to our study. First, product attribute acquisition involves many manual steps that influence the results. Second, the improvement strategies derive mainly from user opinions and neglect product innovation, because the big data analysis is based on the mass of users. In addition, the cost of product improvement is not considered in the strategies, although in practice it is a factor that enterprises must take into account. Future studies can focus on the following directions:

1. Improve the automation of review analysis so that the method can be quickly migrated to other products.
2. Consider the tradeoff between user opinions, innovation, and cost.
3. Apply the improvement strategies in real production practice to validate the methods.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The data are not publicly available due to privacy and ethical restrictions.