Using the Least Squares Support Vector Regression to Forecast Movie Sales with Data from Twitter and Movie Databases

Due to the rapid prominence and popularity of social media, social broadcasting networks with voluntary information sharing have become one of the most powerful ways to spread word-of-mouth opinions, and thus influence consumers' preferences toward products. Therefore, sentiment analysis data from social media have become more important in forecasting product sales. For the movie industry, the opinions expressed on social media have an increasing impact on movie sales. In addition, some databases, such as Box Office Mojo and the Internet Movie Database (IMDb), contain structured data for predicting movie sales. Thus, three categories of data (data from movie databases, data from tweets, and hybrid data combining movie databases and tweets) are employed symmetrically in this study. The aim of this study is to employ the least squares support vector regression (LSSVR) to forecast worldwide movie sales according to these three forms of data. In addition, three other forecasting techniques, namely, the back propagation neural network (BPNN), the generalized regression neural network (GRNN), and the multivariate linear regression (MLR) model, were used to forecast movie sales with the three types of data. The empirical results show that the LSSVR model with hybrid data can obtain more accurate results than the other forecasting models with all data types. Thus, the LSSVR model with data from movie databases and tweets is a feasible and promising method to forecast movie sales.


Introduction
Due to the booming popularity of the Internet, people have become accustomed to expressing opinions through social media, which has subsequently become a crucial communication channel among consumers. Therefore, data gathered from social media have become one of the most essential sources for learning consumers' preferences. Text is one of the major data forms appearing on social media, such as Twitter and Facebook, and thus, sentiment analysis is an effective way to gain insight from text on social media. Pai and Liu [1] used the sentiment analysis of tweets and stock market values to predict vehicle sales, and the numerical results indicated that the use of hybrid data can result in a satisfying forecasting performance. Kang et al. [2] employed tweets and semantic analysis to investigate vaccine opinions, which were divided into positive, negative, and neutral, and the results showed that semantic analysis is an effective method to learn vaccine willingness. Giatsoglou et al. [3] used four review datasets to forecast the emotions of comments on websites in Greek and English, where an in-house Greek dictionary and an English emotional dictionary from the Word-Emotion Association Lexicon were used as basic databases to analyze emotion words and sentences in Greek and English. Xu et al. [4] developed a self-learning convolutional neural network framework for clustering short texts, in which the optimal clusters are obtained by K-means approaches. The experimental results revealed that the proposed framework is an effective and flexible method to cluster short text datasets. Tran and Kavuluru [5] utilized short textual descriptions of symptoms from the histories of psychiatric patients to predict a set of common mental conditions, where two deep neural networks, namely, convolutional neural networks and recurrent neural networks with hierarchical attention, were employed in the investigation.
The numerical results showed that the proposed methods provide feasible and effective ways of analyzing the short textual history of symptoms for psychiatric evaluation. Oliveira et al. [6] employed Sina microblog sentiment data to forecast stock market indices, including return on investment, stock volatility, and trading volume, and the numerical results indicated that microblogging data are able to predict stock market behaviors. Leitch and Sherif [7] used Twitter sentiment scores to study the relationship between the successions of chief executive officers (CEOs) and stock returns in both the U.K. and the U.S.A., and the results of that investigation illustrated a positive and highly significant relationship between CEO age at announcement and stock returns. Li et al. [8] analyzed sentimental texts from Chinese microblogging systems to predict the emotions of emergency disaster-related events, where the emotional lexicon was collected from COAE2014 (6th China Opinion Analysis and Evaluation). Experimental results revealed that the developed model was feasible and effective for two real-world datasets. Poria et al. [9] designed a deep convolutional neural network with linguistic patterns to analyze customers' opinions of products or services by aspect extraction, and the experimental outcomes indicated that the proposed model could obtain more accurate results than other state-of-the-art techniques. Corea [10] employed English tweets to predict the emotions of the stock investors of three companies, namely, Apple, Facebook, and Google, and the results showed that the posting volume of tweets is an essential factor in increasing the accuracy of forecasting. Huberty [11] designed a naive Bayes classifier and used tweets to predict the election results in the United States, Germany, and other democratic countries.
The numerical results revealed that the proposed model can achieve very satisfactory forecasting accuracy, and thus, is a feasible way to predict election results. Li and Wu [12] integrated support vector machines and K-means to analyze the emotion of comments from the Sina sports forum, and obtained the emotional polarities of the texts. The experimental results revealed that the forecasting accuracy generated by the developed model is satisfactory.
The literature on forecasting movie sales is addressed as follows: Ma et al. studied the influences of movie reviews on movie sales, and reported that advertising reviews have more impact at the movie opening time; however, the influence decreases within two weeks after the release of the movie. Ru et al. [13] employed LSTM networks to forecast daily box office performance with both dynamic data and static data, and the numerical results indicated that LSTM networks outperformed the multilayer perceptron neural networks and support vector regression models in terms of forecasting accuracy. Baek et al. [14] studied the impacts of four social media sites on box office revenues in the early and later stages of movie opening periods, and the findings revealed that Twitter has more of an influence on box office revenue in the early stage of movie opening periods, while Yahoo! Movies has a greater impact in the late stage of a movie's opening. In addition, this study showed that there are no impact differences between blogs and YouTube on box office revenues in both the initial and the later periods. Lee et al. [15] investigated the impact of emotional entropy on the relationship between word-of-mouth and movie box office sales, and found that the strength of entropy obtained from reviews positively influences the relationship between word-of-mouth and movie box office sales. Ding et al. [16] studied the impact of "likes" provided by Facebook on box office sales. The "likes" were collected in five different time intervals before movies were released, and the results indicated that "likes" have a highly positive impact on box office performance. Hur et al. [17] integrated machine learning approaches with the independent subspace method to forecast box office performance by using the sentiments of movie reviews. They found that the proposed models were able to obtain accurate and robust forecasting results for different forecasting periods. Kim et al. [18] employed machine learning-based techniques with social network service data to predict box office performance, where genetic algorithms were applied for selecting the essential input variables for the proposed models. The empirical results revealed that the designed box office performance forecasting framework achieved obvious improvements in terms of forecasting accuracy. Gopinath et al. [19] investigated the influences of pre- and post-release blog volume, blog valence, and advertising on opening day box office performance one month later in various geographic markets. Multivariate regression models were employed to analyze the collected data, and the numerical results revealed that the number of blogs and advertisements are critical for box offices in the pre-release periods, while the blog valence and user ratings are important for box offices in the post-release periods. Rui et al. [20] used the dynamic panel data model, support vector machines, naive Bayesian methods, and tweets to predict the impact of word-of-mouth on movie sales, and found that its influence was proportional to the number of followers. Karniouchina [21] analyzed the impact of buzz on movie distribution and box office income, and the results showed that it helps movie fans to anticipate the film before its release. Chakravarty et al. [22] investigated the influence of online users' comments on the box office of forthcoming movies, and three hypotheses were tested. This study concluded that positive reviews may be drowned out by negative reviews, and thus, negative reviews are crucial and should receive greater attention. Mishne and Glance [23] employed correlation analysis with blogger sentiment to predict movie sales, and the results indicated that sentiment could be an effective variable in forming a model for predicting movie sales.
Liu [24] studied the influence of word-of-mouth on movie box office revenues, and the numerical results revealed that the impact of word-of-mouth is strongest in the movie's prerelease week and the first opening week. The literature reviewed above predominantly examined the positive influences of social media data on forecasting tasks. However, the influences of social media data, structured data, and hybrid data on movie sales predictions were rarely investigated. In addition, the least squares support vector regression [25] has been a prevailing technique in dealing with multivariate regression problems [26]. Thus, the aim of this study is to employ the least squares support vector regression to predict movie sales by using different data types symmetrically. Three other forecasting models were applied to the same data sets to compare and analyze the results. The rest of this study is organized as follows: Section 2 introduces the least squares support vector regression method and the architecture of forecasting movie sales. The numerical results are demonstrated in Section 3, and Section 4 presents the conclusions.

The Least Squares Support Vector Regression
To reduce computational complexity, the least squares support vector regression improves on the support vector regression by solving a set of linear equations instead of a convex quadratic programming problem. The support vector regression [27][28][29] originates in support vector machines [30,31], which were designed to solve binary classification problems and were then extended to regression functions. While the LSSVR has been applied to regression forecasting problems [26], it has received little attention for forecasting movie sales in the multivariate form. The LSSVR is briefly explained as follows. Consider a training data set TD including m data points,
$$TD = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$$
where $x_i$ and $y_i$ represent the input data and output data, respectively. The least squares support vector regression can be expressed as the following optimization model [25]:
$$\text{Minimize: } \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{m}\mu_i^2$$
$$\text{subject to } y_i = w^{T}\phi(x_i) + b + \mu_i, \quad i = 1, \ldots, m$$
where $w$ is the weight vector, i.e., the normal vector of the hyperplane, $C$ is a regularization factor that trades the minimization of the estimation error off against the function smoothness, $\mu_i$ is the error variable, $\phi(x_i)$ is the mapping function, mapping $x_i$ from the original space into a high-dimensional feature space, and $b$ represents a bias parameter. By using the Lagrange multiplier method, the optimization problem can be reformulated to find solutions of $w$ and $\mu$, as follows:
$$L(w, b, \mu, \lambda) = \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{m}\mu_i^2 - \sum_{i=1}^{m}\lambda_i\left(w^{T}\phi(x_i) + b + \mu_i - y_i\right)$$
where $\lambda_i$ indicate the Lagrange multipliers. By partially differentiating $L$ with respect to the variables $w$, $b$, $\mu_i$, and $\lambda_i$, and setting all partial derivatives equal to zero, the solution of the problem can be obtained according to the Karush-Kuhn-Tucker conditions [32][33][34]. Thus, Equations (4)-(7) are derived:
$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m}\lambda_i\phi(x_i) \tag{4}$$
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m}\lambda_i = 0 \tag{5}$$
$$\frac{\partial L}{\partial \mu_i} = 0 \;\Rightarrow\; \lambda_i = C\mu_i \tag{6}$$
$$\frac{\partial L}{\partial \lambda_i} = 0 \;\Rightarrow\; w^{T}\phi(x_i) + b + \mu_i - y_i = 0 \tag{7}$$
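As a worked step, the four optimality conditions can be collapsed into a single linear system. The sketch below assumes the standard LSSVR objective $\frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_i \mu_i^2$ with the constraint $y_i = w^{T}\phi(x_i) + b + \mu_i$. The kernel matrix $K$, the identity matrix $I$, and the all-ones vector $\mathbf{1}$ are notation introduced here for illustration only.

```latex
% Stationarity in w and mu_i gives
%   w = \sum_j \lambda_j \phi(x_j), \qquad \mu_i = \lambda_i / C .
% Substituting both into the constraint y_i = w^T \phi(x_i) + b + \mu_i:
\sum_{j=1}^{m} \lambda_j K_{ij} + b + \frac{\lambda_i}{C} = y_i ,
\qquad K_{ij} = \phi(x_i)^{T}\phi(x_j), \quad i = 1, \ldots, m .
% Together with \sum_i \lambda_i = 0 this is the (m+1) x (m+1) linear system
\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & K + C^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \lambda \end{bmatrix}
=
\begin{bmatrix} 0 \\ y \end{bmatrix} .
```

Because the unknowns $b$ and $\lambda$ enter only linearly, no quadratic programming solver is needed, which is the computational advantage claimed for the LSSVR over the support vector regression.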
Subsequently, solving Equations (4)-(7) by the least squares method yields the solution of the LSSVR in the following form:
$$f(x) = \sum_{i=1}^{m}\lambda_i K(x_i, x) + b \tag{8}$$
where $K(x_i, x_j)$ is a kernel function satisfying Mercer's condition [35]. The Gaussian kernel function, the polynomial kernel function, and the sigmoid kernel function are candidates for the kernel function. The Gaussian kernel function, as represented by Equation (9), was used in this study:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \tag{9}$$
where $\sigma$ is the width parameter of the kernel.
Figure 1 illustrates the proposed architecture for forecasting movie sales. Both structured data and unstructured data were collected in this study. The structured data include data from Box Office Mojo and the Internet Movie Database (IMDb), and the unstructured data were gathered from tweets related to the investigated movies. From the Box Office Mojo website (http://www.boxofficemojo.com), ranks, titles, release dates, worldwide box office, distributors, genres, and MPAA (Motion Picture Association of America) ratings were collected, while two variables, runtime and budget, were collected from the IMDb website (http://www.imdb.com/). As some values of runtime and budget could not be obtained from Box Office Mojo during the study period, these two variables were taken from IMDb.
The ranks of movie sales were used to select the top 150 movies in terms of worldwide movie sales from 2010 to 2017. Of these, this study kept the movies released on Fridays in U.S. time, which gave a total of 128 movies. In addition, movie titles and release dates were employed for collecting tweets.
The worldwide box office was treated as the dependent variable. The other five variables, namely, distributors, genres, MPAA ratings, runtime, and budget, served as one set of independent variables. Movie titles were used as keywords to collect tweets for the three days before the films' release dates. Figure 2 shows the time period of tweet collection.
The sentiment scores of tweets, as calculated by SentiStrength [36,37], were the other set of independent variables. Before running SentiStrength, the tweets were preprocessed. Only comments in tweets were of interest; thus, texts with the same contents as advertising texts were deleted. Noisy data, namely dates, user names, forwarding numbers, websites, single quotation marks, semicolons, and symbols, were filtered out. The cleaned tweets were then passed to SentiStrength, and sentiment scores were calculated. SentiStrength assigns positive sentiment scores from 1 to 5 and negative sentiment scores from −1 to −5 to indicate the positive and negative sentiment strengths of the texts. A score of 0 does not exist; thus, the scores 1 and −1 represent neutral sentiments. Table 1 shows the statements of the variables used in this study, where three types of data, namely, data from movie databases, data from tweets, and hybrid data, were employed to predict movie sales. The hybrid data consist of the data from movie databases and the data from tweets.
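The tweet preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual pipeline: the regular expressions, the function names `clean_tweet` and `drop_advertising`, the use of the "RT" marker as a stand-in for forwarding information, and the choice to keep basic sentence punctuation are all hypothetical. The SentiStrength scoring step itself is not reproduced here.

```python
import re

# All patterns are illustrative assumptions about what the "noisy data" look like.
URL_RE = re.compile(r"https?://\S+|www\.\S+")              # websites
MENTION_RE = re.compile(r"@\w+")                           # user names
RETWEET_RE = re.compile(r"\bRT\b")                         # forwarding marker (assumed)
DATE_RE = re.compile(r"\b\d{1,4}[-/]\d{1,2}[-/]\d{1,4}\b")  # dates such as 2010-07-16
SYMBOL_RE = re.compile(r"[^A-Za-z0-9\s.,!?]")              # quotes, semicolons, symbols
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Strip URLs, user names, retweet markers, dates, and symbols from one tweet."""
    for pattern in (URL_RE, MENTION_RE, RETWEET_RE, DATE_RE, SYMBOL_RE):
        text = pattern.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def drop_advertising(tweets, ad_texts):
    """Clean every tweet and drop those whose cleaned text equals an advertising text."""
    ads = {clean_tweet(ad).lower() for ad in ad_texts}
    cleaned = (clean_tweet(t) for t in tweets)
    return [t for t in cleaned if t and t.lower() not in ads]


if __name__ == "__main__":
    raw = [
        "RT @bob: Loved 'Inception'; see http://t.co/x on 2010-07-16 #wow",
        "Watch Inception in theaters now http://t.co/a",
    ]
    print(drop_advertising(raw, ["Watch Inception in theaters now"]))
```

The cleaned strings would then be handed to a sentiment scorer. Exclamation and question marks are retained in this sketch on the assumption that the scorer may use them as intensity cues.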

Numerical Results
A 10-fold cross-validation procedure was conducted in this study to investigate the forecasting performances of the forecasting models. Eight of the ten data subsets contain 13 movies each and the other two contain 12 each, which accounts for all 128 movies. Three other forecasting models, namely, the back propagation neural network [38], the generalized regression neural network [39], and the multivariate linear regression method, were applied to the same data sets in this study. The back propagation neural network has one hidden layer with ten hidden nodes. Genetic algorithms [40] were employed to determine the parameters of the least squares support vector regression, the back propagation neural networks, and the generalized regression neural networks, by using training forecasting errors as the fitness function. The parameters selected by the genetic algorithms were the regularization factor and the width parameter of the Gaussian kernel function for the LSSVR models, the learning rate and the momentum for the BPNN models, and the smoothing parameter for the GRNN models. The parameters are represented by a chromosome of 40 binary genes, the population size is 10, and the crossover and mutation rates are 0.5 and 0.7, respectively.
The results of the predicted movie sales are presented in this section. The mean absolute percentage error (MAPE) and the root mean square error (RMSE), given by Equations (10) and (11), respectively, were used to investigate the forecasting performances of the forecasting models:
$$\text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{A_i - F_i}{A_i}\right| \tag{10}$$
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(A_i - F_i\right)^2} \tag{11}$$
where $n$ is the number of forecasts, $A_i$ is the actual value, and $F_i$ is the forecast value. Figures 3 and 4 illustrate the average MAPE and RMSE values of the 10-fold cross-validation, as provided by the four forecasting models. Table 2 indicates the averages of the MAPE and RMSE of the four forecasting models.
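To make the evaluation procedure concrete, the following sketch fits an LSSVR with a Gaussian kernel and scores it with MAPE and RMSE under k-fold cross-validation. It is an illustration under assumptions rather than the code used in this study: it assumes the standard LSSVR linear system, uses only the Python standard library, replaces the movie data with a toy sine curve, and fixes the parameters `C` and `sigma` by hand instead of selecting them with genetic algorithms. All function names are hypothetical.

```python
import math
import random


def gaussian_kernel(u, v, sigma):
    """K(u, v) = exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def lssvr_fit(X, y, C, sigma):
    """Solve [[0, 1^T], [1, K + I/C]] [b, lam] = [0, y]; return (b, lam)."""
    m = len(X)
    A = [[0.0] + [1.0] * m]
    for i in range(m):
        row = [1.0] + [gaussian_kernel(X[i], X[j], sigma) for j in range(m)]
        row[i + 1] += 1.0 / C  # regularization term on the diagonal
        A.append(row)
    z = solve_linear(A, [0.0] + list(y))
    return z[0], z[1:]


def lssvr_predict(X_train, b, lam, sigma, x):
    """f(x) = sum_i lam_i * K(x_i, x) + b."""
    return b + sum(l * gaussian_kernel(xi, x, sigma) for l, xi in zip(lam, X_train))


def mape(actual, forecast):
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and cut into k folds whose sizes differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    base, extra = divmod(n, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(idx[start:start + size])
        start += size
    return folds


if __name__ == "__main__":
    # 128 samples in 10 folds: eight folds of 13 and two folds of 12.
    print(sorted(len(f) for f in kfold_indices(128, 10)))
    # Toy regression problem standing in for the movie data (assumption).
    X = [[i / 4.0] for i in range(40)]
    y = [5.0 + math.sin(x[0]) for x in X]
    scores = []
    for fold in kfold_indices(len(X), 5):
        held = set(fold)
        train = [i for i in range(len(X)) if i not in held]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        b, lam = lssvr_fit(X_tr, y_tr, C=10.0, sigma=1.0)
        pred = [lssvr_predict(X_tr, b, lam, 1.0, X[i]) for i in fold]
        actual = [y[i] for i in fold]
        scores.append((mape(actual, pred), rmse(actual, pred)))
    print("mean MAPE:", sum(s[0] for s in scores) / len(scores))
    print("mean RMSE:", sum(s[1] for s in scores) / len(scores))
```

In a parameter search such as the genetic algorithm described above, `C` and `sigma` would be the two decoded values of a chromosome, and a forecasting error computed on the training data would serve as the fitness.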

Conclusions
This study proposed a framework using data from Twitter and movie databases to predict movie sales with several forecasting models. Genetic algorithms were employed to determine the parameters of the least squares support vector regression, the back propagation neural network, and the generalized regression neural network. Two single data types, collected from movie databases and tweets, and one hybrid data type, combining movie databases and tweets, were used to examine the influences of various data types on different forecasting models. The numerical results indicated that the LSSVR with GA achieved the best forecasting accuracy among the four forecasting models for the three data types. The superior prediction performance of the LSSVR with GA in forecasting movie sales is most likely due to the use of hybrid data and the forecasting capability of LSSVR models. Thus, using the least squares support vector regression model to forecast movie sales with data from Twitter and movie databases is a feasible and promising alternative in predicting box office performance. The superior performance of the LSSVR with GA in this study can be explained as follows: First, the LSSVR is able to capture the nonlinearity of multivariate regression in forecasting box office performance. Second, the addition of tweets and sentiment analyses [41,42] does improve the forecasting performance of LSSVR models. In this study, using only movie databases resulted in better forecasting performance than using only data from Twitter for all four forecasting models. Moreover, using only movie databases generated more accurate forecasting results than using the hybrid data with the GRNN and MLR models. Thus, this finding indicates that traditional structured data, such as movie databases, should not be underestimated for some models in forecasting movie sales.
However, the limitations of this finding arise from the methods used in this study. Only four models were employed to forecast movie sales with the different data types. A more general conclusion could possibly be reached by applying more forecasting models to the same data sets.
For future work, expanding the collection of both structured and unstructured data to improve forecasting performance may be a possible direction. For structured data, in addition to movie databases, some global economic indicators could be included. For unstructured data, other social media, such as Instagram, Facebook, and the comments on movie trailers on YouTube, could be gathered to forecast box office performance. In addition, how much social media data improve forecasting accuracy is likely to differ from problem to problem. Thus, another possible direction for future study could be to analyze the influences of social media data, structured data, and hybrid data on forecasting accuracy for various problem domains.