Estimation of Economic Indicator Announced by Government From Social Big Data

We introduce a systematic method to estimate an economic indicator from the Japanese government by analyzing big Japanese blog data. Explanatory variables are monthly word frequencies. We adopt 1352 words in the section of economics and industry of the Nikkei thesaurus for each candidate word to illustrate the economic index. From this large volume of words, our method automatically selects the words which have strong correlation with the economic indicator and resolves some difficulties in statistics such as the spurious correlation and overfitting. As a result, our model reasonably illustrates the real economy index. The announcement of an economic index from government usually has a time lag, while our proposed method can be real time.


Introduction
By analyzing social big data such as blogs, twitter, etc., many statistical properties of human behaviors are uncovered scientifically. Typical examples are power law growth and relaxation of social interests around social events such as Christmas [1,2], spreading properties of false rumors and fake news [3,4], relationships between stock prices and social data [5,6], predicting consumer behavior with Web search [7] and forecasting macroeconomic index with Google Trends [8]. It is a big challenge for mathematical scientists to establish data analysis methods and propose mathematical models to describe and predict the phenomena. Currently, much attention is being paid to the topic of finding significant correlations between phenomena in the real world and in cyberspace.
In this paper, we examine the relation between a composite index (CI) and big blog data. The CI that characterizes the Japanese economy is announced by the Japanese government every month. Although many people want to know the latest economic condition quickly, it takes a few months to calculate CI values based on the observed quantities in many fields such as production, inventory, investment, employment, consumption, business management, finance, prices and services. In this paper, we show that the value of CI can be roughly estimated by the appearance of carefully chosen words in cyberspace. As the analysis using data in the cyberspace can be conducted in real time at lowers costs, it is possible that CI could be approximated by applying this new method for quick announcement.
We expect that online textual data such as blogs and Twitter are related to economic indices. Recently, many diverse people have begun to use social media and the data posted on such sites immediately reflect social events and human activities. For example, earthquakes and typhoons can be detected quickly and accurately by using Twitter data [9]. It detected 96% of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or higher and announced more quickly than the announcements that were broadcast by the JMA. In addition, power law growth and relaxation were observed in blog frequency with words such as "April Fool" or "Christmas" around these social events. The future frequency of such words could be predicted by the properties [1].
Nowcasting and forecasting based on social big data are effective because social data in cyberspace reflect public opinion and behavior in physical space. However, there are difficulties because of the large volume of data. For example, in the case of a huge number of candidates, such as 10,000 variables, if we assign the threshold of p-value as 0.01, then approximately 100 variables would remain, even if these candidates followed independent randomness.
We also have to pay attention to the following points in conducting correlation and regression analyses. First, the Pearson correlation coefficient is defined as follows: where Z represents an average of Z and ρ has a value between +1 and −1. If ρ > 0, the correlation is positive; conversely, if ρ < 0, the correlation is negative. Z x and Z y are Z-scores defined as: which are normalized to an average 0 and standard deviation 1. The Pearson correlation coefficient is a useful tool in describing the relationship between two variables; however, when we apply it, we should be aware of the following assumptions.
• Two variables follow a linear relationship.

•
Averages and standard deviations are well-defined.
Namely, if two variables do not follow a linear relation, or if variables (x, y) include outliers that significantly affect the correlation coefficient, we cannot trust the coefficient. In addition, in cases where the distributions of x or y follow power law distributions, (P(x) ∝ x −β , 0 < β ≤ 2), the standard deviation tends to diverge when the number of samples increases. Therefore, we need to consider that the Pearson correlation coefficient is meaningless in this context. Second, we focus on the possibility that a correlation may be spurious. For instance, we assume three variables, e.g., temperature, and the sales of ice cream and soft drinks. Calculating correlations among these values, we probably find positive correlations. We consider when the temperature rises, people tend to buy more ice cream and soft drinks to provide relief from the heat; however, it is not reasonable to assume that the sales of soft drinks are increasing because the sales of ice cream are rising. In this case, we find the positive correlation between sales of ice cream and soft drinks. However, this could happen even if there is no relation among these variables, in which case we call the correlation a spurious correlation. When we detect a strong correlation, we are inclined to consider the existence of a linkage that includes a causal relationship, which often leads to an incorrect result. Therefore, when we deal with multiple variables, we need to introduce a process which excludes spurious correlations.
Third, when we apply regression analysis, we must pay attention to the following points: Overfitting is the case where we take many variables to reproduce details of training data, however, the model does not have the same performance for the test data, which are not used in the training. If we run regression analysis to non-stationary time series, we often have a large coefficient of determination, which is called a spurious regression [10,11]. Moreover, if more than two explanatory variables are a strongly correlated, the relationships between the explanatory variables are not independent. Therefore, optimal regression coefficients are not uniquely obtained, which is called multicollinearity. In this case, the reliability of the regression coefficients is remarkably reduced.
In cases where we handle multivariate variables with non-normal distribution and non-stationary dynamics, it is important to carefully check their appropriateness for the Pearson correlation coefficient and regression analysis. In the next section, we introduce a method which removes these difficulties and estimates the economic index using big blog data.

Development of a New Economic Index Based on a Comprehensive Search
In this section, we introduce a new method that systematically finds words that are correlated with respect to an economic index, and it reconstructs the index by using a regression formulation with the frequency of selected words.
The method, which resolves all the problems mentioned in the Introduction, consists of the following four steps, as shown in Figure 1: data pre-processing, variable selection, regression analysis and smoothing.
In this analysis, we used three types of data: blogs, dictionary of candidate words and the time series of an economic index. As for blogs, we used "Kuchikomi@kakaricho" (http://kakaricho.jp) to obtain the daily frequencies of words, which covers about 1.8 billion blogs from 1 January 2007 to 31 October 2015. We adopted 1352 words listed in the economics and industry section of the Nikkei thesaurus (http://t21.nikkei.co.jp/public/help/contract/price/20/thesaurus/field_B.html) to illustrate the economic index. In this paper, we focus on the index of business conditions, especially Composite Index (CI) coincidence, which is announced monthly by the Japanese government. The CI provides a quantitative description of the Japanese economy.

Data Pre-Processing
We analyzed the daily frequency of each of the 1352 words from 1 January 2007, which is described by r i (t day ) where i is an index of words and t day is daily time. Here, r i (t day ) is defined as follows: where w i (t day ) is the number of blog entries including the ith key word, and W(t day ) is the total number of blogs on t day . Namely, r i (t day ) is a value by normalizing w i (t day ) with W(t day ) to remove the effects such as trend of the entire blog and the day of the week effect because people tend to write blogs on the weekend. We also calculate monthly average frequency as, to match with the economic index intervals which is announced monthly. Here, m is the number of days in one month: 28, 29, 30 or 31. We excluded the words with 0 in frequency over 10 months to remove very low frequency words. Thus, 1127 candidate words remained.

Extraction of Words by One-Body Correlation
First, we introduced an algorithm that chooses the words with similar patterns and trends in the time series of frequency. Specifically, as shown in Figure 2, we gave a "+" sign if the value is greater than the median of the time series, while we used a "−" sign if the value is smaller. There were four patterns with respect to both values of CI and word frequency, which are greater or smaller than the median: ( This definition is the same as that of probability in Fisher's exact test. In this analysis, we imposed the restriction of dividing the median, and therefore the probability p does not have exactly the same meaning as the original p in Fisher's exact test. In this analysis, we considered that, if p < 10 −3 , the variables are regarded as sufficiently correlated. As a result, we found 416 words that have small p-values that show strong correlations with CI. The Top 4 words with the smallest p-values are shown in Table 1.

Grouping
The words selected in the previous method are likely to have very similar patterns and trends. If we apply a regression analysis using all the words, we have the multicollinearity problem and overfitting problem mentioned in the Introduction. To solve these problems, we categorized the words into some groups which have similar patterns and trends.
First, we assigned a number (1, 2, 3, · · · ) in ascending order of the no correlation index, p (in descending order of having strong correlation). Next, we calculated the value p defined in Equation (5) for ith and jth words (i < j) as well as shown in the extraction of words by one-body correlation, and, if the value p was smaller than the threshold value (10 −6 ), the time series of frequencies of two words was judged as very similar and the jth word was classified into the ith word's group. By applying this algorithm to pairs in descending order, i.e., 1st-2nd, 1st-3rd, · · · , 2nd-3rd, 2nd-4th, etc., we formed a group of words. Figure 3 shows an example of the grouping results. We can observe the time series of pairs: for example, "hukeiki'' (recession) and "hukyou taisaku'' (economic recovery policy), and "gaikoku ginkou'' (foreign banks) and "shihon shizyou" (capital market) are comparable. After the grouping process, we classified the 416 words into 43 groups whose representative words have the lowest p in the group. :

Round Robin (Detection of Spurious Correlation)
Based on Bayesian analysis for Dirichlet distribution [12], we defined a probability in which both frequency of ith words and CI have a larger value than the median, where a = (+, +) and b = (+, −), as shown in Section 2.2.1, and r is the probability that CI has a larger value than its median estimated from whole data: where c = (−, +) and d = (−, −). We also define corresponding standard deviation as In the cases where the frequency of the ith word has a lower probability than the median, we defined the following values as well as the frequency that has a larger value than the median: To introduce conditional probabilities with respect to the jth word, we defined the following quantities, a = (+, +), b = (+, −), c = (−, +), d = (−, −) as the numbers that CI and the frequency of the ith word take a greater value than each median under the condition where the jth word has a bigger value compared to its median in the same month. The probability and standard deviation in the condition are defined below: In addition to a , b , c and d , we defined a = (+, +), b = (+, −), c = (−, +), d = (−, −), which describe the numbers that both CI and the frequency of the jth word have a larger value than each median in the condition where the ith word has a bigger value compared to its median. We also defined q i|j (a , b ), q¯i |j (a , b ), σ i|j (a , b ), σ¯i |j (a , b ) in the same way as Equations (11)-(14).
Using these values, we defined the effectiveness of the ith word to illustrate CI, k i|j , in a condition where the frequency of the jth word is bigger than the median: Similarly, we defined the following quantity with the condition that the frequency of the jth word is smaller than its median: If these values, which were conditioned by the jth word, are small, the correlation between the ith word and CI is likely to be spurious correlation caused by the jth word. We introduced the following quantity, which characterizes the strength of the correlation between the ith word and CI taking into account the condition of the jth word.
Some round robin results of P(i : j) are shown in Table 2. If P(j : i) is smaller than P(i : j), we considered that the jth word has a spurious correlation caused by the ith word, because the correlation with a condition by the ith word against the jth word is smaller than a condition by the jth word against ith word. To evaluate the asymmetry, we defined an asymmetrical index: If D(i : j) has a larger value than a given threshold (−0.6 in this analysis), we judged that the ith word has a spurious correlation caused by the jth word. For example, regarding the pair "keiki taisaku" (economic measures) and "hukeiki" (recession), D("keiki taisaku" (economic measures) : "hukeiki" (recession)) = 7.6 − 31.7 7.6 + 31.7 = −0.61.
We regarded "keiki taisaku" (economic measures) as having spurious correlation against "hukeiki" (recession). Therefore, "keiki taisaku" (economic measures) was classified into the group of words which is represented by the word, "hukeiki" (recession). After classifying the words, we computed the following sum of P(i : j) In the regression analysis, we adopted, as an explanatory variable, the descending order of S(i).

Visualization of Relationships Between Variables
To intuitively understand the results obtained in the previous steps, such as the extraction of words by the one-body correlation, grouping and round robin (detection of a spurious correlation), we show the relationships between words in Figure 4. In this figure, the solid line with both sides arrow indicates relationships with a strong correlation between CI and word frequency. These words passed the judgments: grouping and detection of spurious correlation. There are 17 representative words and, in this figure, we show the Top 7 words. The solid line without an arrow describes the words which are grouped in the representative words, and the broken line with an arrow indicates a spurious correlation.

Regression Analysis
We designed a model that reconstructs CI by conducting a regression analysis with frequency of words such as "hukeiki" (recession), "gaikoku ginkou" (foreign banks), etc., which were selected in the previous analysis. We applied this model to estimate the daily CI based on the daily frequency of words.
We performed a linear regression analysis using CI as a dependent variable and the frequency of words as explanatory variables: where c 0 is a constant and m means the number of representative words. We used top m words with respect to the order of S(i) in Equation (20). We then obtained coefficients c k with the minimum square error: Figure 5 depicts the regression results using one, three and seven variables. In the case of using only one variable, "hukeiki" (recession), big jumps in 2008 and 2009 were reproduced, and, by adding other variables to our model, more details were reproduced. The regression coefficients in the case of 7 words are shown in Table 3. Besides the results of in-sample data, our model yielded reasonable results for the out-sample data from January to October 2015.  In the above variable selection analysis, we obtained 17 representative words from 1352 words. We then introduced the method that allowed us to systematically decide the appropriate number of explanatory variables. As discussed in the Introduction, we applied the regression analysis to non-stationary time series CI. To check for the spurious regression problem, we proposed a hypothesis testing which compared the frequency of words and an artificial time series which follows the random walk model. A random walk time series was generated by randomly sampled numbers from the same distribution with real time series. For example, an absolute value of differences (step size) of the first random walk was randomly chosen from the absolute value of the differences in "hukeiki" (recession) and we then gave positive or negative signs randomly. The kth random walks were produced using the absolute value of differences of the top kth word as well as the first one. Figure 6 shows the results of fitting CI by seven random walks. In Figure 6a, the fitting did not work well and R 2 is low, where R 2 is defined as follows: Here, N is the number of data. In Figure 6b, the random walk regression reconstructed the properties of CI with R 2 = 0.83. These result indicate that random walk sometimes reproduce non-stationary time series. Figure 6. Regression results of CI by seven random walks. In the case of (a), R 2 = 0.35, and in the case of (b), R 2 = 0.83. Figure 7 shows the probability density function of R 2 for 10,000 samples by the regression of seven random walks. The red line in this figure shows the result in Figure 5c. The results in Figure 5c had a greater R 2 value compared to the random walks in most cases, and the probability of the random walks model having a higher R 2 was approximately 3%. Using the distribution of R 2 by the random walk regression, we performed a statistical test. The null hypothesis is the following: "The random walks regression is more effective than the regression by the time series of selected words with respect to R 2 ". Figure 8 shows the relationships between the coefficient of determination (R 2 ), the number of explanatory variables in the proposed model, and the random walk model with the Top 5 and 10 percentiles. The values of the coefficient of determination increases with respect to the number of explanatory variables. In this case, the R 2 of the proposed method was higher than the case of the Top 5 percentile of the random walk model until seven variables. This means that the null hypothesis was rejected with a significant level of 5% and proposed model was better than the random walk model statistically. However, in the case of more than eight variables, the null hypothesis with a significance level of 5% was not rejected. Therefore, we considered the proposed method with seven variables appropriate. The number of explanatory variables is determined as where R 2 k (proposed method) shows the coefficient of determination of the proposed method with the top k explanatory variables and R 2 k (RW 5%) is the coefficient of determination of the random walk model in the top 5% percentile.

Daily Index and Smoothing
The daily index I(t day ) can be calculated by the regression coefficients and daily frequency of words (r k (t day )), I(t day ), as shown in Figure 9a, suddenly increased in response to the news related to the words. To reduce this effect, we applied the optimal moving average [13]. The optimal moving average (Ī(t)) of I(t day ) is calculated as follows: where n = 11 and the coefficients (w k ) are shown in Figure 9b. It has a large weight in a small k and it has a small peak at k = 7. We considered it a weekly period, and it almost converged to 0 around k = 11. Comparing I(t day ) andĪ(t day ) (Figure 9a), we found that sudden peaks are reduced in the case ofĪ(t day ), and it fluctuates around the monthly economic index (CI).

Discussion
In this section, we summarize our method. First, we systematically found the words from many candidates (1352 words) that had strong correlations with the composite index (CI) coincidence by using Fisher's exact test which is a non-parametric method and has robustness for outliers. We then applied grouping and round robin (detection of spurious correlation) to reduce the number of candidates which had a strong correlation with other candidates or a spurious correlation; As a result, we obtained 17 representative words. We graphically demonstrated the relationships between the words which helped to obtain an intuitive understanding of the results of the variable selections. We then proposed the regression method for the time series, which widely fluctuated. Our model reasonably reproduced the economic index using seven representative words which are statistically determined by the proposed method. We also showed the daily index by applying the regression results and the optimal moving average method. The announcement of an economic index by the government usually delayed because of the time required to gather and analyze data. The proposed method could reduce this time lag, because it is possible to collect blogs and calculate the index everyday.
The proposed method could be applied not only in estimating economic indicators but also in situations where we find good explanatory values to illustrate an explained value, and it is one of the most important tasks in analyzing big data.