You are currently viewing a new version of our website. To view the old version click .
Entropy
  • Article
  • Open Access

6 November 2018

Estimation of Economic Indicator Announced by Government From Social Big Data

,
and
1
Institute of Innovative Research, Tokyo Institute of Technology, 4259, Nagatsuta-cho, Yokohama 226-8502, Japan
2
Sony Computer Science Laboratories, 3-14-13, Higashi-Gotanda, Shinagawa-ku, Tokyo 141-0022, Japan
3
Department of Mathematical and Computing Science, School of Computing, Tokyo Institute of Technology, 4259, Nagatsuta-cho, Yokohama 226-8502, Japan
*
Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Economic Fitness and Complexity

Abstract

We introduce a systematic method to estimate an economic indicator from the Japanese government by analyzing big Japanese blog data. Explanatory variables are monthly word frequencies. We adopt 1352 words in the section of economics and industry of the Nikkei thesaurus for each candidate word to illustrate the economic index. From this large volume of words, our method automatically selects the words which have strong correlation with the economic indicator and resolves some difficulties in statistics such as the spurious correlation and overfitting. As a result, our model reasonably illustrates the real economy index. The announcement of an economic index from government usually has a time lag, while our proposed method can be real time.

1. Introduction

By analyzing social big data such as blogs, twitter, etc., many statistical properties of human behaviors are uncovered scientifically. Typical examples are power law growth and relaxation of social interests around social events such as Christmas [1,2], spreading properties of false rumors and fake news [3,4], relationships between stock prices and social data [5,6], predicting consumer behavior with Web search [7] and forecasting macroeconomic index with Google Trends [8]. It is a big challenge for mathematical scientists to establish data analysis methods and propose mathematical models to describe and predict the phenomena. Currently, much attention is being paid to the topic of finding significant correlations between phenomena in the real world and in cyberspace.
In this paper, we examine the relation between a composite index (CI) and big blog data. The CI that characterizes the Japanese economy is announced by the Japanese government every month. Although many people want to know the latest economic condition quickly, it takes a few months to calculate CI values based on the observed quantities in many fields such as production, inventory, investment, employment, consumption, business management, finance, prices and services. In this paper, we show that the value of CI can be roughly estimated by the appearance of carefully chosen words in cyberspace. As the analysis using data in the cyberspace can be conducted in real time at lowers costs, it is possible that CI could be approximated by applying this new method for quick announcement.
We expect that online textual data such as blogs and Twitter are related to economic indices. Recently, many diverse people have begun to use social media and the data posted on such sites immediately reflect social events and human activities. For example, earthquakes and typhoons can be detected quickly and accurately by using Twitter data [9]. It detected 96% of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or higher and announced more quickly than the announcements that were broadcast by the JMA. In addition, power law growth and relaxation were observed in blog frequency with words such as “April Fool” or “Christmas” around these social events. The future frequency of such words could be predicted by the properties [1].
Nowcasting and forecasting based on social big data are effective because social data in cyberspace reflect public opinion and behavior in physical space. However, there are difficulties because of the large volume of data. For example, in the case of a huge number of candidates, such as 10,000 variables, if we assign the threshold of p-value as 0.01, then approximately 100 variables would remain, even if these candidates followed independent randomness.
We also have to pay attention to the following points in conducting correlation and regression analyses. First, the Pearson correlation coefficient is defined as follows:
ρ = Z x · Z y ,
where Z represents an average of Z and ρ has a value between +1 and −1. If ρ > 0 , the correlation is positive; conversely, if ρ < 0 , the correlation is negative. Z x and Z y are Z-scores defined as:
Z x = x x σ x , Z y = y y σ y ,
which are normalized to an average 0 and standard deviation 1. The Pearson correlation coefficient is a useful tool in describing the relationship between two variables; however, when we apply it, we should be aware of the following assumptions.
  • Two variables follow a linear relationship.
  • Averages and standard deviations are well-defined.
Namely, if two variables do not follow a linear relation, or if variables (x, y) include outliers that significantly affect the correlation coefficient, we cannot trust the coefficient. In addition, in cases where the distributions of x or y follow power law distributions, ( P ( x ) x β , 0 < β 2 ) , the standard deviation tends to diverge when the number of samples increases. Therefore, we need to consider that the Pearson correlation coefficient is meaningless in this context.
Second, we focus on the possibility that a correlation may be spurious. For instance, we assume three variables, e.g., temperature, and the sales of ice cream and soft drinks. Calculating correlations among these values, we probably find positive correlations. We consider when the temperature rises, people tend to buy more ice cream and soft drinks to provide relief from the heat; however, it is not reasonable to assume that the sales of soft drinks are increasing because the sales of ice cream are rising. In this case, we find the positive correlation between sales of ice cream and soft drinks. However, this could happen even if there is no relation among these variables, in which case we call the correlation a spurious correlation. When we detect a strong correlation, we are inclined to consider the existence of a linkage that includes a causal relationship, which often leads to an incorrect result. Therefore, when we deal with multiple variables, we need to introduce a process which excludes spurious correlations.
Third, when we apply regression analysis, we must pay attention to the following points:
  • Overfitting
  • Spurious regression
  • Multicollinearity
Overfitting is the case where we take many variables to reproduce details of training data, however, the model does not have the same performance for the test data, which are not used in the training. If we run regression analysis to non-stationary time series, we often have a large coefficient of determination, which is called a spurious regression [10,11]. Moreover, if more than two explanatory variables are a strongly correlated, the relationships between the explanatory variables are not independent. Therefore, optimal regression coefficients are not uniquely obtained, which is called multicollinearity. In this case, the reliability of the regression coefficients is remarkably reduced.
In cases where we handle multivariate variables with non-normal distribution and non-stationary dynamics, it is important to carefully check their appropriateness for the Pearson correlation coefficient and regression analysis. In the next section, we introduce a method which removes these difficulties and estimates the economic index using big blog data.

3. Discussion

In this section, we summarize our method. First, we systematically found the words from many candidates (1352 words) that had strong correlations with the composite index (CI) coincidence by using Fisher’s exact test which is a non-parametric method and has robustness for outliers. We then applied grouping and round robin (detection of spurious correlation) to reduce the number of candidates which had a strong correlation with other candidates or a spurious correlation; As a result, we obtained 17 representative words. We graphically demonstrated the relationships between the words which helped to obtain an intuitive understanding of the results of the variable selections. We then proposed the regression method for the time series, which widely fluctuated. Our model reasonably reproduced the economic index using seven representative words which are statistically determined by the proposed method. We also showed the daily index by applying the regression results and the optimal moving average method. The announcement of an economic index by the government usually delayed because of the time required to gather and analyze data. The proposed method could reduce this time lag, because it is possible to collect blogs and calculate the index everyday.
The proposed method could be applied not only in estimating economic indicators but also in situations where we find good explanatory values to illustrate an explained value, and it is one of the most important tasks in analyzing big data.

Author Contributions

Conceptualization, K.Y., H.T. and M.T.; Formal Analysis, K.Y.; Methodology, K.Y., H.T. and M.T.; Validation, K.Y., H.T. and M.T.; Investigation, K.Y, H.T. and M.T.; Data Curation, K.Y.; Writing Original Draft Preparation, K.Y.; Writing Review & Editing, K.Y., H.T. and M.T.; Visualization, K.Y.; Supervision, M.T.; Project Administration, M.T.; Funding Acquisition, M.T.

Funding

This work is partly supported by JSTP recursory Research for Embryonic Science and Technology, Grant Number JPMJPR14DA (K.Y.), JSPS Grant-in-Aid for Scientific Research (B), Grant Number JP26310207 (M.T.), and JST Strategic International Collaborative Research Program (SICORP) on the topic of “ICT for a Resilient Society” by Japan and Israel (M.T.), New statistics development project utilizing big data and its analysis technology (New statistics development). 2015METI, Japan (Ministry of Economy, Trade and Industry, Japan.) (M.T.).

Acknowledgments

The authors thank Hottolink Inc. for useful discussions and providing the data, and MUFG Bank, Ltd. for useful comments.

Conflicts of Interest

H.T. is employed by Sony Computer Science Laboratories. Sony Computer Science Laboratories provided support in the form of salary to H.T., but had no role in the design of the study, in the collection, analyses or interpretation of the data, in the writing of the manuscript nor in the decision to publish the results.

References

  1. Sano, Y.; Yamada, K.; Watanabe, H.; Takayasu, H.; Takayasu, M. Empirical analysis of collective human behavior for extraordinary events in the blogosphere. Phys. Rev. E 2013, 87. [Google Scholar] [CrossRef] [PubMed]
  2. Fujiyama, T.; Matsui, C.; Takemura, A. A Power-Law Growth and Decay Model with Autocorrelation for Posting Data to Social Networking Services. PLoS ONE 2016, 11, e0160592. [Google Scholar] [CrossRef] [PubMed]
  3. Takayasu, M.; Sato, K.; Sano, Y.; Yamada, K.; Miura, W.; Takayasu, H. Rumor Diffusion and Convergence during the 3.11 Earthquake: A Twitter Case Study. PLoS ONE 2015, 10, e0121443. [Google Scholar] [CrossRef] [PubMed]
  4. Vosoughi, S.; Roy, D.; Aral, S. The spread of true and false news online. Science 2018, 359, 1146–1151. [Google Scholar] [CrossRef] [PubMed]
  5. Preis, T.; Moat, H.S.; Stanley, H.E. Quantifying Trading Behavior in Financial Markets Using Google Trends. Sci. Rep. 2013, 3. [Google Scholar] [CrossRef] [PubMed]
  6. De Choudhury, M.; Sundaram, H.; John, A.; Seligmann, D.D. Can blog communication dynamics be correlated with stock market activity? In Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, USA, 19–21 June 2008. [Google Scholar]
  7. Goel, S.; Hofman, J.M.; Lahaie, S.; Pennock, D.M.; Watts, D.J. Predicting consumer behavior with Web search. PNAS 2010, 107, 17486–17490. [Google Scholar] [CrossRef] [PubMed]
  8. Choi, H.; Varian, H. Predicting the Present with Google Trends. Econ. Rec. 2012, 88, 2–9. [Google Scholar] [CrossRef]
  9. Sakaki, T.; Okazaki, M.; Matsuo, Y. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. In Proceedings of the 19th International Conference on World Wide Web, New York, NY, USA, 26–30 April 2010. [Google Scholar]
  10. Granger, C.W.J.; Newbold, P. Spurious regressions in econometrics. J. Econ. 1974, 2, 111–120. [Google Scholar] [CrossRef]
  11. Phillips, P. Understanding spurious regressions in econometrics. J. Econ. 1986, 33, 311–340. [Google Scholar] [CrossRef]
  12. Antoniak, C.E. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Ann. Stat. 1974, 2, 1152–1174. [Google Scholar] [CrossRef]
  13. Ohnishi, T.; Mizuno, T.; Aihara, K.; Takayasu, M.; Takayasu, H. Systematic tuning of optimal weighted-moving-average of yen-dollar market data. In Practical Fruits of Econophysics; Springer: Tokyo, Japan, 2006; pp. 62–66. [Google Scholar]

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.