Linking textual information in finance reports to the stock return volatility provides a perspective on exploring useful insights for risk management. We introduce different kinds of word vector representations in the modeling of textual information: bag-of-words, pre-trained word embeddings, and domain-specific word embeddings. We apply linear and non-linear methods to establish a text regression model for volatility prediction. A large number of collected annually-published financial reports in the period from 1996 to 2013 is used in the experiments. We demonstrate that the domain-specific word vector learned from data not only captures lexical semantics, but also has better performance than the pre-trained word embeddings and traditional bag-of-words model. Our approach significantly outperforms with smaller prediction error in the regression task and obtains a 4%–10% improvement in the ranking task compared to state-of-the-art methods. These improvements suggest that the textual information may provide measurable effects on long-term volatility forecasting. In addition, we also find that the variations and regulatory changes in reports make older reports less relevant for volatility prediction. Our approach opens a new method of research into information economics and can be applied to a wide range of financial-related applications.
This is an open access article distributed under the Creative Commons Attribution License
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited