1. Introduction
In this paper, the estimation of historical volatility is considered for financial time series generated by stock prices and indexes. This estimation is a necessary step for volatility forecasting, which is crucial for the pricing of financial derivatives and for optimal portfolio selection. The methods of estimating and forecasting volatility have been studied intensively (see, e.g., the references in Andersen and Bollerslev (1997), De Stefani et al. (2017), and Dokuchaev (2014)).
In the pricing of derivatives, option traders use volatility as the input for determining the value of an option under models such as the Black–Scholes (Black and Scholes 1973) and Heston (1993) option pricing models. Hence, being able to forecast the direction and magnitude of future volatility over different time horizons provides advantages in pricing risks and developing trading strategies.
There is an enormous body of research on modelling and forecasting volatility. Engle (1982) and Bollerslev (1986) first proposed the ARCH and GARCH models for forecasting volatility. These models have been extended in a number of directions based on the empirical evidence that the volatility process is non-linear, asymmetric, and has long memory. Such extensions include EGARCH (Nelson 1991), GJR-GARCH (Glosten et al. 1993), AGARCH (Engle 1990), and TGARCH (Zakoian 1994). However, studies have found that these models cannot adequately describe whole-day volatility information because they were developed for low-frequency time series.
With the appearance of high-frequency data, Andersen et al. (2003) introduced a new volatility measure known as realised volatility (RV). In comparison with GARCH-type measures, realised volatility is preferred as it is model-free and hence convenient to calculate. In addition, realised volatility takes high-frequency data into consideration and exhibits the long memory property. Many forecasting models have been developed to predict realised volatility; a notable example is the heterogeneous autoregressive model for realised volatility (HAR) of Corsi (2003). The HAR-RV model was developed in accordance with the heterogeneous market hypothesis proposed by Muller et al. (1997) and the long memory character of realised volatility documented by Andersen et al. (2003). Empirical studies have shown that the HAR model has high forecasting performance for future volatility, especially for out-of-sample data over different time horizons (Corsi 2003; Khan 2011).
Another commonly used volatility measure is the implied volatility. The implied volatility is derived from observed market option prices and is regarded as a fear gauge (Whaley 2000). It fluctuates with stock movement, strike price, interest rate, time-to-maturity, and option price. To reduce the impact of stock price movements, a so-called “purified” implied volatility was introduced in Luong and Dokuchaev (2014). In the present paper, we show that this volatility measure contains some information about future volatility.
To produce rules for predicting classes and for the regression of outcome variables, classification and regression tree models and other machine learning techniques have been developed in the literature (see the references in De Stefani et al. (2017)). This paper explores the related random forests algorithm to improve the forecasting of realised volatility in a machine learning setting. The algorithm is constructed to predict both the direction and the magnitude of realised volatility, based on the HAR model framework with the inclusion of the purified implied volatility.
The paper is structured as follows. In Section 2, we provide the background of the volatility measures, the classical HAR model, and the random forests algorithm. We then discuss our proposed model and methodology and their results in Section 3. Section 4 provides a discussion of the study, and we conclude in Section 5.
2. Materials and Methods
2.1. Random Forests Algorithm
Breiman (2001) introduced the random forests (RF) algorithm as an ensemble approach that can also be thought of as a form of nearest-neighbour predictor. A random forest builds on a standard machine learning technique called the “decision tree”. We provide a brief summary of this algorithm in this section.
2.1.1. Decision Trees
The decision trees algorithm is an approach that uses a set of binary rules to calculate a target class or value. Different from predictors like linear or polynomial regression where a single predictive formula is supposed to hold over the entire data space, decision trees aim to sub-divide the data into multiple partitions using a recursive method, and then fit simple models to each cell of the partition. Each decision tree has three levels:
Root nodes: entry points to a collection of data;
Inner nodes: a set of binary questions where each child node is available for every possible answer;
Leaf nodes: represent the decision to take when reached.
For example, in order to predict a response or class $Y$ from inputs $X_1, X_2, \dots, X_p$, a binary tree is constructed based on the information from each input. At each internal node of the tree, a test on one of the inputs is run against a given criterion, with logical outcomes TRUE or FALSE. Depending on the outcome, the decision passes to the next sub-branch corresponding to the TRUE or FALSE response. Eventually, a final prediction is obtained at a leaf node; this prediction aggregates or averages all of the training data points that reach that leaf. Figure 1 illustrates the binary tree concept.
Algorithm 1 describes how a decision tree can be constructed using CART (Breiman et al. 1984). This algorithm is computationally simple and quick to fit to the data. In addition, as it is nonparametric, no formal distributional assumptions are required. However, one of the main disadvantages of tree-based models is that they exhibit instability and high variance: a small change in the data can result in a very different series of splits, or in over-fitting. To overcome this major issue, we use an alternative ensemble approach known as the random forests algorithm.
Algorithm 1: Classification And Regression Trees (CART) algorithm for building decision trees.
1: Let N be the root node with all available data.
2: Find the feature F and threshold value T that split the samples assigned to N into two subsets, so as to maximise the label purity within these subsets.
3: Assign the pair (F, T) to N.
4: If a subset is too small to be split, attach a child leaf node to N and assign to the leaf the most frequent label in that subset. If a subset is large enough to be split, attach a child node to N and assign the subset to it.
5: Repeat steps 2–4 for the new nodes until the new subsets can no longer be split.
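As a concrete illustration of step 2 of Algorithm 1, the sketch below (with illustrative function names and toy data) scans every feature and threshold and returns the split that minimises the weighted Gini impurity, one common purity criterion used in CART:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels (0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Step 2 of Algorithm 1: scan every feature F and threshold T and
    return the pair that minimises the weighted Gini impurity of the two
    resulting subsets (equivalently, maximises label purity)."""
    best = (None, None, float("inf"))
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best

# Toy data, perfectly separable on feature 0 at threshold 1.
X = [[0, 5], [1, 3], [2, 4], [3, 1]]
y = ["DOWN", "DOWN", "UP", "UP"]
feature, threshold, impurity = best_split(X, y)
```

Here the selected split is on feature 0 at threshold 1, which produces two pure subsets (zero impurity); CART would then recurse on each side.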
2.1.2. Random Forests
A random forest can be considered to be a collection or ensemble of simple decision trees that are selected randomly. It belongs to the class of so-called bootstrap aggregation or bagging techniques, which aim to reduce the variance in an estimated prediction function. In particular, a number of decision trees are constructed, and the random forest will either “vote” for the best decision (classification problems) or “average” the predicted values (regression problems). Here, each tree in the collection is formed by firstly selecting at random, at each node, a small group of input coordinates (also called features or variables hereafter) to split on, and secondly by calculating the best split based on these features in the training set. The tree is grown using the CART algorithm to maximum size, without pruning. The use of random forests can lead to significant improvements in prediction accuracy (i.e., better ability to predict new data cases) in comparison with a single decision tree, as discussed in the previous section. Algorithm 2 from Breiman (2001) details how a random forest can be constructed.
For m smaller than the total number of predictor variables, the algorithm uses random splitter selection. m can also be set to the total number of predictor variables, which is known as Breiman’s bagger parameter (Breiman 2001). In this paper, we set m equal to the maximum number of variables of interest used in the proposed model.
Applications of the random forests algorithm can be found in machine learning, pattern recognition, bioinformatics, and big data modelling. Recently, a number of studies in finance have applied the random forests algorithm to the forecasting of stock prices and the development of investment strategies; see Theofilatos et al. (2012) and Qin et al. (2013). Here, we introduce an application of the random forests algorithm to the forecasting of realised volatility.
Algorithm 2: Random forests
1: Draw a number of bootstrap samples from the original data, one for each tree to be grown.
2: Sample N cases at random with replacement to create a subset of the data. The subset is then split into in-bag and out-of-bag samples at a selected ratio (e.g., 7:3).
3: At each node, for a preselected number m, m predictor variables are chosen at random from all the predictor variables.
4: The predictor variable that provides the best split, according to some objective function, is used to build a binary split at that node.
5: At the next node, choose another m variables at random from all predictor variables.
6: Repeat steps 3–5 until all nodes are grown.
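A minimal sketch of Algorithm 2 is given below. For brevity, each tree is grown only to depth 1 (a decision stump) rather than to full size, and the function names and toy data are illustrative; the bootstrap sampling (steps 1–2), the random choice of m candidate features (step 3), and the majority vote are as described above:

```python
import random
from collections import Counter

def train_stump(X, y, m, rng):
    """One tree of the forest, kept at depth 1 for brevity: choose m
    candidate features at random (step 3), then keep the threshold
    split among them with the fewest training errors (step 4)."""
    features = rng.sample(range(len(X[0])), m)
    best = None
    for f in features:
        for t in {row[f] for row in X}:
            left = [y[i] for i, r in enumerate(X) if r[f] <= t]
            right = [y[i] for i, r in enumerate(X) if r[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = (len(left) - left.count(l_lab)) + (len(right) - right.count(r_lab))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # degenerate bootstrap sample: predict the majority label
        lab = Counter(y).most_common(1)[0][0]
        return lambda row: lab
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab

def random_forest(X, y, n_trees=25, m=1, seed=0):
    """Steps 1-2: one bootstrap sample (drawn with replacement) per tree;
    the forest then predicts by majority vote over the trees."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        trees.append(train_stump([X[i] for i in idx], [y[i] for i in idx], m, rng))
    return lambda row: Counter(tree(row) for tree in trees).most_common(1)[0][0]

# Toy data: the two classes are separable on either feature.
X = [[0, 9], [1, 8], [8, 1], [9, 0]]
y = ["DOWN", "DOWN", "UP", "UP"]
predict = random_forest(X, y)
```

Because each tree sees a different bootstrap sample and a different random feature subset, individual trees disagree, but the majority vote is stable; this is the variance reduction that bagging provides.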
2.2. Volatility Measures
Volatility, often measured by the standard deviation or variance of returns from a financial security or market index, is an important component of asset allocation, risk management, and pricing derivatives. In this section, we discuss the two measures of volatility known as the realised volatility and the purified implied volatility.
2.2.1. Realised Volatility
The realised volatility measure was proposed by Andersen et al. (2003) based on the use of high-frequency data.
Let $S_t$ denote the asset price, observed at equally-spaced discrete points within a given time interval $[0, T]$. We assume that the logarithmic price $p_t = \log S_t$ follows the Ito equation
$$ dp_t = \mu_t \, dt + \sigma_t \, dW_t, $$
where $W_t$ is a standard Brownian process, and $\mu_t$ and $\sigma_t$ are predictable processes, with $\sigma_t$ being the standard deviation of the return and independent of $W_t$. Therefore, the processes $\mu_t$ and $\sigma_t$ represent the instantaneous conditional mean and volatility of the return. Hence, the variance of the return over day $t$ is the integrated variance
$$ IV_t = \int_{t-1}^{t} \sigma_s^2 \, ds. $$
Following this result, let us assume that the time interval is observed evenly at $\Delta$ steps in discrete time. The realised volatility (RV) of day $t$ can be estimated by
$$ RV_t = \sum_{j=1}^{M} r_{t,j}^2, $$
where $r_{t,j} = p_{t-1+j\Delta} - p_{t-1+(j-1)\Delta}$ is the $j$-th intraday return and $M$ is the number of observations within that time interval.
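For illustration, the RV estimator can be sketched as follows; the simulated five-minute returns and the chosen values of M and sigma are assumptions for the example only:

```python
import math
import random

def realised_volatility(intraday_log_returns):
    """RV of one day: the sum of squared intraday log-returns."""
    return sum(r * r for r in intraday_log_returns)

# One simulated day of M = 78 five-minute log-returns (a 6.5-hour session)
# with a constant 1% daily volatility; both numbers are assumptions.
rng = random.Random(42)
M = 78
sigma_daily = 0.01
sigma_step = sigma_daily / math.sqrt(M)  # per-interval standard deviation
returns = [rng.gauss(0.0, sigma_step) for _ in range(M)]
rv = realised_volatility(returns)
# sqrt(rv) should be close to the true daily volatility of 1%.
```

With constant volatility, the sum of squared intraday returns converges to the integrated variance as the sampling interval shrinks, which is why RV serves as a model-free proxy.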
2.2.2. The Purified Implied Volatility
The implied volatility is often known as the ex-ante measure of volatility, and is derived either from the Black–Scholes option pricing model of Black and Scholes (1973) (model-based estimation) or from the options market price formula of Carr and Wu (2006) (model-free estimation). Such measures depend on several inputs, such as time-to-expiration, stock price, exercise price, risk-free rate of interest, and the observed call/put price. Hence, the implied volatility varies with the fluctuations of these inputs. In order to reduce the impact of stock price movements, the purified implied volatility (PV) was introduced in Luong and Dokuchaev (2014). The purified implied volatility is derived from the Black–Scholes option pricing model, where the market option prices are replaced by artificial option prices that reduce the impact of the market price on the observed option prices. That paper also shows that the purified implied volatility contains information about the traditional volatility measure (i.e., the standard deviation of low-frequency daily returns). In this paper, we include the purified implied volatility as an extended variable of the HAR model.
2.3. Models for Volatility
2.3.1. Heterogeneous Autoregressive Model for Realised Volatility
It is noted that the definition of realised volatility involves two time parameters: (1) the intraday return interval $\Delta$ and (2) the aggregation period of one day. In the heterogeneous autoregressive model of realised volatility of Corsi (2003), the latent realised volatility is viewed over time horizons longer than one day. The $n$-day historical realised volatility at time $t$, denoted $RV_t^{(n)}$, is estimated as the average of the daily realised volatilities between $t-n+1$ and $t$,
$$ RV_t^{(n)} = \frac{1}{n} \left( RV_t + RV_{t-1} + \dots + RV_{t-n+1} \right). $$
The daily HAR model is expressed by
$$ RV_{t+1} = \beta_0 + \beta_d RV_t + \beta_w RV_t^{(5)} + \beta_m RV_t^{(22)} + \epsilon_{t+1}, $$
where $RV_t^{(5)}$ ($n = 5$ days) and $RV_t^{(22)}$ ($n = 22$ days) represent the average realised volatility of the last 5 days and 22 days, respectively. The HAR model can be extended by including the jump component proposed by Barndorff-Nielsen and Shephard (2001), such that
$$ J_t = \max\left( RV_t - BPV_t, 0 \right), $$
where $BPV_t$ is the realised bi-power variation of Barndorff-Nielsen and Shephard (2004). Hence, the general form of the model is
$$ RV_{t+1} = \beta_0 + \beta_d RV_t + \beta_w RV_t^{(5)} + \beta_m RV_t^{(22)} + \beta_j J_t + \epsilon_{t+1}. $$
Most recently, the heterogeneous structure was extended with the inclusion of the leverage effect observed by Black (1976), i.e., the asymmetry in the relationship between returns and volatility, by Corsi and Reno (2009). For a given period of time, the leverage level at time $t$ is measured by the average aggregated negative and positive returns during that period,
$$ r_t^{-} = \frac{1}{M} \sum_{j=1}^{M} r_{t,j} \, \mathbb{1}\{ r_{t,j} < 0 \}, \qquad r_t^{+} = \frac{1}{M} \sum_{j=1}^{M} r_{t,j} \, \mathbb{1}\{ r_{t,j} \ge 0 \}, $$
with $M$ being the number of observations between $t-1$ and $t$, and $\Delta$ the time step. Therefore, one would include the leverage effect as a predictor for the realised volatility over the next $k$ days as follows:
$$ RV_{t+k} = \beta_0 + \beta_d RV_t + \beta_w RV_t^{(5)} + \beta_m RV_t^{(22)} + \beta_j J_t + \gamma \, r_t^{-} + \epsilon_{t+k}. \qquad (7) $$
Often, the coefficients are obtained by using Ordinary Least Squares (OLS) estimation for linear regression models.
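A minimal sketch of fitting the daily HAR model by OLS is shown below, on a simulated persistent volatility series; the series, window indices, and coefficient names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated persistent, positive daily realised-volatility series
# (an illustrative stand-in for an observed RV series).
T = 500
rv = np.empty(T)
rv[0] = 1.0
for t in range(1, T):
    rv[t] = abs(0.1 + 0.8 * rv[t - 1] + 0.1 * rng.standard_normal())

def har_features(rv, t):
    """Regressors at time t: intercept, daily RV, 5-day and 22-day averages."""
    return [1.0, rv[t], rv[t - 4:t + 1].mean(), rv[t - 21:t + 1].mean()]

# RV_{t+1} = b0 + bd RV_t + bw RV^(5)_t + bm RV^(22)_t + error, fitted by OLS.
X = np.array([har_features(rv, t) for t in range(21, T - 1)])
y = rv[22:T]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta
```

The overlapping 5-day and 22-day averages let one linear regression capture the volatility cascade over heterogeneous horizons, which is the core idea of the HAR specification.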
2.4. The Modified HAR Model for Realised Volatility and Forecasting the Direction
We define two states of the world outcome for the volatility direction, “UP” and “DOWN”. Let $D_t$ denote the direction of the realised volatility observed at time $t$, such that
$$ D_t = \begin{cases} \text{UP} & \text{if } RV_t \ge RV_{t-1}, \\ \text{DOWN} & \text{otherwise.} \end{cases} $$
In order to forecast the direction of realised volatility, a set of predictors (or technical indicators) is used, derived from the historical price movement of the underlying asset and its realised volatility. Since all available historical information is used, $D_t$ does not follow a Markov chain. We investigated a number of indicators, and through the feature selection process (using the variable importance ranking from the random forests algorithm) we found that the following indicators were best for forecasting the realised volatility’s direction.
The Average True Range (ATR): The ATR is an indicator that measures volatility by using the high–low range of the daily prices. The ATR is based on $n$ periods and can be calculated on an intraday, daily, weekly, or monthly basis. It is noted that the ATR is often used as a proxy for volatility. To estimate the ATR, we first compute the “true range” (TR),
$$ TR_t = \max\left\{ H_t - L_t,\ |H_t - C_{t-1}|,\ |L_t - C_{t-1}| \right\}, $$
where $H_t$, $L_t$, and $C_{t-1}$ are the current highest return, the current lowest return, and the previous last return of a selected period, respectively, with absolute values ensuring that $TR_t$ is always positive. Hence, the average true range within $n$ days is
$$ ATR_t(n) = \frac{1}{n} \sum_{i=0}^{n-1} TR_{t-i}. $$
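The TR and ATR definitions above can be sketched as follows; the price data are illustrative, and a plain n-day average is used to match the formula above:

```python
def true_range(high, low, prev_close):
    """TR_t = max(H_t - L_t, |H_t - C_{t-1}|, |L_t - C_{t-1}|)."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))

def atr(highs, lows, closes, n):
    """Plain n-period average of the true range, matching the n-day
    average above (Wilder's original ATR smooths exponentially instead)."""
    trs = [true_range(highs[i], lows[i], closes[i - 1])
           for i in range(1, len(highs))]
    return sum(trs[-n:]) / n

# Illustrative daily highs, lows, and closes.
highs = [10.0, 11.0, 12.0, 11.0, 13.0]
lows = [9.0, 10.0, 10.0, 10.0, 11.0]
closes = [9.5, 10.5, 11.0, 10.5, 12.0]
value = atr(highs, lows, closes, n=3)
```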
Close Relative To Daily Range (CRTDR): The location of the last return within the day’s range is a powerful predictor of the next returns. Here, the CRTDR is estimated by
$$ CRTDR_t = \frac{C_t - L_t}{H_t - L_t}, $$
where $H_t$, $L_t$, and $C_t$ are the high, low, and close returns at time $t$ for a selected time period using high-frequency returns.
Exponential Moving Average of realised volatility (EMARV): Exponential moving averages reduce the lag effect in time series by applying more weight to recent prices. The weighting applied to the most recent price depends on the number of periods ($n$) in the moving average and the weighting multiplier ($\alpha = 2/(n+1)$). The formula for the EMARV over $n$ periods is as follows:
$$ EMARV_t = \alpha \, RV_t + (1 - \alpha) \, EMARV_{t-1}. $$
Moving average convergence/divergence oscillator (MACD) measure of realised volatility: The MACD is one of the simplest and most effective momentum indicators. It turns two moving averages into a momentum oscillator by subtracting the longer moving average ($m$ days) from the shorter moving average ($n$ days). The MACD fluctuates above and below the zero line as the moving averages converge, cross, and diverge. We estimate the MACD for realised volatility as
$$ MACD_t = EMARV_t(n) - EMARV_t(m), \qquad n < m. $$
Relative Strength Index for realised volatility (RSIRV): This is also a momentum oscillator that measures the speed and change of volatility movements. We define the RSIRV as
$$ RSIRV_t = 100 - \frac{100}{1 + \overline{U}_t / \overline{D}_t}, $$
where $\overline{U}_t$ is the average increase in volatility and $\overline{D}_t$ is the average decrease in volatility within $n$ days.
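The EMARV, MACD, and RSIRV definitions can be sketched together as follows; the window lengths and the toy RV series are illustrative assumptions:

```python
def emarv(rv, n):
    """Exponential moving average of RV with multiplier alpha = 2/(n+1),
    seeded with the first observation."""
    alpha = 2.0 / (n + 1)
    out = [rv[0]]
    for v in rv[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def macd_rv(rv, n=12, m=26):
    """MACD of RV: the shorter EMA (n days) minus the longer EMA (m days)."""
    return [s - l for s, l in zip(emarv(rv, n), emarv(rv, m))]

def rsi_rv(rv, n=14):
    """RSIRV at the last observation: 100 - 100 / (1 + U/D), with U and D
    the average increases and decreases in RV over the last n changes."""
    changes = [rv[i] - rv[i - 1] for i in range(len(rv) - n, len(rv))]
    avg_up = sum(c for c in changes if c > 0) / n
    avg_down = sum(-c for c in changes if c < 0) / n
    return 100.0 if avg_down == 0 else 100.0 - 100.0 / (1.0 + avg_up / avg_down)

# A toy upward-trending RV series.
rv = [1.0, 1.1, 1.05, 1.2, 1.15, 1.3, 1.25, 1.4, 1.35, 1.5,
      1.45, 1.6, 1.55, 1.7, 1.65, 1.8]
```

On a rising volatility series, the MACD of RV is positive (the short EMA sits above the long one) and the RSIRV exceeds 50, which is the kind of directional signal the classifier exploits.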
The steps that we take to forecast the volatility direction are listed in Algorithm 3.
Algorithm 3: Forecasting the direction of realised volatility
1: Obtain the direction of the realised volatility.
2: Compute the above technical indicators for each observation.
3: Split the data into a training set and a testing set.
4: Apply the random forests algorithm to the training set to develop the pattern solution of the realised volatility using the above indicators.
5: Use the solution from Step 4 to predict the direction of the testing set.
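Steps 1 and 3 of Algorithm 3 can be sketched as follows (steps 2, 4, and 5 would plug in the indicators above and a random forests fit); the RV series, the tie-handling convention, and the single placeholder feature are illustrative assumptions:

```python
def directions(rv):
    """Step 1: label each day UP if RV did not fall from the previous day,
    otherwise DOWN (the tie-handling convention is an assumption here)."""
    return ["UP" if rv[t] >= rv[t - 1] else "DOWN" for t in range(1, len(rv))]

def train_test_split(rows, labels, train_ratio=0.7):
    """Step 3: chronological split, keeping the test set strictly after
    the training set to avoid look-ahead."""
    cut = int(len(rows) * train_ratio)
    return rows[:cut], labels[:cut], rows[cut:], labels[cut:]

# Illustrative RV series and a single placeholder feature (lagged RV);
# in the paper the features are the technical indicators above.
rv = [1.0, 1.2, 1.1, 1.3, 1.25, 1.4, 1.5, 1.45, 1.6, 1.55, 1.7]
labels = directions(rv)
features = [[rv[t - 1]] for t in range(1, len(rv))]
X_tr, y_tr, X_te, y_te = train_test_split(features, labels)
```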
Figure 2 demonstrates a possible decision tree built for forecasting the direction of realised volatility using the above steps. In this example, node #4 is reached when RSI-RV(5) and TR(10) satisfy the split conditions on the corresponding branches, with 19% of the in-sample data falling into this category and 91% of these observations being classified as “DOWN”. Likewise, node #27 is reached when RSI-RV(5) and the other indicators shown satisfy their respective split conditions. In random forests, we construct similar trees, but with different structures, to classify the direction of the realised volatility based on the information from the other predictors.
Let $\hat{D}_t$ denote the predicted direction of the realised volatility at time $t$ obtained using Algorithm 3.
2.5. Forecasting the Realised Volatility—The Proposed Model
To forecast the realised volatility, we consider the heterogeneous autoregressive model discussed in Section 2.3.1. We further include the purified implied volatility and the predicted direction of the future volatility as new predictive variables. In particular, model (7) is extended to
$$ RV_{t+k} = \beta_0 + \beta_d RV_t + \beta_w RV_t^{(5)} + \beta_m RV_t^{(22)} + \beta_j J_t + \gamma \, r_t^{-} + \phi \, PV_t + \delta \, \hat{D}_{t+k} + \epsilon_{t+k}. \qquad (15) $$
We also consider the logarithmic form of this model, as the logarithm of the realised volatility is often believed to be a smoother process. Thus, we model $\log(1 + RV_{t+k})$ as
$$ \log(1 + RV_{t+k}) = \beta_0 + \beta_d \log(1 + RV_t) + \beta_w \log(1 + RV_t^{(5)}) + \beta_m \log(1 + RV_t^{(22)}) + \beta_j \log(1 + J_t) + \gamma \, r_t^{-} + \phi \, PV_t + \delta \, \hat{D}_{t+k} + \epsilon_{t+k}, \qquad (16) $$
where $k = 1, 5, 22$ for the 1-day, 5-day, and 22-day time horizons. We use $\log(1 + RV_t)$ instead of $\log(RV_t)$ to allow for the cases where $RV_t = 0$, and the leverage effect is measured by $r_t^{-}$ to account for the average aggregated negative returns.
The parameters in models (15) and (16) (HAR-JL-PV-D) are fitted using the random forests regression algorithm. It is important to note that for the in-sample data, we replace $\hat{D}_{t+k}$ with the actual direction $D_{t+k}$ to measure the impact of the direction variable on the forecasting of the realised volatility.
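A sketch of assembling one regressor row for the extended model is given below; the feature layout, the 1/0 direction coding, and all series are illustrative assumptions, and the jump component is omitted for brevity:

```python
import math

def extended_features(rv, pv, neg_ret, d_hat, t):
    """One regressor row at time t for the extended model: log(1+RV)
    terms at daily/weekly/monthly horizons, the purified implied
    volatility PV, the aggregated negative return, and the predicted
    direction coded as 1 (UP) / 0 (DOWN)."""
    weekly = sum(rv[t - 4:t + 1]) / 5
    monthly = sum(rv[t - 21:t + 1]) / 22
    return [1.0,
            math.log1p(rv[t]), math.log1p(weekly), math.log1p(monthly),
            pv[t], neg_ret[t],
            1.0 if d_hat[t] == "UP" else 0.0]

# Illustrative inputs only.
rv = [abs(math.sin(0.3 * t)) for t in range(40)]
pv = [0.5 + 0.01 * t for t in range(40)]
neg = [-0.02] * 40
d_hat = ["UP" if t % 2 else "DOWN" for t in range(40)]
row = extended_features(rv, pv, neg, d_hat, 30)
```

The `log1p` transform keeps the feature defined at zero volatility, mirroring the use of $\log(1 + RV_t)$ in model (16); rows like this would then be fed to the random forests regression.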
4. Discussion
Forecasting problems for financial time series are challenging since these series have a significant noise component. Currently, there is no consensus on the possibility of forecasting asset prices using technical analysis or a mathematical algorithm. The forecasting of parameters of stochastic models for financial time series, including volatility, is also challenging. Moreover, even statistical inference for the parameters of financial time series is usually difficult. An additional difficulty is that these parameters are not directly observable; they are defined by the underlying model and by many other factors. For example, it appears that the volatility depends on the sampling frequency and on the delay parameter in the model equation; see, e.g., Luong and Dokuchaev (2016). In addition, there is no unique comprehensive model for stock price evolution; for example, there are many models with stochastic equations for volatility, with jumps, with fractional noise, etc. Accordingly, even a modest improvement in the forecasting of parameters of financial time series would be beneficial for practitioners.
Our paper explored the HAR model (Corsi and Reno 2009), with the main focus being to extend this model family via two new features, the purified implied volatility and the forecast volatility movement, and to implement the random forests algorithm to improve the forecast of realised volatility.
By utilising the availability of high frequency data, we showed that the direction of the realised volatility can be forecast with the random forests algorithm by using the proposed technical indicators, with an accuracy of above 80% for the selected time series. However, this accuracy could be further improved if we could integrate fundamental indicators such as financial news.
The errors in forecasting the realised volatility with our proposed features also showed further improvement over the existing HAR-JL model. In particular, this was achieved through the addition of information derived from the purified volatility and the predicted direction of the volatility. We believe that the predictions of realised volatility could be further improved by using other tree-based algorithms, such as Extreme Gradient Boosting (XGBoost) or Bayesian additive regression trees (BART); however, we leave this for future study.