Multiscale Model Selection for High-Frequency Financial Data of a Large Tick Stock by Means of the Jensen – Shannon Metric

Modeling financial time series at different time scales is still an open challenge. The choice of a suitable indicator quantifying the distance between the model and the data is therefore of fundamental importance for selecting models. In this paper, we propose a multiscale model selection method based on the Jensen–Shannon distance in order to select the model that is able to better reproduce the distribution of price changes at different time scales. Specifically, we consider the problem of modeling the ultra high frequency dynamics of an asset with a large tick-to-price ratio. We study the price process at different time scales and compute the Jensen–Shannon distance between the original dataset and different models, showing that the coupling between spread and returns is important to model return distribution at different time scales of observation, ranging from the scale of single transactions to the daily time scale.


Introduction
The complexity of market behavior has fascinated physicists and mathematicians for many years [1]. One of the main sources of interest comes from the difficulty of modeling the rich dynamics of asset prices. In fact, since the beginning of the last century, a large set of statistical regularities of price dynamics has been identified, including the asymptotically power-law distribution of returns, their lack of linear correlations coexisting with very persistent higher order correlations, the slow convergence to the Gaussian distribution, scaling properties, multifractality, etc. [2][3][4]. The modeling activity has been correspondingly very intense, considering models both in discrete and in continuous time, and including random walks, Levy processes, stochastic volatility models, multifractal models, etc. [1,5][6][7][8]. However, up to now, there is no consensus on a model that is able to reproduce all the statistical regularities, and therefore, there is a growing interest in methods that allow one to discriminate, among different models, those better suited to describe financial data.
A specific challenge is the modeling of how the return distribution changes at different time scales [3]. Due to the presence of fat-tailed distributions, also at very short time scales, and of non-linear time correlations, the dynamics of the price-change distribution is far from trivial and not well described by any model. The problem becomes even harder when one wants to describe the price-change distribution also at the shortest time scales, i.e., when the discrete nature of trading appears. Trading and, correspondingly, price changes occur at discrete times. Moreover, an asset price cannot assume arbitrary values, but is constrained to live on a grid of values fixed by the exchange. The tick size is the smallest interval between two prices, i.e., the grid step. Since the tick size can be a sizable fraction of the asset price, when seen at small time scales, price movement appears as a (non-trivial) random walk on a grid, with jumps occurring at random times, while at large time scales, one can probably neglect the microstructural issues and describe the dynamics with a more traditional stochastic differential equation or time series approach. One of the main methodological problems is, therefore, to have a method to compare data and model predictions at different time scales.
In this work, we propose to perform multiscale model selection for financial time series by using the Jensen-Shannon distance [9][10][11], and we specifically consider the case of models describing the high frequency dynamics of large tick assets, i.e., assets where the ratio between tick and price is relatively large [12,13]. We perform the model selection at different scales $m$, representing the level of aggregation of the time series. In other words, given the return time series, $x(t)$, we study the properties of the probability distribution of its sums $y_m = \sum_{t=1}^{m} x(t)$. It is important to clarify that we do not perform a goodness-of-fit test at different scales $m$ defining a p-value relative to a specific statistic, e.g., the Kolmogorov-Smirnov statistic, etc. [14]. Our analysis consists, instead, in the comparison between the probability distribution computed from empirical data and those computed from synthetic data generated by specific statistical models. The discrepancy is measured by the Jensen-Shannon distance. In particular, by considering a class of models recently proposed [15], we show that models containing the coupling between price and spread, as well as the time correlation of the spread, outperform models without these characteristics in describing the change of the shape of the return distribution across scales.
The paper is organized as follows. In Section 2, we illustrate the definitions of the Jensen-Shannon divergence and distance, and we characterize the unavoidable bias due to the finiteness of the data sample. In Section 3, we illustrate the statistical models of mid-price and spread dynamics developed in [15]. Moreover, we apply the Jensen-Shannon distance criterion to select among three competing models of the dynamics of the price of a large tick asset, namely Microsoft. Finally, in Section 4, conclusions and perspectives are discussed.

Jensen-Shannon Distance
Distance or divergence measures are of key importance in a number of theoretical and applied statistical inference and data processing problems, such as estimation, detection, compression and model selection [16]. Among the proposed measures, one of the best known is the Kullback-Leibler (KL) divergence between two distributions, $D_{KL}(p\|q)$ [17], also called relative entropy. It is a measure of the inefficiency of assuming that the distribution is $q$ when the true distribution is $p$. It is used in many different applications, such as econometrics [18], clustering analysis [19], multivariate analysis [20,21], neuroscience [22] and discrete systems [23]. We will limit the following discussion to discrete probability distributions, but the results can be generalized to probability density functions.
Let $X$ be a discrete random variable with support $\mathcal{X}$ and probability mass function $p(x)$, $x \in \mathcal{X}$. If $q(x)$ is another probability mass function defined on the same support, $\mathcal{X}$, the KL-divergence is defined as:
$$D_{KL}(p\|q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$
where the base of the logarithm is two. We use the convention that $0 \log(0/0) = 0$ and the convention, based on continuity arguments, that $0 \log(0/q) = 0$. If there is any symbol, $x \in \mathcal{X}$, such that $p(x) > 0$ and $q(x) = 0$, then $D_{KL}(p\|q)$ is undefined. This means that distribution $p$ has to be absolutely continuous with respect to $q$ for the KL-divergence to be defined [24]. It is well known that $D_{KL}(p\|q)$ is non-negative and additive, but not symmetric [24]. In order to overcome these problems, Lin [11] defined a new symmetric divergence, called L divergence:
$$D_L(p, q) = D_{KL}(p\|m) + D_{KL}(q\|m)$$
where $m = (p + q)/2$ is the "mean" probability mass function. $D_L(p, q)$ vanishes if and only if $p = q$. The L divergence is symmetric and bounded by $D_L(p, q) \le 2$. It is worth noticing that the L divergence can be expressed in terms of the Shannon entropy, $H$, as:
$$D_L(p, q) = 2H(m) - H(p) - H(q)$$
i.e., it is the difference between twice the entropy of the mean distribution, $m$, and the sum of the entropies of $p$ and $q$. The generalization of the L divergence is the Jensen-Shannon divergence [11], defined as:
$$Div_{JS}(p, q) = H(\pi_1 p + \pi_2 q) - \pi_1 H(p) - \pi_2 H(q)$$
where $\pi_1, \pi_2 \ge 0$, $\pi_1 + \pi_2 = 1$ are the weights of the probability distributions, $p$ and $q$, respectively. According to this new definition, $D_L(p, q) = 2\,Div_{JS}(p, q)$, for $\pi_1 = \pi_2 = 1/2$. Endres et al. [9] found that the square root of $D_L$ is a metric, i.e., it fulfills the triangle inequality. They named this new information metric the Jensen-Shannon distance, $D_{JS}$:
$$D_{JS}(p, q) = \sqrt{D_L(p, q)}$$
The bounds of this distance are: $0 \le D_{JS} \le \sqrt{2}$. The Jensen-Shannon divergence is used also in statistical mechanics [25], quantum mechanics [26], thermodynamics [27], networks [28], particle physics [29], biology [30] and cosmology [31].
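As a minimal sketch of these definitions, the snippet below computes the Jensen-Shannon divergence and distance for two probability vectors on a common support. It assumes the entropy form $Div_{JS} = H(\pi_1 p + \pi_2 q) - \pi_1 H(p) - \pi_2 H(q)$ with base-two logarithms and $D_{JS} = \sqrt{2\,Div_{JS}}$ for equal weights; the function names are illustrative and do not come from the paper.

```python
import math

def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(p, q, w1=0.5, w2=0.5):
    """Div_JS = H(w1*p + w2*q) - w1*H(p) - w2*H(q)."""
    mix = [w1 * a + w2 * b for a, b in zip(p, q)]
    return entropy(mix) - w1 * entropy(p) - w2 * entropy(q)

def js_distance(p, q):
    """D_JS = sqrt(D_L) = sqrt(2 * Div_JS) for equal weights."""
    # max(0, .) guards against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * js_divergence(p, q)))

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(js_distance(p, p))                    # identical distributions
print(js_distance(p, q))                    # partially overlapping supports
print(js_distance([1.0, 0.0], [0.0, 1.0]))  # disjoint supports reach sqrt(2)
```

Unlike the KL-divergence, the computation never divides by a zero probability, so it remains defined when the two supports differ.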
In this paper, we are interested in using the Jensen-Shannon distance as a method for selecting, among a set of models, the one that best describes a given dataset. We are concerned with the case when our data is represented by a discrete time series of length $N$. When considering different competing models, we search for the best model describing the probability distribution of the aggregation (i.e., sum) of the time series at different time scales $m$. Moreover, the use of the Jensen-Shannon distance allows us to compare two empirical distributions.
To be more specific, consider the random variable, $x$, taking values from the set $\{x_1, \ldots, x_k\}$ and observed $N$ times, where $n_i$ is the number of times the outcome was $x_i$. The frequency vector $f = (n_1/N, \ldots, n_k/N)$ is an estimator of the probability distribution, $p$. We want to perform a statistical analysis at different scales of aggregation, i.e., we study the probability distribution, $p_m$, and frequency distribution, $f_m$, of the sum $\sum_{t=1}^{m} x(t)$, where the value, $m$, defines the scale. The probability distribution of the elementary process $x(t)$, corresponding to $m = 1$, is denoted by $p_{m=1} = p$. If the initial dataset had $N$ values, the scale, $m$, is limited by $1 \le m \le N$. The number of experimental data available at each aggregation scale $m$ reduces to $N_m \equiv \lfloor N/m \rfloor$, because we sum the experimental data that belong to the $N_m$ non-overlapping windows of length $m$.
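The aggregation step can be sketched as follows, assuming that any incomplete trailing window is discarded so that exactly $\lfloor N/m \rfloor$ sums are retained. The helper names `aggregate` and `frequencies` are hypothetical.

```python
from collections import Counter

def aggregate(x, m):
    """Sum x over the floor(N/m) non-overlapping windows of length m."""
    n_windows = len(x) // m
    return [sum(x[i * m:(i + 1) * m]) for i in range(n_windows)]

def frequencies(y, support):
    """Frequency vector (n_1/N, ..., n_k/N) of y over a fixed, ordered support."""
    counts = Counter(y)
    return [counts[v] / len(y) for v in support]

x = [1, 0, 2, 1, 1, 0, 2]   # toy series with N = 7
y2 = aggregate(x, 2)        # three windows; the last observation is dropped
print(y2)
print(frequencies(y2, support=range(0, 5)))
```

Fixing the support in advance keeps the frequency vectors of different samples aligned component by component, which is what a distance between them requires.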
In order to select the best model that describes the data at all aggregation scales, we compute the Jensen-Shannon distances for various values of $m$, i.e., $D_{JS}(p_m, f_m)$. We estimate $p_m$ according to different statistical models, and we select the one that minimizes $D_{JS}(p_m, f_m)$ for the different values of $m$. As will be clear below, we will also need to compute the distance between two frequency distributions in order to compare two different datasets, $D_{JS}(f_{1,m}, f_{2,m})$. In this case, we assume that the length of the two datasets is the same. It is important to stress that even if we knew the true distribution, $p_m$, the distance, $D_{JS}(p_m, f_m)$, inferred from a finite sample of data, would be larger than zero. The fluctuations of $f_m$ from dataset to dataset may not only result in fluctuations of the numerical values of $D_{JS}$, but also in a systematic shift, i.e., bias, of the numerical values of $D_{JS}$. This bias is identified with the expectation value, $E[D_{JS}(p_m, f_m)] \neq 0$, for the various values of the scale, $m$. The bias is also present if we compute the distance, $D_{JS}(f_{1,m}, f_{2,m})$, between two frequency vectors that are computed from datasets representing the same stochastic process.
The concept of a systematic bias of the numerical values of the Jensen-Shannon divergence, $Div_{JS}$, is well known in the literature, and it is connected to the systematic bias in the estimation of entropy. It follows directly from the Jensen inequality [17] that the expected value, $E[H(f)]$, of the entropy computed from an ensemble of finite-length sequences cannot be greater than the theoretical value, $H(p)$, of the entropy computed from the (unobservable) probabilities:
$$E[H(f)] \le H(p)$$
where the expectation is defined over the ensemble of finite-length i.i.d. sequences generated by the probability distribution, $p$. It can be shown that the expected value of the observed entropy is systematically biased downwards from the true entropy:
$$E[H(f)] = H(p) - \frac{k-1}{2N \ln 2} + O(1/N^2)$$
where $k$ is the number of components of the probability and frequency vectors, $p$ and $f$, and $N$ is the length of the sequences. This result was obtained by Basharin [32] and Herzel [33], who pointed out that to the first order, $O(1/N)$, the bias is independent of the actual distribution, $p$. The term of order $O(1/N^2)$ involves the unknown probabilities $p = (p_1, \cdots, p_k)$ and cannot be estimated in general [34][35][36].
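The downward bias of the plug-in entropy can be checked numerically. The sketch below is a Monte Carlo estimate of $E[H(f)]$ for a hypothetical four-state distribution, compared with the first-order prediction $H(p) - (k-1)/(2N \ln 2)$ in bits; this leading-order form is taken as an assumption, and the distribution, sample length and number of replications are arbitrary illustrative choices.

```python
import math
import random
from collections import Counter

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def mean_plugin_entropy(p, n_obs, n_rep, seed=0):
    """Average plug-in entropy over n_rep i.i.d. samples of length n_obs."""
    rng = random.Random(seed)
    k = len(p)
    total = 0.0
    for _ in range(n_rep):
        counts = Counter(rng.choices(range(k), weights=p, k=n_obs))
        total += entropy([counts[i] / n_obs for i in range(k)])
    return total / n_rep

p = [0.4, 0.3, 0.2, 0.1]   # illustrative distribution, k = 4
n_obs = 200
observed = mean_plugin_entropy(p, n_obs, n_rep=2000)
predicted = entropy(p) - (len(p) - 1) / (2 * n_obs * math.log(2))
print(entropy(p), observed, predicted)
```

The first-order correction depends only on $k$ and $N$, so the same check can be repeated with any other distribution on four states.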
Grosse et al. [37] derived an analytical approximation of the expected value of $Div_{JS}(f_1, f_2)$ between two i.i.d. sequences of length $N$ coming from the same probability distribution, which is:
$$E[Div_{JS}(f_1, f_2)] \approx \frac{k-1}{4N \ln 2} \tag{8}$$
Clearly, also in the case of the Jensen-Shannon distance, $D_{JS}$, there is a systematic positive bias.

A Simple Binomial Model
In this section, we present a toy example of the use of the Jensen-Shannon distance for model selection. The purpose of the section is mostly didactic and serves to show the multiscale procedure and the issues related to the finiteness of the sample that will be present also in the real financial case described in the next section.
Let us consider a process, which at scale $m = 1$ is a binomial i.i.d. process, i.e., $p_{m=1}$ is described by $B(n, p_B)$, where $p_B$ describes the probability of success. The sum of $m$ i.i.d. binomial variables is still described by a binomial distribution [38], and its support is a set composed of $k = nm + 1$ elements. Given a time series of length $N$, at each aggregation scale, $m$, we have $N_m \equiv \lfloor N/m \rfloor$ observations from non-overlapping windows, and we measure the frequency vector $f_m = (n_{m,1}/N_m, \ldots, n_{m,k}/N_m)$, where $n_{m,i}$ is the number of occurrences of the event, $i$, at scale $m$.
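The probability distribution of empirical frequencies is given by the multinomial distribution:
$$P(n_{m,1}, \ldots, n_{m,k}) = \frac{N_m!}{\prod_{i=1}^{k} n_{m,i}!} \prod_{i=1}^{k} p_{m,i}^{\,n_{m,i}}$$
In principle, one can compute exactly the moments of the distances, $D_{JS}(p_m, f_m)$ and $D_{JS}(f^1_m, f^2_m)$, which are:
$$E[D_{JS}^r(p_m, f_m)] = \sum_{n_m} P(n_m)\, D_{JS}^r(p_m, n_m/N_m) \tag{10}$$
and
$$E[D_{JS}^r(f^1_m, f^2_m)] = \sum_{n^1_m} \sum_{n^2_m} P(n^1_m)\, P(n^2_m)\, D_{JS}^r(n^1_m/N_m, n^2_m/N_m) \tag{11}$$
where $n_m = (n_{m,1}, \ldots, n_{m,k})$ is the vector of counts. These expressions can be used to compute the mean and variance of the Jensen-Shannon distance, as well as of the Jensen-Shannon divergence. The computational problem with these expectations lies in the values of $k = nm + 1$ and of $N$, because the size of the sums grows dramatically with the number of categories of the multinomial distribution and, hence, with the scale, $m$. The support of the multinomial distribution for the scale, $m$, has a number of elements:
$$\binom{N_m + k - 1}{k - 1}$$
For example, if $N = 1000$, $n = 2$ and $m = 1$, the number of elements is approximately $5 \times 10^5$.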
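The size of the multinomial support is the number of ways of writing $N_m$ as an ordered sum of $k$ non-negative integers, i.e., the binomial coefficient $\binom{N_m + k - 1}{k - 1}$. The short sketch below evaluates it for a few scales and reproduces the order of magnitude quoted for $N = 1000$, $n = 2$, $m = 1$; the function name is illustrative.

```python
import math

def multinomial_support_size(n_obs, k):
    """Number of count vectors (n_1, ..., n_k), n_i >= 0, that sum to n_obs."""
    return math.comb(n_obs + k - 1, k - 1)

N, n = 1000, 2
for m in (1, 2, 5, 10):
    N_m, k = N // m, n * m + 1
    print(m, k, multinomial_support_size(N_m, k))
```

The count grows so quickly with $k$ that an exact enumeration is impractical beyond the smallest scales.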
To handle this problem, we compute these expectations by means of Monte Carlo simulations, and we replace ensemble averages with sample averages, i.e., for example:
$$E[D_{JS}(f^1_m, f^2_m; N)] \approx \langle D_{JS}(f^1_m, f^2_m; N) \rangle = \frac{1}{N_r} \sum_{j=1}^{N_r} D_{JS}(f^{1,j}_m, f^{2,j}_m; N)$$
As expected, the bias decreases with $N$ and increases with $m$. By using the result in Equation (8) for sequences of i.i.d. observations, we are able to compute analytically the shape of the initial part of the curve corresponding to the Jensen-Shannon divergence. In fact, in our framework, we should perform the following substitutions in Equation (8), $N \to N/m$ and $k \to nm + 1$, and we thus obtain that the scaling of the Jensen-Shannon divergence as a function of $N$ and $m$ for the binomial model is:
$$E[Div_{JS}(f^1_m, f^2_m; N)] \approx \frac{n m^2}{4N \ln 2}$$
This approximation becomes more accurate as $N$ increases, as we can observe in Figure 1.
In the case of the Jensen-Shannon distance, we do not have any analytic result and limit ourselves to a power-law fit, $c\,m^{e}$, of the initial part of the curve. For the case $N = 10^6$, the fit gives $c = (1.1 \pm 0.1) \times 10^{-3}$, $e = (1.0 \pm 0.1)$. The initial part of the curve appears to scale linearly with the scale $m$, i.e., $E[D_{JS}(f^1_m, f^2_m; N)] \propto m$.
In order to illustrate how to perform model selection with the Jensen-Shannon distance, we consider the case of an (artificial) sample generated from the binomial model with $p_B = 0.5$. We then compare the Jensen-Shannon distance between this sample and another realization of the model with the same parameter with the distance between this sample and a realization of the model with a different parameter, $p_B \neq 0.5$. As expected, Figure 2 shows that the expected value of the Jensen-Shannon distance between two samples generated by the model with the same parameter is always smaller than the distance between two samples with a different parameter. Moreover, the distance between a sample and the true probability distribution is smaller than the distance between two samples of the same model.
This simple observation suggests a procedure for selecting models by using the Jensen-Shannon distance. Specifically, suppose that $f^{sam}_{m=1}$ represents the frequency vector computed from the sample of length $N$, but we do not know the true model that generates it. Suppose we have a statistical model, from which we are able to simulate an output of the same length. In this case, we can compute a frequency, $f^{mod}_{m=1}$, from our reference model. To compare the two processes at different scales $m$, we compute the frequencies, $f^{sam}_m$ and $f^{mod}_m$, from the sums of the two time series over $\lfloor N/m \rfloor$ non-overlapping windows.
If we have different competing models $M_1, M_2, \cdots$, we generate synthetic samples of length $N$ and compute the distances $D_{JS}(f^{sam}_m, f^{mod}_m; l)$, where the index, $l$, runs over the possible different models. The model that minimizes the Jensen-Shannon distance at different scales $m$ is the model that reproduces the data better. It is clear that even if we had the true model, the minimum distance at different scales will be different from zero. This is because, as we have seen before, $E[D_{JS}(f^1_m, f^2_m; N)]$ is larger than zero, even when the two samples come from the real model. As we will see in the financial case in the next section, one can split the real sample into two subsamples of length $N/2$ and compute their Jensen-Shannon distance, to be used as a reference line with respect to the Jensen-Shannon distance between the data and the models.
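The whole procedure can be sketched on the binomial toy model. In the snippet below, the "empirical" sample is itself simulated with $p_B = 0.5$, the two candidate models and all numerical values ($n$, $N$, the scales, the seed) are hypothetical choices made for illustration, and a single synthetic realization per model is used instead of an average over many realizations.

```python
import math
import random
from collections import Counter

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_distance(p, q):
    """Jensen-Shannon distance, sqrt(2*H(m) - H(p) - H(q)), in base-two units."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return math.sqrt(max(0.0, 2 * entropy(mix) - entropy(p) - entropy(q)))

def freq_at_scale(x, m, k):
    """Frequencies of sums over non-overlapping windows of length m; support 0..k-1."""
    sums = [sum(x[i:i + m]) for i in range(0, len(x) - m + 1, m)]
    counts = Counter(sums)
    return [counts[v] / len(sums) for v in range(k)]

def simulate_binomial(n_obs, n, p_b, rng):
    """n_obs i.i.d. draws from B(n, p_b), built from Bernoulli trials."""
    return [sum(rng.random() < p_b for _ in range(n)) for _ in range(n_obs)]

def distance_profile(x1, x2, n, scales):
    """D_JS between the two series at each aggregation scale m."""
    return [js_distance(freq_at_scale(x1, m, n * m + 1),
                        freq_at_scale(x2, m, n * m + 1)) for m in scales]

rng = random.Random(1)
n, N, scales = 2, 20000, [1, 2, 5, 10]
data = simulate_binomial(N, n, 0.5, rng)       # stands in for the empirical sample
models = {"p_B=0.5": 0.5, "p_B=0.6": 0.6}     # hypothetical competing models
profiles = {name: distance_profile(data, simulate_binomial(N, n, pb, rng), n, scales)
            for name, pb in models.items()}
# Reference line: distance between the two halves of the sample itself.
reference = distance_profile(data[:N // 2], data[N // 2:], n, scales)
for name, prof in profiles.items():
    print(name, [round(d, 4) for d in prof])
print("reference", [round(d, 4) for d in reference])
```

The model with the lower profile across scales is the preferred one, while the reference profile indicates how small a distance can be expected from finite-sample fluctuations alone.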

Application to High Frequency Financial Data
In this section, we use the above multiscale procedure, based on the Jensen-Shannon distance, in order to select the best statistical model in the particular case of models describing the high frequency price dynamics of a large tick asset. The models used here were introduced by Curato and Lillo [15], and the data refer to NASDAQ (National Association of Securities Dealers Automated Quotations) stocks at the time scale of single transactions, traded during July and August 2009 (see [15] for more details).

Bid-Ask Spread and Price Dynamics
In financial markets, there are two important prices at each time $t$: the ask price, $p_{ASK}(t)$, and the bid price, $p_{BID}(t)$. A customer that wants to buy (sell) a certain volume of the stock submits a buy (sell) market order, which is executed at the ask (bid) price, $p_{ASK}$ ($p_{BID}$). From these two prices, we define the mid-price $p_{mean}(t) = (p_{ASK}(t) + p_{BID}(t))/2$. Our models are defined in transaction time, which is an integer counter of events defined by the execution of a market order, i.e., $t \in \mathbb{N}$. Note that if a market order is executed against several limit orders, our clock advances only by one unit. The price of an order cannot assume arbitrary values, but must be placed on a grid of fixed values determined by the exchange. The grid step is defined by the tick size, and it is measured in the currency of the asset. The presence of a finite tick size implies that the bid and ask prices can be represented by integer numbers, i.e., $p_{ASK}, p_{BID} \in \mathbb{N}$, for which the unit is represented by the tick size. Our models are defined by the dynamics of two stochastic variables, i.e., the mid-price changes $x(t, m)$ between $m$ consecutive transactions and the bid-ask spread, $s(t)$:
$$x(t, m) = 2\,(p_{mean}(t + m) - p_{mean}(t)), \qquad s(t) = p_{ASK}(t) - p_{BID}(t)$$
We measure the mid-price changes $x(t)$ in units of half tick size, i.e., $x(t) \in \mathbb{Z}$, and the bid-ask spread, $s(t)$, in units of one tick size, i.e., $s(t) \in \mathbb{N}$. The value of the integer, $m$, describes the time scale of observation of the price process. Here, we are interested in large tick size assets. Their principal property is that the possible values of spreads and mid-price changes belong to a small set of integer numbers. For example, in the investigated stock, it is $s(t) \in \{1, 2\}$ and $x(t, m = 1) \in \{-2, -1, 0, 1, 2\}$. An example of mid-price and spread dynamics is given in Figure 3.
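To make the units concrete, the following sketch derives the two variables from bid and ask quotes expressed as integer numbers of ticks. It assumes the convention $x(t, m) = 2\,(p_{mean}(t + m) - p_{mean}(t))$, which makes the mid-price change an integer number of half ticks; the quotes are made-up values and the function name is hypothetical.

```python
def spread_and_changes(ask, bid, m=1):
    """Return s(t) in ticks and x(t, m) in half ticks.

    ask, bid: integer quotes in ticks, one pair per transaction.
    Assumed convention: x(t, m) = 2 * (p_mean(t + m) - p_mean(t)).
    """
    spread = [a - b for a, b in zip(ask, bid)]
    twice_mid = [a + b for a, b in zip(ask, bid)]   # 2 * p_mean is always an integer
    changes = [twice_mid[t + m] - twice_mid[t] for t in range(len(twice_mid) - m)]
    return spread, changes

ask = [2501, 2501, 2502, 2502, 2501]   # made-up quotes, one tick = $0.01
bid = [2500, 2500, 2500, 2501, 2500]
s, x = spread_and_changes(ask, bid)
print(s)  # [1, 1, 2, 1, 1]
print(x)  # [0, 1, 1, -2]
```

With spreads restricted to one or two ticks, as in this toy series, the single-transaction changes stay within the small set $\{-2, \ldots, 2\}$.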

Markov Dynamics
We compare three different models for price dynamics in transaction time proposed in [15]. The first model, called the $M_0$ model, is defined by price changes $x(t)$ that are independent from the spread process, $s(t)$. Instead, in the other two models, i.e., the $MS$ and $MS_B$ models, there is a coupling between the process of price changes, $x(t)$, and the spread process, $s(t)$. We now define the price-change processes relative to the time scale $m = 1$, setting $x(t) = x(t, m = 1)$.
Figure 3. Dynamics of the mid-price, $p_{mean}(t)$, and bid-ask spread, $s(t)$, on the price grid determined by the finite tick size of \$0.01.
$M_0$ model. The model is defined by an i.i.d. process for $x(t)$, where the unconditional distribution, $p(x(t))$, reproduces the empirical distribution of price changes. In this case, $p(x(t) \mid s(t)) = p(x(t))$, i.e., we have independence between the two variables, $x$ and $s$.
$MS$ model. This model is defined by a particular coupling between the price changes and the spread dynamics. We start from the description of the spread process, $s(t)$, because this process will be independent from the process, $x(t)$, whereas $x(t)$ will be the dependent variable.
It is well known that the spread process, $s(t)$, is autocorrelated in time [39,40]. In our models, the spread process, $s(t)$, is represented by a stationary Markov(1) process:
$$P(s(t) = j \mid s(t-1) = i, s(t-2) = k, \ldots) = P(s(t) = j \mid s(t-1) = i) = p_{ij}$$
where $i, j \in \mathbb{N}$ are spread values. The spread process is described by the two-state transition matrix, $B \in M_{2,2}(\mathbb{R})$:
$$B = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}$$
where the normalization is given by $\sum_{j=1}^{2} p_{ij} = 1$. We find [15] that the coupling is not directly defined by the spread process, $s(t)$, but by the kind of transition between spreads. Starting from the $s(t)$ process, we can define a new stationary Markov(1) process, $z(t)$, that describes the stochastic dynamics of transitions between the states $s(t)$ and $s(t + 1)$. This Markov(1) process is defined on the four possible pairs $(s(t), s(t + 1))$ and is characterized by a four-state transition matrix, $M \in M_{4,4}(\mathbb{R})$, whose entries are determined by those of $B$ [15]. The price changes, $x(t)$, are conditioned on $z(t)$ through the rules of Equation (18); these conditioning rules are imposed by the discreteness of the price grid (see Figure 3 and [15]). In this model, we impose perfect symmetry between positive and negative values of price changes $x(t)$. The model is defined by four parameters, $p_{11}$, $p_{21}$, $\theta_1$, $\theta_4$, that can be estimated from the data.
$MS_B$ model. This model is a limit case of the $MS$ model. In this case, the spread process is an i.i.d. Bernoulli process defined by $P(s(t) = 1) = p_B$. Though $s(t)$ is an i.i.d. process, the transition process, $z_B(t)$, is a Markov(1) process, whose transition matrix depends only on $p_B$. The conditioning rules for price changes are the same as those of Equation (18). The model is now defined by only three parameters: $p_B$, $\theta_1$, $\theta_4$.
For the next section, it is useful to quantify the number of possible states of the variables, $x(t, m)$, for different scales $m$. For our models, the number of states grows linearly with $m$, i.e., $k = 1 + 4m$. We stress that we do not have analytic expressions of the probability distributions for the process $x(t, m > 1)$. This is due to the fact that the model is relatively complicated, also because it is correlated in time and, therefore, not i.i.d. For scales $m > 1$, we study our models only by Monte Carlo simulations, i.e., we generate a sample of $N$ observations of the processes defined at scale $m = 1$, and then, we study the properties of $y_m = \sum_{j=1}^{m} x(j)$ on non-overlapping windows of length $m$.
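The spread component of these models is simple to simulate. The sketch below generates a two-state Markov(1) spread chain from $p_{11}$ and $p_{21}$ and builds the pair process $z(t) = (s(t), s(t + 1))$; the parameter values are illustrative rather than estimates from the data, the labeling of the four states by pairs is an assumption, and the conditional distribution of price changes given $z(t)$ is deliberately left out because it depends on the rules given in [15].

```python
import random

def simulate_spread(n_obs, p11, p21, rng):
    """Two-state Markov(1) chain on {1, 2}; p_i1 = P(s(t+1) = 1 | s(t) = i)."""
    to_one = {1: p11, 2: p21}
    s = [1]                      # arbitrary initial state
    for _ in range(n_obs - 1):
        s.append(1 if rng.random() < to_one[s[-1]] else 2)
    return s

def transitions(s):
    """Pair process z(t) = (s(t), s(t+1)), which has four possible states."""
    return list(zip(s[:-1], s[1:]))

rng = random.Random(0)
s = simulate_spread(100000, p11=0.9, p21=0.6, rng=rng)   # illustrative parameters
z = transitions(s)
# Recover p11 from the simulated transitions as a consistency check.
p11_hat = sum(1 for pair in z if pair == (1, 1)) / sum(1 for a, _ in z if a == 1)
print(round(p11_hat, 3), sorted(set(z)))
```

Consecutive values of $z(t)$ share one spread value, which is why $z(t)$ remains a Markov(1) process even when $s(t)$ itself is i.i.d.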

Multiscale Model Selection
Our problem is now how to select the model that reproduces the data better. We focus our selection problem on the ability of the models to reproduce the price-change process at different time scales. The selection problem does not involve the spread process, $s(t)$. In this case, we study a sample of $N = 348{,}253$ price-change observations from the MSFT (Microsoft) stock.
We study this selection problem by means of the concepts developed in Section 2.1. First, we compute the Jensen-Shannon distance between two realizations of the real process. To this end, we divide the sample into two non-overlapping samples, each of length $N/2$, and we compute the two frequency vectors, $f^1_m$ and $f^2_m$, for each value of $m$. We then compute the Jensen-Shannon distance, $D_{JS}(f^1_m, f^2_m; N/2)$. It is clear that this is only one of the possible values of the random variable, $D_{JS}$, and we expect that it will be affected by fluctuations. Then, we generate $N_r = 25$ synthetic samples of length $N/2$ for each of our three models. In this way, we compute $N_r$ different frequencies $f^{model}_m$ that allow us to compute sample averages of the Jensen-Shannon distance between the data and each model, from which we obtain a mean and a standard deviation of the distance for each value of $m$. The results for the different models are reported in Figure 4. The model that reproduces the empirical data better, i.e., which is closer to $D_{JS}(f^1_m, f^2_m; N/2)$, is the $MS$ model. It is important to notice that also the $MS_B$ model reproduces the empirical data for values of the scale $m > 10$. In fact, for $m > 10$, the $MS$ and $MS_B$ models have the same ability to reproduce the empirical data. The conditioning rules of Equation (18) are critical in order to reproduce the data for values of $m > 10$. The model $M_0$, instead, appears to reproduce the data better for the scale of a single transaction, i.e., $m = 1$. This is only the consequence of the fact that the probability distribution of the $M_0$ model is exactly the same as the empirical distribution of price changes for single transactions, i.e., it reproduces the small asymmetry of the real distribution between positive and negative values of price changes. Instead, $MS$ and $MS_B$ have symmetric distributions for price changes for single transactions.
We can observe that the three models reproduce the data equally well for $m > 3{,}000$. This corresponds to a real time of the order of one hour, i.e., the daily time scale. This time scale can be interpreted as the one after which the microscopic details of price formation and market microstructure are no longer relevant in describing the dynamics of the shape of the return distribution. In other words, the coupling of the price-change process and of the bid-ask spread process appears to be the key to understanding the dynamics of prices for a large tick stock from the time scale of a single transaction to the daily time scale.
Our analysis has been performed by using the Jensen-Shannon distance. However, other distances between probability distributions exist, such as the Kolmogorov-Smirnov distance [14], the Euclidean distance and the Hellinger distance [41]. We have repeated the analysis with these distances, and we found that their ability to select between competing models is not larger than that of the Jensen-Shannon distance. In particular, the Kolmogorov-Smirnov distance is able to discriminate models only up to the scale $m \approx 100$, which means that, for this measure of discrepancy, the models reproduce the sample distribution equally well after $m = 100$. In our framework, the distance with the greatest discriminant power should be able to discriminate models also for high values of the aggregation scale, $m$. The Hellinger and Euclidean distances, instead, have a discriminant power that is similar to that of the Jensen-Shannon distance.
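For reference, the alternative distances can be written for discrete distributions on a common ordered support as in the sketch below. The normalization of the Hellinger distance (factor $1/\sqrt{2}$, so that it lies in $[0, 1]$) is one common convention and is an assumption here, as are the two example vectors.

```python
import math

def hellinger(p, q):
    """Hellinger distance, normalized to lie in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def euclidean(p, q):
    """Euclidean distance between the two probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kolmogorov_smirnov(p, q):
    """Largest gap between the two cumulative distributions."""
    gap = cp = cq = 0.0
    for a, b in zip(p, q):
        cp, cq = cp + a, cq + b
        gap = max(gap, abs(cp - cq))
    return gap

p = [0.25, 0.50, 0.25]   # illustrative distributions
q = [0.16, 0.48, 0.36]
for dist in (hellinger, euclidean, kolmogorov_smirnov):
    print(dist.__name__, round(dist(p, q), 4))
```

The Kolmogorov-Smirnov distance retains only the single largest cumulative gap, whereas the other two, like the Jensen-Shannon distance, accumulate contributions from every point of the support.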

Conclusions
One important issue for the study of price dynamics is the selection and validation of a statistical model against the empirical data. Usually, financial time-series models that work well at a fixed time scale do not work comparably well at different time scales. The Jensen-Shannon distance analysis that we have performed enables us to test accurately how well our statistical models reproduce the data and to select among a pool of competing models. We have performed the same model selection procedure with different statistical distances. We find that their power to discriminate between different competing models is not larger than that of the Jensen-Shannon distance. Moreover, for the Jensen-Shannon distance, we have good control of the finite sample properties.
Our analysis demonstrates that, for large tick assets, the coupling between mid-price dynamics and spread dynamics is important to account for the mid-price dynamics from the time scale of a single transaction to the time scale of one trading day. We believe that the described method, based on the Jensen-Shannon distance, could be used also in contexts different from the financial one investigated in this work. This method could be useful each time we want to perform a multiscale test of a model against empirical data samples.
Here, $\langle \cdots \rangle$ represents the ensemble average and $j$ represents the $j$-th simulation of the set of $N_r$. We first consider the problem of the finite sample bias in the computation of the Jensen-Shannon divergence and distance. Specifically, we compute $E[D_{JS}(f^1_m, f^2_m; N)]$ and $E[Div_{JS}(f^1_m, f^2_m; N)]$ as a function of the time series length, $N$, when the two frequency vectors are taken from two independent realizations of the same binomial model. We study the two information functionals in the range $m = 1, \cdots, 100$ for the values $N = 10^2, 10^3, 10^4, 10^5, 10^6$, as reported in Figure 1.

Figure 1. $E[Div_{JS}(f^1_m, f^2_m; N)]$ (left) and $E[D_{JS}(f^1_m, f^2_m; N)]$ (right) for the binomial model as a function of the aggregation scale, $m$, and for different values of the time series length, $N$. Results are obtained from numerical simulations, and the plots are in log-log scale.

Figure 2. Expectations and standard deviations of the Jensen-Shannon distance between two samples of the binomial model with the same parameter $p_{B,1} = p_{B,2} = 0.5$ (red squares) and with different parameters (green diamonds and blue triangles). The black circles are an estimation of the Jensen-Shannon distance between a sample and the true model.

Figure 4. Mean and standard deviation of $D_{JS}$ between Microsoft data and three models, namely $M_0$, $MS$ and $MS_B$ (see the text). The black line is the distance, $D_{JS}$, between the two subsamples of the real data obtained by splitting the sample in two. We do not display the error bars for each value of $m$, but only for 25% of them.
while, clearly, there is only one value for $s(t + 1)$. We can now define a Markov-Switching model, or Hidden Markov model, for the price changes, $x(t)$: