Gaussian Mixture and Kernel Density-Based Hybrid Model for Volatility Behavior Extraction From Public Financial Data

This paper presents a hybrid clustering model for foreign exchange market volatility clustering. The proposed model is built on a Gaussian Mixture Model, and inference is performed with the Expectation-Maximization algorithm. A one-dimensional kernel density estimator is used to build a probability density based on all historical observations. That allows us to evaluate the behavior probability of each symbol of interest. The computational results show that the approach is able to pinpoint risky and safe hours for trading a given currency pair.


Introduction
Many researchers build mathematical models and algorithms for price prediction [1] or trend classification [2]. Some of them use linear discrimination algorithms [2] or regression algorithms as in [3]. The aim of this work is to use statistical learning-a part of artificial intelligence-to find profitable hours for trading. The work focuses on the currency market as a special case of financial markets, but it is extensible to other components such as stocks, commodities, etc. Volatility is a statistical measure of the dispersion of returns for a given market index or foreign exchange symbol. Volatility shows how quickly the prices move; it can be measured by the standard deviation or variance of returns from the same security or market index. As explained in [4], the bid-ask spread, or the difference between the highest price and the lowest one in a given time frame, gives a dispersion measure. Commonly, the higher the volatility, the riskier the security. Volatility is influenced by many factors such as liquidity, interest rates [5], real estate [6], opinions [7], and a firm's market share. The authors in [8] explain how the liquidity providers of a market impact volatility and stock returns, while the study in [6] re-examines the relationship between a firm's market share and volatility.
This paper is an extended version of work published in [9] on volatility estimation. For that reason, let us first recall the main content of the previous work. It affirms that financial market calendar anomalies have long been examined and contemplated by finance professionals. Much artificial intelligence research also takes financial market stability as an application. Let us discuss the factors that influence asset prices. As introduced in [10], the trading volume is a factor that investors have considered in the prediction of prices. It is a measure of the quantity of shares that change owners for a given financial product. For example, on the New York Stock Exchange, known as the "NYSE", the daily average volume for 2002 was 1.441 billion shares, contributing to many billions of dollars of securities traded each day among the roughly 2800 companies listed on the NYSE. The daily volume of a security can fluctuate on any given day depending on the amount of published-or otherwise revealed-information available about the company. This information can be a press release or a regular earnings announcement provided by the company, or it can be a third-party communication, such as a court ruling, activity on social networks like Twitter as explained in [11], or a release by a regulatory agency pertaining to the company. Abnormally large volume is due to differences in the investors' views of the valuation after taking the new information into consideration. Because of what can be inferred from abnormal trading volume, the analysis of trading volume and of the associated price changes corresponding to informational releases has been of much interest to researchers. It is important for an investor or trader to study the stability factor of any market and the many other factors that influence prices, such as interest rates [12,13], jobs, political stability, and more.
Whatever the cause, the difference between the highest and lowest asset values-which we will call a spread in this work-will be bigger when the prices are impacted by external and perhaps unknown factors.
This paper is organized as follows: the first section introduces related works. Since this paper is an extension of [9], the related works section recalls the main guidelines explained in [9]. In addition, we discuss-in the same section-the latest related works. The section "Approach Demystification" introduces the main definitions and preliminaries in order to demystify the proposed approach. The next section, entitled "MCMC Computational Model Building", introduces the proposed hybrid model for volatility clustering. The use of the mixture model combined with the density model is clarified in this section. The section "Testing and Discussion" shows the results obtained on some symbols and discusses the agreement between the financial experts' points of view and the approach's results. Finally, the "Conclusion" recalls the main ideas and discusses applications to algorithmic trading and financial data analysis.

Existing Market Volatility Measurement
In trading, volatility is measured using certain indicators, including the average true range (ATR) and the standard deviation. Moving averages (MA) and Bollinger bands are used for trend detection, but Bollinger bands can also be used for volatility detection: the two bands diverging implies that activity (volatility) is starting. Each of these indicators can be used slightly differently to measure the volatility of an asset and interpret the data in a different format, but the computation of all those technical indicators uses contiguous historical data, which makes the signal noisy. Let us give an example: if the technical indicator is configured on p periods, the volatility measure at 14:00 is computed with the p periods just before 14:00. In fact, this way of computing ignores all the events that happen at 14:00. This is why the computation should instead be done with the price variations of each 14:00 of the last p days.
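To illustrate the difference between the two computations, here is a minimal Python sketch with synthetic hourly bars and hypothetical helper names; it contrasts the classical rolling volatility over the p most recent bars with the hour-aligned computation over the 14:00 bars of the last p days:

```python
from datetime import datetime, timedelta
import statistics

# Hypothetical hourly bars: (timestamp, high, low); values are synthetic.
bars = []
t0 = datetime(2024, 1, 1, 0, 0)
for i in range(24 * 30):  # 30 days of hourly bars
    ts = t0 + timedelta(hours=i)
    spread = 0.0005 + 0.0001 * (ts.hour % 5)  # synthetic high-low spread
    bars.append((ts, 1.1000 + spread, 1.1000))

def hour_aligned_volatility(bars, hour, p):
    """Std. dev. of the high-low spreads of the last p bars at `hour`."""
    spreads = [h - l for ts, h, l in bars if ts.hour == hour][-p:]
    return statistics.stdev(spreads)

def rolling_volatility(bars, p):
    """Classical indicator: std. dev. of the p most recent spreads."""
    spreads = [h - l for ts, h, l in bars][-p:]
    return statistics.stdev(spreads)
```

In this synthetic data set the 14:00 spread is the same every day, so the hour-aligned measure correctly reports zero volatility at 14:00, while the rolling window mixes in other hours and reports nonzero volatility.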

Proposed Volatility Classifier Recall
In this subsection, we recall the main content of the contribution [9]. We consider the volume as one of the crucial parameters. The proposed algorithm continuously supervises the evolution of the invested volume for each time frame (1 h in our case). To do so, let us call γ^v_k(h) the kth observed value of the volume at the hour h. For example, if our data set contains the measures of one month, we will have many values of the volume for the same hour h: the volume today at 14 h is not necessarily the same as yesterday at 14 h.
The stability of the forex market depends on the published news, their impacts, the behavior of the traders, and other factors [12]. In practice, we do not have access to all of that information, and some of it is hidden. Technical analysis tools like the RSI (Relative Strength Index), the MACD (Moving Average Convergence Divergence), and Moving Averages, as in [14], allow us to detect some hidden intentions, but we believe that all those factors impact the variation and the evolution speed of the prices. In some periods of the day, the same scenarios are repeated. At some hours, the market moves very fast and the decision is really too hard, even if the trends are somehow as expected for some pairs; at other hours, trading is not really interesting because the evolution is very slow. The trader will neither lose nor win in this case. This is why we have to identify those clusters, so that the trader can choose the medium hours to trade. For that, we define the price spread γ^d_k(h) at the hour h as the difference between the highest price and the lowest one within the same hour.
We have access to the archive of volumes, prices, and highest and lowest prices. That allows our online learning algorithm to build the stability indicator for each hour. Let us suppose that we have n observations for the hour h stored in the learning data set. The stability estimator's goal is mainly the study of the average behavior of the impact of the volume and spread on the price spread. We define the random variable X_n(h) in (1) as the vectorial average of all observations for the concerned hour h. Formally:

X_n(h) = (1/n) Σ_{k=1}^{n} (γ^v_k(h), γ^d_k(h))    (1)
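A minimal sketch of the vectorial average described above, with synthetic (volume, price spread) observations for a fixed hour h (names and values are illustrative, not real data):

```python
# Synthetic observations for a fixed hour h: (volume, price spread) pairs.
observations = [(1200.0, 0.0008), (1500.0, 0.0011), (900.0, 0.0006), (1100.0, 0.0009)]

def vector_mean(obs):
    """X_n(h): component-wise arithmetic mean of (volume, spread) observations."""
    n = len(obs)
    mean_volume = sum(v for v, d in obs) / n
    mean_spread = sum(d for v, d in obs) / n
    return (mean_volume, mean_spread)

x_h = vector_mean(observations)
```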

Multivariate Probability Density
Let us focus now on the probability density of the random variable X^h_n that we defined in (1). In probability theory, the central limit theorem-recalled in the Vectorial Central Limit Theorem appendix-establishes that, when independent, identically distributed random variables are averaged, their properly normalized average converges to a Gaussian distribution even if those variables themselves do not follow a normal distribution.
Since the random variable X^h_n is defined in (1) as an arithmetic mean, the theorem applies: the probability density function of the normalized variable Z^h_n is the bivariate Gaussian in (2), where |C| is the determinant of the covariance matrix C:

f(z) = (1 / (2π |C|^{1/2})) exp(-(1/2) zᵀ C⁻¹ z)    (2)
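The bivariate Gaussian density can be evaluated numerically as in the following sketch; the covariance values here are made up for illustration:

```python
import numpy as np

def bivariate_gaussian_pdf(z, C):
    """Density of a centered bivariate Gaussian with covariance matrix C."""
    det = np.linalg.det(C)
    inv = np.linalg.inv(C)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(det))
    return norm * np.exp(-0.5 * z @ inv @ z)

C = np.array([[2.0, 0.3], [0.3, 1.0]])  # hypothetical covariance matrix
p0 = bivariate_gaussian_pdf(np.zeros(2), C)  # density at the mean
```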

Target Region Definition
Let Ω v be the volume target region: a set of medium volume values between v 1 and v 2 . We avoid all volumes less than v 1 because small values mean the prices change very slowly; that is to say, the trader will neither lose nor win. Volumes greater than v 2 imply a huge investment made by big investors or a collective decision made by the majority of the traders. Generally, this happens after an important economical or political announcement; those moments are risky and recommended to be avoided. Following the same reasoning, we define Ω d to be the spread target region: a set of medium spread values between d 1 and d 2 . The final target region is given in the next definition:

Target Inclusion Probability
Since we have the density function of Z h n but the target values of X h n , we have to transform those values to be adapted to Z h n . Let Ω t be the transformed target region for the new random variable Z h n , where γ v and γ d are the arithmetic means of volume and spread respectively. In order to work with the density in Equation (2), we have to use the transformed region Ω t defined in (3). We can now move to the decision rule. In this subsection, we explain how the decision is made. We define the hypothesis H 0 , meaning that the financial market is stable and profitable (the market does not move too slowly, and there are no flash crashes and no high volatility, with a given probability). It holds if X h n is in the target region: p(X h n ∈ Ω t ) ≥ α. Otherwise, H 0 is rejected and the opposite hypothesis H 1 holds. Holding H 1 means that the trader has to avoid the risky trading at the hour h. The probability p(X h n ∈ Ω t ), which is equal to p(Z h n ∈ Ω t ), is computed using the formula in (7):
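The decision rule can be sketched as follows. This toy example estimates the inclusion probability by direct Monte Carlo sampling from a standard bivariate Gaussian over a rectangular target region; the region bounds, the threshold α, and the helper name are hypothetical:

```python
import random

random.seed(0)

def inclusion_probability(n_mc, v_bounds, d_bounds):
    """Monte Carlo estimate of p(Z in Omega_t) for a standard bivariate
    Gaussian with independent components over a rectangular region."""
    hits = 0
    for _ in range(n_mc):
        zv, zd = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
        if v_bounds[0] <= zv <= v_bounds[1] and d_bounds[0] <= zd <= d_bounds[1]:
            hits += 1
    return hits / n_mc

alpha = 0.4  # hypothetical decision threshold
p = inclusion_probability(100_000, (-1.0, 1.0), (-1.0, 1.0))
h0_holds = p >= alpha  # H0 holds: the hour is considered stable and profitable
```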

Data Set
The data comes streaming from free web services or directly from a broker. In our case, we have used the MetaTrader platform-which is by default configured for GMT+2-to download the data sets. For each financial instrument, the data contains seven main elements, with the date and time of each observation as the first components. The open and close prices are the prices at the beginning and the end of each time frame (one hour in the context of this approach). The high and low prices represent the highest and lowest values of the price between the beginning and the end of the time frame. Finally, the volume represents the invested amount of money in that time frame. The data set used in this paper contains non-labeled historical spreads for each hour of the day. In order to avoid confusion with the spread as known in financial markets, let us call ours the price spread: the difference between the initial price (the Open price described above) at the beginning of the hour and the price when the hour ends (the Close price).
The second part of the analysis process is to filter the data set. By filtering we mean that we create a data set for each currency pair and each hour of the day. For example, if we want to study the EURUSD and USDCAD pairs at 14h (from 14:00 to 14:59), we select only the price spreads of that period for the two pairs.
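A minimal sketch of this filtering step, using synthetic records and a hypothetical filter_hour helper:

```python
from datetime import datetime

# Synthetic records: (symbol, timestamp, open price, close price)
records = [
    ("EURUSD", datetime(2024, 1, 1, 14, 0), 1.1000, 1.1008),
    ("EURUSD", datetime(2024, 1, 1, 15, 0), 1.1008, 1.1003),
    ("USDCAD", datetime(2024, 1, 1, 14, 0), 1.3500, 1.3491),
    ("EURUSD", datetime(2024, 1, 2, 14, 0), 1.1010, 1.1022),
]

def filter_hour(records, symbol, hour):
    """One data set per (currency pair, hour): price spreads of that hour only."""
    return [round(c - o, 6) for s, ts, o, c in records
            if s == symbol and ts.hour == hour]

eurusd_14h = filter_hour(records, "EURUSD", 14)
```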

Visualization and Observation
In this case, we have chosen the next currency pairs: EURUSD, EURGBP, GBPUSD, AUDCAD and NZDCHF. The same process can be applied to the others. Table 1 shows some details about each currency pair above:

Approach Demystification
The idea behind the approach proposed in this paper is, on one hand, to estimate-for each hour of the day-the probability density function of the price spreads and, on the other hand, to find, for each traded symbol and each hour of the day, the probability that its behavior lies in the target region. We recall that the price spread is the difference between the highest and the lowest price. That reflects the randomness of volatility [15]-which is explained by many hidden events and pieces of information [16]-during a specific time period.
Unsupervised anomaly detection is a fundamental problem in machine learning, with critical applications in many areas, such as cyber security, medical care, complex system management, and more. That inspires us to detect the activity class of currencies at a given time. Density estimation is at the core of activity detection: given many input samples, activity corresponds to spreads residing in high probability density areas.

Main Definitions
The first step to be explained in this subsection is the meaning of the stability regions. Economical and political news announcements considerably influence financial market trends, and sometimes the prices do not move considerably for many hours. This is why we define the values m * s and m * r as decision boundaries. In this paper, the computation is done with 5 pips for m * s and 40 pips for m * r : a spread of less than 5 pips is considered a slow spread, and one of more than 40 pips a risky spread. See the region definitions below:
• Risk Region: Γ r = {m ∈ D(m), |m| ≥ m * r }, this region represents the huge price spreads considered very risky. These are periods to avoid in order to minimize the risk of being the victim of a flash crash or any other quick spread due to economical or geopolitical events.
• Slow Region: Γ s = {m ∈ D(m), |m| ≤ m * s }, this region represents the small price spreads that do not overpass the value m * s , which we can initialize with 5 pips for example.
• Target Region: Γ t = {m ∈ D(m), m * s < |m| < m * r }, is the best region: price spreads are normal and the trends can be kept for a while. Trading is recommended in this period.
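The three regions can be sketched as a simple classifier; the thresholds follow the 5-pip and 40-pip example above, and the pip size of 0.0001 is an assumption valid for 4-decimal quotes:

```python
PIP = 0.0001          # assumed pip size for 4-decimal quotes
M_SLOW = 5 * PIP      # m*_s: at or below this, the spread is "slow"
M_RISK = 40 * PIP     # m*_r: at or above this, the spread is "risky"

def classify_spread(m):
    """Map a price spread m to its region: slow, target, or risk."""
    if abs(m) <= M_SLOW:
        return "slow"
    if abs(m) >= M_RISK:
        return "risk"
    return "target"
```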

Test on Experimental Data
Using real experimental data, Figure 1 shows the price spread distributions from 10:00 to 10:59. The plot allows us to do a comparative study between the behavior of many symbols from 10:00 to 10:59. This study is done for the symbols: EURGBP, EURUSD, GBPUSD, AUDCAD, and NZDCHF. Horizontal red lines represent the risk region delimited by −m * r and m * r while horizontal green lines represent the values −m * s and m * s . The goal is to estimate for the studied time (10:00 to 10:59, in this example), the probability that the spread (price spread) is concentrated in the target region Γ t .
The same symbols are used in Figure 2 but the time slot is different. In this case, collected data was in the range 00:00 to 00:59. The volatility behavior seems to be different: most price spreads are concentrated in the slow region (−m * s and m * s ), between the two green lines.

MCMC Computational Model Building
In statistics, kernel density estimation-whose model is given in (5)-is a non-parametric way to estimate the probability density function of a random variable [17]. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made based on a finite data sample. In some fields, such as signal processing and econometrics, it is also termed the Parzen-Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form:

f̂(m) = (1/(n·h)) Σ_{i=1}^{n} K((m − m_i)/h)    (5)

The Gaussian kernel is used in our approach; it is given by Equation (6) below:

K(u) = (1/√(2π)) exp(−u²/2)    (6)

The goal is to compute the probability that the symbol behavior-represented by the spread m-is in the target region Γ t as defined before. That comes down to computing the integral of the estimated density f̂ in (5) over Γ t .
In order to compute the Gaussian integral, we rely on Monte Carlo simulation as applied in [18][19][20]. Let us generate a sample s 1 , ..., s n mc from a uniform distribution on the target region Γ t . The sample size n mc should be as big as possible (n mc → ∞). The probability p(m ∈ Γ t ) defined in (7) is approximated using the Monte Carlo stochastic method for the numerical computation of integrals. This method affirms that Δ · n −1 mc (f̂(s 1 ) + ... + f̂(s n mc )) converges to the desired value p(m ∈ Γ t ), where Δ is the total length of the target region Γ t , formally given by (m * r − m * s ) + (m * r − m * s ) = 2(m * r − m * s ). By combining Equations (5)-(7), we obtain the final estimator. Figure 3 shows the obtained target region inclusion probabilities for each hour of the day. The computation is done for many symbols at the same time in order to compare the best hours for each symbol.
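A minimal numpy sketch of the whole chain, Gaussian-kernel density estimation plus Monte Carlo integration over the two-sided target region; the data, bandwidth, and thresholds are synthetic choices, not values from the paper's data set:

```python
import numpy as np

rng = np.random.default_rng(42)

def kde(x, samples, h):
    """Gaussian-kernel density estimate f_hat(x): (1/(n*h)) sum K((x - x_i)/h)."""
    u = (x - samples[:, None]) / h          # shape (n_samples, n_points)
    return np.mean(np.exp(-0.5 * u**2), axis=0) / (h * np.sqrt(2 * np.pi))

# Synthetic historical price spreads for one hour of the day (in pips)
spreads = rng.normal(loc=15.0, scale=8.0, size=500)

m_s, m_r = 5.0, 40.0        # slow and risk thresholds (pips)
n_mc = 5000
# Uniform sample over the two-sided target region [-m_r, -m_s] U [m_s, m_r]
s = rng.choice([-1.0, 1.0], size=n_mc) * rng.uniform(m_s, m_r, size=n_mc)
delta = 2.0 * (m_r - m_s)   # total length of the target region
p_target = delta * np.mean(kde(s, spreads, h=2.0))
```

With most synthetic spreads drawn between the two thresholds, the estimated inclusion probability comes out close to 1, as expected.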

Mixture Model Building
In statistics, a mixture model is a probabilistic model for representing the presence of sub-populations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally, a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population [17,21]. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.

Likelihood Function and Missing Data
Paper [22] affirms that, when the observations can be seen as incomplete data, the general methodology used for finite Gaussian mixtures involves the representation of the mixture problem as a particular case of maximum likelihood estimation (MLE). This setup implies the consideration of two sample spaces: the sample space of the (incomplete) observations and a sample space of some complete observations, the characterization of which is that the estimation can be performed explicitly at this level, as explained in [22]. For instance, in parametric situations, the MLE based on the complete data may exist in closed form. Among the numerous reference papers and monographs on this subject are, e.g., the original EM algorithm-in Appendix A-and the finite mixture model book and references therein. We now give a brief description of this setup as it applies to finite mixture models in general. The observed data consist of n i.i.d. observations x = (x 1 , ..., x n ) from the mixture density f Θ (x) given by (9):

f Θ (x i ) = Σ_{j=1}^{J} λ j φ j (x i )    (9)
In order to simplify the likelihood, we introduce latent variables z i such that (x i |z i = j) ∼ φ j and p(z i = j) = λ j is the prior probability of x i belonging to cluster j. These auxiliary variables allow us to identify the mixture component each observation has been generated from. The mixture in (9), with the hidden variable z i , becomes h Θ (x i , z i ) in (10):

h Θ (x i , z i ) = Π_{j=1}^{J} (λ j φ j (x i ))^{1 z i =j}    (10)

where 1 z i =j is equal to 1 if the condition z i = j holds and 0 otherwise. The log-likelihood, which we denote here by L(x, z, Θ), is given by the quantity log h Θ (x, z). Equation (11) shows the final likelihood expression:

L(x, z, Θ) = Σ_{i=1}^{n} Σ_{j=1}^{J} 1 z i =j log(λ j φ j (x i ))    (11)
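A minimal numpy sketch of the EM iterations that maximize this likelihood, for a two-component one-dimensional Gaussian mixture; the data and the initialization are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1D data from two well-separated Gaussians
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(8.0, 1.0, 300)])

def em_gmm(x, n_iter=50):
    """EM for a 2-component 1D Gaussian mixture: returns (weights, means, stds)."""
    lam = np.array([0.5, 0.5])           # lambda_j: mixture weights
    mu = np.array([x.min(), x.max()])    # crude initialization
    sigma = np.array([1.0, 1.0])
    for _ in range(n_iter):
        # E-step: posterior responsibilities p(z_i = j | x_i)
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = lam * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the weighted data
        nk = resp.sum(axis=0)
        lam = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return lam, mu, sigma

lam, mu, sigma = em_gmm(x)
```

On this toy data the fitted means land near the true cluster centers 0 and 8, and the weights near 0.5 each; a library implementation such as scikit-learn's GaussianMixture could be used in place of this sketch.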

Results and Discussions
This section presents the obtained results, visualized in scatter plots. Figures 4 and 5 show the profitability of trading the EURUSD and NZDCHF symbols during the 24 h of the day. Each data point represents the Yes (trade it) and No (do not) probabilities and the cluster obtained with the Gaussian mixture. The scatter point shape (circle or triangle) represents the cluster and differentiates the hours at which it is recommended to trade the symbol from those at which it is not. In this case, the triangle represents the positive cluster and thus recommends trading the symbol.
We can select only a subset of hours that concern the trader and run this recommendation approach on it. That allows the trader to manage his scheduling with more flexibility. The concept can also be integrated into a trading robot implementing a strategy that can fail with small or very high volatility.

Conclusions
In this paper, we presented a currency market volatility estimator whose proposed computational model relies on a Gaussian mixture model combined with a Gaussian kernel density. We defined three classes for the volatility variation: the slow region, the target region, and the risky region. The main goal of the proposed approach is to measure-for each hour of the day-the probability that a currency's activity is included in a given region. The proposed model has a range of applications in financial markets. On one hand, its implementation in a platform like MetaTrader allows the trader to predict the symbol activity instantaneously; in other words, the trader can select profitable symbols at each hour of the day. On the other hand, the model can be implemented in a trading robot in order to find sufficient volatility to run a given strategy. This work is supported by real data analysis, and the results confirm that the model has a level of accuracy close to that of financial professionals.
The Monte Carlo method is used to compute the integrals in this approach. The computational aspects of this method need deeper study in further works. As perspectives, a comparative study of computational complexity and accuracy has to be done between the different sampling methods, such as Metropolis-Hastings, Gibbs sampling, and other Markov Chain Monte Carlo variants. The profitable zone's upper and lower bounds have been established as percentages in order to carry out the study. The models in this paper will be implemented in a real technical indicator embedded in the MetaTrader platform. The profitable zone parameters will be considered as two inputs given by the user according to the sensitivity that he wants. Since some strategies need high volatility to be profitable, we can conduct another study, as a perspective, to find the suitable values for different trading styles.
We think that artificial intelligence-driven algorithmic trading is the future of financial markets. For that reason, further works will focus on the design of an expert system-a trading robot, or expert advisor as it is called in the MetaQuotes community-that uses machine learning for price and trend prediction, triggered by the result of the volatility estimator in this work. We believe that this collaborative distributed intelligence can produce a new kind of trading robot.
We express our strong belief that a cross-disciplinary approach combining data and artificial intelligence will give algorithmic trading new momentum. Finally, we hope that this work will contribute to the development of this rising field.