# A Note on Combining Machine Learning with Statistical Modeling for Financial Data Analysis


## Abstract

## 1. Introduction

## 2. Preliminary Considerations and General Ideas

## 3. A Practical Example

#### 3.1. Distribution Modeling

#### 3.2. Financial Risk Measures

#### 3.3. Combining the Prior with Nonparametric Estimation

## 4. Empirical Illustration

## 5. Discussion and Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A. Some Classes of Beta-Generated Distributions

1. Further advantages are that semiparametric modeling can help to overcome the curse of dimensionality and that semiparametric models are more robust to the choice of smoothing parameters.

2. One may develop numerical approximations working with (6), but this is beyond the scope of this note. Moreover, the above studies suggest that the gain from using the more complex Type 2 class is rather marginal, and any such advantage is easily compensated for by the local estimator.

**Figure 1.** Probability density functions. Left: the skewed $t_1$ model (1) for $(a,b)=(2,2)$, $(5,2)$, $(8,2)$, $(2,5)$, and $(2,8)$. Center: the skewed $t_2$ model (2) for $(a,b,c)=(2,2,0.5)$, $(8,2,0.5)$, $(5,2,0.5)$, $(2,5,0.5)$, and $(2,8,0.5)$. Right: the skewed $t_2$ model (2) for $(a,b,c)=(2,2,2)$, $(8,2,2)$, $(5,2,2)$, $(2,5,2)$, and $(2,8,2)$.
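The curves in the left panel can be sketched numerically. The block below is a minimal illustration that assumes the Type 1 model is the skew t distribution of Jones and Faddy (2003), with density proportional to $(1+r)^{a+1/2}(1-r)^{b+1/2}$, where $r=t/\sqrt{a+b+t^{2}}$. That identification is an assumption, and the function name `skew_t1_pdf` and the evaluation grid are hypothetical choices made only for this example.

```python
import numpy as np
from scipy.special import betaln


def skew_t1_pdf(t, a, b):
    """Density of the Jones-Faddy skew t (assumed Type 1 model), a, b > 0."""
    t = np.asarray(t, dtype=float)
    r = t / np.sqrt(a + b + t**2)  # lies in (-1, 1)
    # log of the normalising constant 2^(a+b-1) * B(a, b) * sqrt(a + b)
    log_c = (a + b - 1.0) * np.log(2.0) + betaln(a, b) + 0.5 * np.log(a + b)
    log_f = (a + 0.5) * np.log1p(r) + (b + 0.5) * np.log1p(-r) - log_c
    return np.exp(log_f)


# Parameter pairs listed in the Figure 1 caption (left panel)
pairs = [(2, 2), (5, 2), (8, 2), (2, 5), (2, 8)]
grid = np.linspace(-6.0, 6.0, 241)  # hypothetical plotting grid
curves = {ab: skew_t1_pdf(grid, *ab) for ab in pairs}
```

For $a=b$ this density reduces to a Student t with $2a$ degrees of freedom, which gives a quick sanity check on the implementation.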

**Figure 2.** Plots of the theoretical cdfs of the skewed t models (left: ${T}_{1}$ model; right: ${T}_{2}$ model) and the empirical cdf. Stocks: Amadeus; BBVA.

**Figure 3.** Estimates $\widehat{\mu}_{1}$ and $\widehat{\log \mu_{2}}$ for BBVA stock returns as functions of IBEX35.

**Figure 4.** Conditional densities of stock returns at the quantiles of IBEX35 for Amadeus (**upper left**), BBVA (**upper center**), Mapfre (**upper right**), Repsol (**lower left**), and Telefónica (**lower right**).

**Figure 5.** Unconditional densities of stock returns obtained from integrating the conditional ones over all observed IBEX35 values.
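Integrating a conditional density over the observed covariate values amounts to averaging it over the sample, $\widehat{f}(y)=n^{-1}\sum_{i} f(y\mid x_{i})$. The sketch below shows only that averaging step. The Gaussian conditional density `toy_cond_pdf` is a hypothetical stand-in and not the skewed t model of the paper, and the covariate sample is simulated rather than IBEX35 data.

```python
import numpy as np
from scipy.stats import norm


def unconditional_density(y_grid, x_obs, cond_pdf):
    """Average cond_pdf(y | x_i) over the observed covariate values x_i."""
    y_grid = np.asarray(y_grid, dtype=float)
    x_obs = np.asarray(x_obs, dtype=float)
    # rows: observations x_i, columns: grid points y
    dens = np.stack([cond_pdf(y_grid, x) for x in x_obs])
    return dens.mean(axis=0)


def toy_cond_pdf(y, x):
    """Hypothetical conditional density: Gaussian with x-dependent mean and scale."""
    return norm.pdf(y, loc=0.8 * x, scale=0.01 + 0.5 * abs(x))


rng = np.random.default_rng(0)
x_obs = rng.normal(0.0, 0.01, size=500)  # simulated index returns
y_grid = np.linspace(-0.2, 0.2, 2001)
f_y = unconditional_density(y_grid, x_obs, toy_cond_pdf)
```

Any conditional density with the signature `cond_pdf(y_grid, x)` can be passed in, so a fitted conditional model could replace the toy one without changing the averaging function.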

| Stocks | Amadeus | BBVA | Mapfre | Repsol | Telefónica |
|---|---|---|---|---|---|
| Maximum daily return | 0.046286 | 0.040975 | 0.050847 | 0.073466 | 0.062264 |
| Minimum daily return | −0.097367 | −0.060703 | −0.067901 | −0.0877323 | −0.051563 |
| Mean | 0.000900 | −0.000452 | −0.000623 | −0.001416 | −0.000408 |
| Standard deviation | 0.014601 | 0.016249 | 0.015942 | 0.021349 | 0.016301 |
| Skewness | −1.163797 | −0.465779 | −0.723655 | −0.166165 | 0.130372 |
| Kurtosis | 10.292160 | 3.824688 | 4.873980 | 5.435928 | 4.422885 |
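Summary statistics of this kind can be computed from a series of daily returns as sketched below. The source does not say whether the reported kurtosis is raw or excess, nor which variance normalisation was used, so the choices here (raw kurtosis, `ddof=1`) are assumptions. The function name `describe_returns` is hypothetical and the input series is simulated.

```python
import numpy as np
from scipy import stats


def describe_returns(r):
    """Summary statistics of a return series (conventions are assumptions)."""
    r = np.asarray(r, dtype=float)
    return {
        "max": r.max(),
        "min": r.min(),
        "mean": r.mean(),
        "std": r.std(ddof=1),  # sample standard deviation
        "skewness": stats.skew(r),
        "kurtosis": stats.kurtosis(r, fisher=False),  # raw kurtosis, normal = 3
    }


rng = np.random.default_rng(0)
simulated = 0.01 * rng.standard_t(5, size=1000)  # simulated heavy-tailed returns
summary = describe_returns(simulated)
```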

**Table 2.** Maximum likelihood estimates for the skewed t model of Type 1, standardized data. Standard errors are in parentheses.

| Stocks | Amadeus | BBVA | Mapfre | Repsol | Telefónica |
|---|---|---|---|---|---|
| $\widehat{a}$ | 6.194309 | 10.773980 | 7.271484 | 5.009976 | 7.083988 |
| | (2.378890) | (8.474473) | (3.684818) | (1.980271) | (3.810294) |
| $\widehat{b}$ | 6.171897 | 10.76088 | 7.250156 | 5.005015 | 7.086958 |
| | (2.378415) | (8.477441) | (3.686769) | (1.980433) | (3.810390) |
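A maximum likelihood fit of this kind can be sketched as follows, again assuming that the Type 1 model is the skew t of Jones and Faddy (2003). The optimiser (Nelder-Mead), the log-parametrisation, and the starting values are illustrative choices, and the names `skew_t1_nll` and `fit_skew_t1` are hypothetical. The data are simulated directly from the assumed density through its beta representation, with no standardization step, and standard errors are not computed.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln


def skew_t1_nll(params, z):
    """Negative log-likelihood; params = (log a, log b) keeps a, b > 0."""
    a, b = np.exp(params)
    r = z / np.sqrt(a + b + z**2)
    log_c = (a + b - 1.0) * np.log(2.0) + betaln(a, b) + 0.5 * np.log(a + b)
    log_f = (a + 0.5) * np.log1p(r) + (b + 0.5) * np.log1p(-r) - log_c
    return -np.sum(log_f)


def fit_skew_t1(z, start=(2.0, 2.0)):
    """Maximise the likelihood over (a, b) from a hypothetical starting point."""
    z = np.asarray(z, dtype=float)
    res = minimize(skew_t1_nll, np.log(start), args=(z,), method="Nelder-Mead")
    a_hat, b_hat = np.exp(res.x)
    return a_hat, b_hat, res


# Simulated sample: if U ~ Beta(a, b), the transform below follows the assumed density.
rng = np.random.default_rng(1)
a0, b0 = 5.0, 3.0
u = rng.beta(a0, b0, size=3000)
z = np.sqrt(a0 + b0) * (2.0 * u - 1.0) / (2.0 * np.sqrt(u * (1.0 - u)))
a_hat, b_hat, res = fit_skew_t1(z)
```

The estimates of $a$ and $b$ tend to be strongly correlated, which is consistent with the near-identical standard errors for $\widehat{a}$ and $\widehat{b}$ in Table 2.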

**Table 3.** Maximum likelihood estimates for the skewed t model of Type 2, standardized data. Standard errors are in parentheses.

| Stocks | Amadeus | BBVA | Mapfre | Repsol | Telefónica |
|---|---|---|---|---|---|
| $\widehat{a}$ | 1.050617 | 0.935678 | 0.8685684 | 0.808572 | 1.120804 |
| | (0.443072) | (0.407211) | (0.329048) | (0.363157) | (0.650834) |
| $\widehat{b}$ | 5.126098 | 7.144007 | 6.217545 | 2.998354 | 3.497796 |
| | (2.091549) | (5.104088) | (3.184219) | (0.917090) | (1.110353) |
| $\widehat{c}$ | 2.973896 | 3.653026 | 3.617017 | 2.721519 | 2.331579 |
| | (0.761879) | (0.733523) | (0.668503) | (0.698227) | (0.774010) |

| Stocks | Amadeus | BBVA | Mapfre | Repsol | Telefónica |
|---|---|---|---|---|---|
| Skewed t ${T}_{1}$ | 0.593 | 0.676 | 0.499 | 0.761 | 0.829 |
| Skewed t ${T}_{2}$ | 0.732 | 0.908 | 0.733 | 0.732 | 0.915 |

**Table 5.** Values at risk ${\mathrm{VaR}}_{T1}[0.05;a,b]$ and ${\mathrm{VaR}}_{T2}[0.05;a,b,c]$ for the five stocks considered.

| Stocks | Amadeus | BBVA | Mapfre | Repsol | Telefónica |
|---|---|---|---|---|---|
| ${\mathrm{VaR}}_{T1}$ | −0.024941 | −0.028328 | −0.028521 | −0.040059 | −0.029110 |
| ${\mathrm{VaR}}_{T2}$ | −0.023089 | −0.02817 | −0.027794 | −0.038330 | −0.028029 |
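A value at risk of the Type 1 kind is a quantile of the fitted return distribution. The sketch below rests on two assumptions that the tables do not state: that the Type 1 model is the skew t of Jones and Faddy (2003), whose quantiles follow from beta quantiles, and that the standardized quantile is mapped back to the return scale with the sample mean and standard deviation. The function names are hypothetical. The Amadeus inputs are the reported mean, standard deviation, $\widehat{a}$, and $\widehat{b}$.

```python
import numpy as np
from scipy.stats import beta


def skew_t1_quantile(p, a, b):
    """Quantile of the Jones-Faddy skew t via the Beta(a, b) quantile."""
    q = beta.ppf(p, a, b)
    return np.sqrt(a + b) * (2.0 * q - 1.0) / (2.0 * np.sqrt(q * (1.0 - q)))


def var_skew_t1(p, a, b, mean, sd):
    """VaR on the return scale, assuming data standardized by mean and sd."""
    return mean + sd * skew_t1_quantile(p, a, b)


# Amadeus: mean and standard deviation of daily returns, Type 1 estimates of a and b
var_amadeus = var_skew_t1(0.05, 6.194309, 6.171897, 0.000900, 0.014601)
```

Under these assumptions the result should land close to the Amadeus entry for ${\mathrm{VaR}}_{T1}$ in Table 5. Any remaining gap would point to a different standardization or rounding in the original computation.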

| IBEX35 quartile | Amadeus ${\widehat{\mu}}_{1j}$ | Amadeus ${\widehat{\mu}}_{2j}$ | BBVA ${\widehat{\mu}}_{1j}$ | BBVA ${\widehat{\mu}}_{2j}$ | Mapfre ${\widehat{\mu}}_{1j}$ | Mapfre ${\widehat{\mu}}_{2j}$ | Repsol ${\widehat{\mu}}_{1j}$ | Repsol ${\widehat{\mu}}_{2j}$ | Telefónica ${\widehat{\mu}}_{1j}$ | Telefónica ${\widehat{\mu}}_{2j}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| ${Q}_{1}$ | −0.323500 | 2.336970 | −0.493668 | 1.489340 | −0.481228 | 2.215213 | −0.430198 | 1.654110 | −0.509551 | 1.941169 |
| ${Q}_{2}$ | 0.025464 | 2.385443 | 0.094427 | 1.240531 | 0.060678 | 1.487417 | 0.056462 | 1.474321 | 0.034497 | 1.253373 |
| ${Q}_{3}$ | 0.280358 | 2.492818 | 0.503731 | 1.523203 | 0.439138 | 1.623399 | 0.405799 | 1.698132 | 0.433284 | 1.421627 |

**Table 7.** Parameters $({a}_{j},{b}_{j})$ of the conditional stock return distributions for given IBEX35 values.

| IBEX35 quartile | Amadeus ${\widehat{a}}_{j}$ | Amadeus ${\widehat{b}}_{j}$ | BBVA ${\widehat{a}}_{j}$ | BBVA ${\widehat{b}}_{j}$ | Mapfre ${\widehat{a}}_{j}$ | Mapfre ${\widehat{b}}_{j}$ | Repsol ${\widehat{a}}_{j}$ | Repsol ${\widehat{b}}_{j}$ | Telefónica ${\widehat{a}}_{j}$ | Telefónica ${\widehat{b}}_{j}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| ${Q}_{1}$ | 1.742906 | 2.136783 | 5.397247 | 6.904428 | 1.990996 | 2.690554 | 3.143186 | 4.048275 | 2.490487 | 3.398563 |
| ${Q}_{2}$ | 1.736629 | 1.709150 | 5.492494 | 5.226327 | 3.133551 | 3.019128 | 3.184576 | 3.076590 | 5.016808 | 4.924298 |
| ${Q}_{3}$ | 1.953575 | 1.639172 | 6.478997 | 5.008818 | 4.327494 | 3.356619 | 3.640170 | 2.851049 | 6.809153 | 5.483973 |

| IBEX35 | Amadeus | BBVA | Mapfre | Repsol | Telefónica |
|---|---|---|---|---|---|
| ${Q}_{1}$ | −0.037728 | −0.038910 | −0.044802 | −0.053801 | −0.043939 |
| ${Q}_{2}$ | −0.031199 | −0.027981 | −0.030278 | −0.041117 | −0.029327 |
| ${Q}_{3}$ | −0.026029 | −0.020886 | −0.022744 | −0.032601 | −0.021913 |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Sarabia, J.M.; Prieto, F.; Jordá, V.; Sperlich, S.
A Note on Combining Machine Learning with Statistical Modeling for Financial Data Analysis. *Risks* **2020**, *8*, 32.
https://doi.org/10.3390/risks8020032
