# Predicting the Critical Number of Layers for Hierarchical Support Vector Regression


## Abstract


## 1. Introduction

#### Previous Work

## 2. Methods

#### 2.1. Support Vector Regression

#### 2.2. Dynamic Mode Decomposition (DMD)

#### 2.3. Hierarchical Support Vector Regression

## 3. Predicting the Depth of Models

#### 3.1. Phase Transition of the Training Error

#### 3.2. Critical Scales: Intuition and the Fourier Transform

#### 3.3. Determining Scales with FFT

**Algorithm 1** Determining scales of the HSVR model

**Input:** $(x_i, y_i),\ i = 0, \dots, n-1$, where the $x_i$ are equidistant points in the domain and the $y_i$ are values of the function we want to model.

1. $dx = x[1] - x[0]$
2. $freq$ = FFT frequencies of the signal
3. $C = \mathrm{FFT}(y)$
4. $C = C / \max(|C|)$ # normalize coefficients by the largest magnitude
5. $freq_{support} = freq[\,|C| > 0.01\,]$
6. $scales = dx / (6 \cdot freq_{support})$
7. $scales$ = sort $scales$ in descending order
8. $scales = \mathrm{filter}(scales)$ # filtering given in Algorithm 2
9. **return** $scales$
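The FFT-based scale selection above can be sketched in Python. This is a minimal sketch, with NumPy's real FFT standing in for the FFT of the algorithm and the `support_tol` threshold of 0.01 mirroring step 5; the pruning of step 8 (Algorithm 2) is left out here so the raw sorted scales are visible.

```python
import numpy as np

def determine_scales(x, y, support_tol=0.01):
    """Sketch of Algorithm 1: estimate HSVR kernel scales from the FFT of y."""
    dx = x[1] - x[0]                               # equidistant grid spacing
    C = np.abs(np.fft.rfft(y))                     # magnitudes of Fourier coefficients
    C = C / C.max()                                # normalize by the largest magnitude
    freq = np.fft.rfftfreq(len(y))                 # frequencies in cycles per sample
    freq_support = freq[C > support_tol]           # frequencies with significant energy
    freq_support = freq_support[freq_support > 0]  # drop the DC component
    # heuristic of the paper: sigma = dx / (6 * frequency), coarsest scale first;
    # Algorithm 2 would then prune near-duplicate scales from this list
    return np.sort(dx / (6.0 * freq_support))[::-1]
```

For a pure sinusoid sampled over a whole number of periods, a single scale survives the threshold, so the predicted HSVR depth before filtering equals the number of significant Fourier frequencies.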

**Algorithm 2** Filtering scales

**Input:** $scales$ = vector of scales determined from the FFT, $decay$

1. $scales_{filtered} = [\,scales[0]\,]$
2. $n = \mathrm{len}(scales)$
3. **for** $i$ in range$(1, n)$:
4. &emsp;**if** $scales_{filtered}[-1] / scales[i] \ge decay$:
5. &emsp;&emsp;$scales_{filtered}.\mathrm{append}(scales[i])$
6. **return** $scales_{filtered}$
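The filtering step is short enough to state directly in Python; this sketch keeps a scale only when the previously kept scale is at least `decay` times larger, so near-duplicate layers are dropped.

```python
def filter_scales(scales, decay=2.0):
    """Sketch of Algorithm 2: prune scales that are too close together."""
    filtered = [scales[0]]          # always keep the coarsest scale
    for s in scales[1:]:
        # keep s only if the last kept scale is at least `decay` times larger
        if filtered[-1] / s >= decay:
            filtered.append(s)
    return filtered
```

With `decay = 2`, each retained layer operates at a scale at least twice as fine as the previous one, which bounds the number of HSVR layers by roughly $\log_2$ of the ratio between the coarsest and finest scales.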

**Algorithm 3** Train HSVR

**Input:** $(x_i, y_i),\ i = 0, \dots, n-1$; $scales$ (output of Algorithm 1)

1. $\epsilon = 0.01(\max_i(y_i) - \min_i(y_i))$
2. $r_0 = y = [y_0, \dots, y_{n-1}]$
3. model = [ ] # empty list to hold the SVR model at each layer
4. $m = \mathrm{len}(scales)$ # number of HSVR layers
5. **for** $i$ in range$(0, m)$:
6. &emsp;$\sigma_i = scales[i]$
7. &emsp;$C_i = 5(\max(r_i) - \min(r_i))$
8. &emsp;$svr_i$ = SVR fitted on $(x, r_i)$ with parameters $\sigma_i$, $C_i$ and tolerance $\epsilon$
9. &emsp;predictions = $svr_i$.predict($x$)
10. &emsp;$r_{i+1} = r_i -$ predictions
11. &emsp;model.append($svr_i$)
12. **return** model
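Using scikit-learn's `SVR` (the library cited for the experiments), the training loop can be sketched as below. The mapping from the kernel scale $\sigma$ to scikit-learn's `gamma` parameter, $\gamma = 1/(2\sigma^2)$, is an assumption of this sketch, as is returning the final residual alongside the layer list.

```python
import numpy as np
from sklearn.svm import SVR

def train_hsvr(x, y, scales):
    """Sketch of Algorithm 3: one SVR layer per scale, each layer
    fitted to the residual left by the previous (coarser) layer."""
    eps = 0.01 * (y.max() - y.min())   # epsilon-tube: 1% of the signal range
    r = y.astype(float).copy()         # residual, initialized to the signal itself
    X = x.reshape(-1, 1)
    model = []
    for sigma in scales:
        C = 5.0 * (r.max() - r.min())  # regularization set from the residual range
        # assumed mapping between the paper's sigma and sklearn's gamma
        svr = SVR(kernel="rbf", gamma=1.0 / (2.0 * sigma**2), C=C, epsilon=eps)
        svr.fit(X, r)
        r = r - svr.predict(X)         # residual handed to the next, finer layer
        model.append(svr)
    return model, r
```

The returned residual corresponds to the training error after the last layer; the model list holds one fitted SVR per scale, applied in sum at prediction time.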

#### 3.4. Determining Scales with Dynamic Mode Decomposition

**Algorithm 4** Estimating scales from data using Hankel DMD

**Input:** time step $\Delta x$; time series $f$ with $f[n] = f(\Delta x\, n)$; length of the time series $N$; residual tolerance $tol$; energy threshold $\eta$; $M$ = number of rows of the Hankel matrix

1. $H$ = Hankel matrix built from $f$ with $M$ rows and $N - M$ columns
2. $rez, \lambda, Vtn = \mathrm{DMD\_RRR}(H)$
3. $\omega = \frac{1}{2\pi i} \ln\left(\frac{\lambda}{|\lambda|}\right)$
4. $T = 0$
5. **for** $i = 0$ to $N - 1$:
6. &emsp;$E[i] = |\langle Y[:, 0], Vtn[:, i] \rangle|$
7. &emsp;$T = T + E[i]^2$
8. $T = \sqrt{T}$
9. $S_{DMD} = [\,]$
10. **for** $i = 0$ to $N - 1$:
11. &emsp;**if** $rez[i] < tol$ and $E[i] > \eta T$:
12. &emsp;&emsp;$S_{DMD}.\mathrm{append}(\omega[i])$
13. **return** $\sigma_{DMD} = \frac{\Delta x}{6 S_{DMD}}$
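A rough NumPy sketch of the Hankel-DMD route follows. Standard SVD-based exact DMD stands in for the refined Rayleigh–Ritz procedure DMD_RRR of the paper, so the residual test $rez[i] < tol$ is omitted; the rank cutoff, the `freq_cutoff` for discarding near-zero frequencies, and the default $\eta$ are all assumptions of this sketch.

```python
import numpy as np

def hankel_dmd_scales(f, dx, M, eta=1e-3, freq_cutoff=1e-8):
    """Sketch of Algorithm 4: scales from eigenfrequencies of a Hankel DMD
    of the time series f (exact DMD standing in for DMD_RRR)."""
    N = len(f)
    # Hankel matrix with M rows: column j is the window f[j : j + M]
    H = np.column_stack([f[j:j + M] for j in range(N - M)])
    X, Y = H[:, :-1], H[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > s[0] * 1e-10))           # numerical rank truncation
    U, s, Vh = U[:, :r], s[:r], Vh[:r, :]
    A = U.conj().T @ Y @ Vh.conj().T / s        # projected one-step operator
    lam, W = np.linalg.eig(A)
    omega = np.angle(lam) / (2.0 * np.pi)       # frequency in cycles per step
    modes = (Y @ Vh.conj().T / s) @ W           # exact DMD modes
    energy = np.abs(modes.conj().T @ H[:, 0])   # projection of the first snapshot
    T = np.linalg.norm(energy)
    support = np.abs(omega[energy > eta * T])   # energetic frequencies only
    support = support[support > freq_cutoff]    # discard (near-)zero frequencies
    return dx / (6.0 * support)                 # same sigma heuristic as the FFT route
```

For a pure sampled sinusoid the Hankel matrix has rank two, the two conjugate eigenvalues sit on the unit circle, and both recovered scales match the FFT heuristic $\Delta x / (6\,\omega)$.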

## 4. Results

#### 4.1. Explicitly Defined Functions

#### 4.2. ODEs

#### 4.3. Vorticity Data

#### 4.3.1. Doubly Periodic Data

There are in total 1201 snapshots with time step $dt = 0.03125$, resulting in a tensor of dimensions $128 \times 128 \times 1201$. For each fixed point in space, there is a signal with 1201 time steps. A few examples of such signals are shown in Figure 6.
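As a hypothetical illustration of this layout (the `vorticity` array below is a random stand-in for the actual simulation data), each spatial grid point yields one time series, and flattening the spatial dimensions gives one signal per model to train:

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the vorticity tensor: 128 x 128 spatial grid, 1201 snapshots
vorticity = rng.standard_normal((128, 128, 1201))
dt = 0.03125                              # time step between snapshots

# one row per spatial point: 128^2 = 16,384 signals of length 1201,
# each of which gets its own HSVR model
signals = vorticity.reshape(-1, 1201)
t = dt * np.arange(1201)                  # time axis shared by all signals
```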

#### 4.3.2. Non-Periodic Data

## 5. Discussion

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References


**Figure 2.** SVR results with different scales of the Gaussian kernel, $\sigma = 0.5, 0.05, 0.005, 0.0005$. A large kernel gives a smooth regression but cannot reconstruct the details; a small kernel overfits, generalizes poorly, and can be sensitive to noise.

**Figure 3.** Flowchart of the HSVR modeling process. The input data are first used to compute the scales of the HSVR model (see Algorithms 1, 2, and 4). At layer 0, an SVR model is trained at the coarsest scale ${\gamma}_{0}$. The residual is computed by taking the difference between the signal and the model. This residual is then modeled with an SVR model at the next coarsest scale ${\gamma}_{1}$. A new residual is computed by taking the difference of the old residual and the ${\gamma}_{1}$ SVR model. This process is repeated until the pre-computed scales are exhausted.

**Figure 4.** Residuals while training the HSVR models with decreasing $\sigma$, as shown on the x-axis. Both HSVR models exhibit a phase transition in their approximation error.

**Figure 5.** Fitting Gaussians to half-periods of sinusoids using the heuristic (27).

**Figure 6.** Examples of vorticity at three different spatial points for fluid simulations with doubly periodic boundary conditions.

**Figure 7.** Histograms of $error/\epsilon$ for models trained on vorticity data with doubly periodic boundary conditions. $error$ is the model error given by (24) (with $i = L$) and $\epsilon$ is given by (35). In total $128^2 = 16{,}384$ models were trained. The count on the vertical axis is the number of models that fell into the corresponding bin.

**Figure 9.** Histograms of $error/\epsilon$ for models trained on vorticity data with non-periodic boundary conditions. $error$ is the model error given by (24) (with $i = L$) and $\epsilon$ is given by (35). In total $359 \times 279 = 100{,}161$ models were trained. The count on the vertical axis is the number of models that fell into the corresponding bin.

**Table 1.** Results for explicitly defined functions, using scales determined from FFT with decay 2 and $\epsilon$ given by (35). For $e^x$, DMD did not output frequencies different from 0 (entries denoted by * in the corresponding row).

| Function | $\epsilon$ | Predicted # of Layers (FFT) | Error (FFT) | Predicted # of Layers (DMD) | Error (DMD) |
|---|---|---|---|---|---|
| $\sin(2\pi x)$ | 0.02 | 1 | 0.02 | 1 | 0.02 |
| $\sin(20\pi x)$ | 0.0199 | 1 | 0.021 | 1 | 0.02 |
| $\sin(200\pi x)$ | 0.019 | 1 | 0.093 | 1 | 0.097 |
| $100\sin(20\pi x)$ | 1.99 | 1 | 2 | 1 | 2.01 |
| $40\cos(2\pi x)$ | 0.8 | 1 | 0.8 | 1 | 0.8 |
| $100\cos(20\pi x)$ | 2 | 1 | 2.03 | 1 | 2 |
| $\sin(2\pi x^2)$ | 0.0199 | 5 | 0.02 | 1 | 0.02 |
| $x + x^2 + x^3$ | 0.14 | 2 | 0.14 | 1 | 8 |
| $e^x$ | 0.063 | 1 | 0.064 | * | * |
| $x + \sin(2\pi x^4)$ | 0.03 | 7 | 0.037 | 1 | 0.034 |
| $\cos(2\pi x) + \sin(20\pi x)$ | 0.0397 | 2 | 0.0404 | 2 | 0.042 |
| $\cos(20\pi x)\sin(15\pi x)$ | 0.02 | 2 | 0.021 | 2 | 0.022 |
| $\cos(32\pi x)^3$ | 0.0199 | 1 | 0.022 | 2 | 0.022 |
| $\sin(13\pi x) + \sin(17\pi x) + \sin(19\pi x) + \sin(23\pi x)$ | 0.076 | 1 | 0.077 | 1 | 0.077 |
| $\sin(50\pi x)\sin(20\pi x)\cos(15\pi x)$ | 0.0187 | 3 | 0.02 | 2 | 0.02 |
| $\sin(40\pi x)\cos(10\pi x) + 3\sin(20x)\sin(40x)$ | 0.064 | 5 | 0.065 | 3 | 0.066 |
| $\sin(2x)\cos(32x)$ | 0.0198 | 5 | 0.02 | 1 | 0.02 |

| Function | $\epsilon$ | Predicted # of Layers (FFT) | Error (FFT) | Predicted # of Layers (DMD) | Error (DMD) |
|---|---|---|---|---|---|
| x(t) | 0.314 | 6 | 0.325 | 2 | 0.324 |
| y(t) | 0.408 | 6 | 0.469 | 2 | 0.469 |
| z(t) | 0.468 | 5 | 0.485 | 2 | 0.494 |

|  | $\epsilon$ | Predicted # of Layers (FFT) | Error (FFT) | Predicted # of Layers (DMD) | Error (DMD) |
|---|---|---|---|---|---|
| min | 0.0199 | 6 | 0.038 | 1 | 0.0399 |
| mean | 0.0354 | 8 | 0.082 | 3 | 0.148 |
| max | 0.0488 | 9 | 0.284 | 5 | 2.463 |

|  | $\epsilon$ | Predicted # of Layers (FFT) | Error (FFT) | Predicted # of Layers (DMD) | Error (DMD) |
|---|---|---|---|---|---|
| min | 0.019 | 5 | 0.0006 | 2 | 0.0006 |
| mean | 0.035 | 7 | 0.0975 | 3 | 0.197 |
| max | 0.048 | 9 | 0.667 | 7 | 6.137 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Mohr, R.; Fonoberova, M.; Drmač, Z.; Manojlović, I.; Mezić, I. Predicting the Critical Number of Layers for Hierarchical Support Vector Regression. *Entropy* **2021**, *23*, 37.
https://doi.org/10.3390/e23010037
