Predicting the Critical Number of Layers for Hierarchical Support Vector Regression

Hierarchical support vector regression (HSVR) models a function from data as a linear combination of SVR models at a range of scales, starting at a coarse scale and moving to finer scales as the hierarchy continues. In the original formulation of HSVR, there were no rules for choosing the depth of the model. In this paper, we observe in a number of models a phase transition in the training error—the error remains relatively constant as layers are added, until a critical scale is passed, at which point the training error drops close to zero and remains nearly constant for added layers. We introduce a method to predict this critical scale a priori with the prediction based on the support of either a Fourier transform of the data or the Dynamic Mode Decomposition (DMD) spectrum. This allows us to determine the required number of layers prior to training any models.


Introduction
Many machine learning algorithms require the correct choice of hyperparameters to best describe the given data. One of the most popular methods for choosing hyperparameters is grid search: the model is trained at a set of points in hyperparameter space, called a grid, and the model with the lowest error on the validation set is chosen. Other methods use alternative optimization algorithms, such as genetic algorithms. All of these methods involve training a model and calculating its error multiple times, which can be very expensive if the hyperparameter space is large.
The purpose of this paper is to develop a method for determining hyperparameters for Support Vector Regression (SVR) with Gaussian kernels from time-series data only. Unlike other approaches to choosing hyperparameters, our method identifies a set of hyperparameters without needing to train models and perform a grid search (or execute some other hyperparameter optimization algorithm), thereby bypassing a potentially costly step. The proposed method identifies the inherent scale and complexity of the data and adapts the SVR accordingly. In particular, we give a method for determining the scale of the Gaussian kernel by connecting it to the most important frequencies of the signal, as determined by either a Fourier transform or Dynamic Mode Decomposition. Thus we leverage classical and generalized harmonic analysis to inform the choice of hyperparameters in modern machine learning algorithms. A pertinent question is: why not just use the FFT? The answer is that HSVR is better suited to model strongly locally varying data, whereas, for the FFT to be efficient, the data need to possess some symmetry, such as space or time translation invariance.

Previous Work
There are many different approaches to tuning SVR hyperparameters so as to reduce the generalization error. Popular estimates of the generalization error include the leave-one-out (LOO) score and the k-fold cross-validation score. They are easy to implement, but calculating them requires training additional models for each combination of hyperparameters to be tested. This can be prohibitively expensive, so other error estimates, which are easier to calculate, were developed. These include the Xi-Alpha bound [1], the generalized approximate cross-validation [2], the approximate span bound [3], the VC bound [3], the radius-margin bound [3], and the quality functional of the kernel [4].
The simplest method for choosing hyperparameters is to perform a grid search over the hyperparameter space and choose the combination with the lowest error estimate. However, grid search suffers from the curse of dimensionality, scaling exponentially with the dimension of the hyperparameter configuration space. Other work in hyperparameter optimization (HPO) seeks to mitigate this problem. Random search samples the configuration space and can serve as a good baseline [5,6]. Bayesian optimization, often using Gaussian processes as surrogate functions, offers a more sample-efficient algorithm than grid search or random search, requiring fewer attempts to find the optimal parameters [6]. However, Bayesian optimization in this form requires more computational resources [6].
There are also gradient-based approaches [7][8][9], as well as several derivative-free optimization methods: a pattern search methodology for hyperparameters [10], a parameter optimization method based on simulated annealing [11], a Bayesian method based on MCMC for estimating kernel parameters [12], and methods based on genetic algorithms [13][14][15].
What all of these methods have in common is that they require training a model and evaluating the error for each parameter set to be tried. This can be quite expensive, both in time and computational resources, if there is a lot of data or we want to train multiple models at the same time. They are also iterative, which means that we usually do not know how many models will be trained before obtaining an estimate of the best hyperparameters. There are techniques, such as early cutoff, that try to reduce these burdens, but fundamentally a model must be trained for each set of candidate hyperparameters.
In contrast, our approach gives a set of hyperparameters without ever computing a single model. Specifically, in this paper, we are interested in modeling multiscale signals with a linear combination of SVRs. Two fundamental questions are (1) how many SVR models should be used? and (2) what should the scale be for each SVR model? Clearly, the dimension of the configuration space for the scale parameters is conditional on the number of layers. Our methods are built on fast spectral methods, namely the fast Fourier transform and Dynamic Mode Decomposition, and return both the number of layers (the number of SVR models) and the scales for each SVR without ever having to train a model. Bypassing the expensive step of computing a model for each candidate set of hyperparameters can lead to dramatic savings in computational time and resources.

Support Vector Regression
Support Vector Machines (SVM) were introduced in [16] as a method for classification. The goal is to fit a hyperplane between classes, where the hyperplane is defined by a linear combination of a subset of the training set, called the Support Vectors. The problem is formulated as a convex quadratic optimization problem, so it has a unique solution. The limitation of the basic SVM is that it can only find a linear boundary between classes, which is often not sufficient. The trick is to map the training data to a higher-dimensional space and then use kernel functions to represent the inner product of two data vectors projected onto that space; only inner products are then needed for finding the best parameters. The advantage of this approach is that we can implicitly map data onto an infinite-dimensional space. One of the most used kernels is the Gaussian kernel, defined by

K(x, x') = exp(−‖x − x'‖² / σ²),     (1)

where x and x' are n-dimensional feature vectors, σ² is the variance of the Gaussian function, and γ = 1/σ² is the scale parameter that is usually specified as an input parameter to SVR toolboxes. The SVM approach was extended in [17] to regression problems, where it is called Support Vector Regression (SVR). Let S = {(x_1, y_1), . . . , (x_n, y_n)} be the training set, where x_i is a vector in the input space X ⊂ R^D and y_i ∈ R is the desired output. The aim of SVR is to find a regression function f : X → R,

f(x) = ⟨ω, Φ(x)⟩ + b,

where ω is the weight vector, Φ is a mapping of the data points to a higher-dimensional space, and b is a threshold constant. ω and b can be found by solving the following optimization problem:

min_{ω, b, E⁺, E⁻}  ½‖ω‖² + C ∑_i (E_i⁺ + E_i⁻)
subject to  y_i − ⟨ω, Φ(x_i)⟩ − b ≤ ε + E_i⁺,
            ⟨ω, Φ(x_i)⟩ + b − y_i ≤ ε + E_i⁻,
            E_i⁺, E_i⁻ ≥ 0.

The parameter ε determines the width of the tube around the regression curve; points inside it do not contribute to the loss function. The parameter C adjusts the trade-off between the regression error and regularization, and E⁺ and E⁻ are slack variables that relax the approximation constraints and measure the distance of each data point from the tube.
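As a concrete illustration of these definitions, the following minimal sketch fits a single-scale SVR with scikit-learn (the library used later in the paper). The signal and the scale are illustrative choices; C and ε are set in the spirit of the choices described later (5 times and 1% of the signal's range, respectively).

```python
import numpy as np
from sklearn.svm import SVR

x = np.linspace(0, 2, 201).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel()         # 1 cycle per unit of x

sigma = 1.0 / 6.0                 # kernel scale, sigma = 1/(6 f) for f = 1
gamma = 1.0 / sigma**2            # scikit-learn's input parameter
eps = 0.01 * (y.max() - y.min())  # tube width: 1% of the signal's range
C = 5 * (y.max() - y.min())       # trade-off parameter

model = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=eps)
model.fit(x, y)

# Points with nonzero |alpha+ - alpha-| are the support vectors
print("support vectors:", len(model.support_))
print("max training error:", np.max(np.abs(model.predict(x) - y)))
```

Note that scikit-learn's SVR takes γ, not σ; the conversion γ = 1/σ² mirrors Equation (1).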
In practice, the dual problem is solved, which can be written as:

max_{α⁺, α⁻}  −½ (α⁺ − α⁻)ᵀ K (α⁺ − α⁻) − ε ∑_i (α_i⁺ + α_i⁻) + ∑_i y_i (α_i⁺ − α_i⁻)
subject to  ∑_i (α_i⁺ − α_i⁻) = 0,   0 ≤ α_i⁺, α_i⁻ ≤ C,

where α⁺, α⁻ ∈ R^n are the dual variables and K ∈ R^{n×n} is the kernel matrix, with entries K_{ij} = K(x_i, x_j) evaluated from the kernel function. Solving this problem, the regression function becomes

f(x) = ∑_i (α_i⁺ − α_i⁻) K(x_i, x) + b.

The coefficients satisfy the conditions 0 ≤ α_i⁺, α_i⁻ ≤ C and α_i⁺ α_i⁻ = 0. Data points for which |α_i⁺ − α_i⁻| is non-zero are called Support Vectors.

Dynamic Mode Decomposition (DMD)
Dynamic Mode Decomposition (DMD) was introduced in [18] as a method for extracting dynamic information from flow fields that are either generated by numerical simulation or measured in a physical experiment. Rowley et al. connected DMD with Koopman operator theory [19].
Let the data be expressed as a series of snapshots, given by the matrix

V₁ᴺ = [v_1, v_2, . . . , v_N],

where v_i ∈ R^m stands for the i-th snapshot of the flow field. We assume there exists a linear mapping A which relates each snapshot v_i to the next one, v_{i+1}, and that this mapping is approximately the same over each sampling interval, so that approximately

v_{i+1} = A v_i,   i = 1, . . . , N − 1.

We assume that the characteristics of the system can be described by the spectral information in A. This information is extracted in a data-driven manner from the snapshot sequence; the idea is to use the snapshots to construct an approximation of A. In [18], this was done as follows. Define

X = [v_1, . . . , v_{N−1}],   Y = [v_2, . . . , v_N].

Let the singular value decomposition of X be X = UΣV*. Then a representation of a compression of A is defined as

Ã = U* Y V Σ⁺,     (14)

where Σ⁺ is the Moore-Penrose pseudo-inverse of Σ. Note that (14) is analytically equivalent to U*AU, the compression of A to the subspace spanned by the columns of X, if Y = AX. Eigenvectors and eigenvalues of Ã are approximations of eigenvectors and eigenvalues of the Koopman operator. The Koopman operator is an infinite-dimensional linear operator K that acts on scalar functions g on M as

(Kg)(x) = g(f(x)),

where f is a dynamical system such that x_{k+1} = f(x_k). Let λ_j and φ_j be its eigenvalues and eigenfunctions, i.e.,

K φ_j = λ_j φ_j.

Let g(x) : M → R^p be a vector of any quantities of interest. If g lies in the span of the φ_j, then it can be written as

g(x) = ∑_j φ_j(x) v_j.

Then we can express g(x_k) as

g(x_k) = ∑_j λ_jᵏ φ_j(x_0) v_j.

The Koopman eigenvalues λ_j characterize the temporal behaviour of the corresponding Koopman mode v_j: the phase of λ_j determines its frequency, and the magnitude determines the growth rate.
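The compression (14) can be sketched in a few lines of numpy. The linear map below is a toy choice whose eigenvalues (0.9 ± 0.2i) the procedure should recover from the snapshots alone:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, -0.2], [0.2, 0.9]])   # "true" linear map (eigs 0.9 +/- 0.2i)

v = rng.standard_normal(2)
snapshots = [v]
for _ in range(30):                        # v_{i+1} = A v_i
    snapshots.append(A @ snapshots[-1])

V = np.column_stack(snapshots)
X, Y = V[:, :-1], V[:, 1:]                 # shifted snapshot matrices

U, s, Vh = np.linalg.svd(X, full_matrices=False)
A_tilde = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)   # U* Y V Sigma^+

eigvals = np.linalg.eigvals(A_tilde)       # approximate the eigenvalues of A
print(np.sort_complex(eigvals))
```

Since here Y = AX exactly and X has full row rank, Ã = U*AU is similar to A and the eigenvalues are recovered up to round-off.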

Hierarchical Support Vector Regression
Classical SVR models, which use kernels of a single scale, have difficulties approximating multiscale signals. For example, consider the function f(x) = x + sin(2πx⁴), for x ∈ [0, 2], whose graph is given in Figure 1. This function has a continuum of scales. Figure 2 highlights the difficulties that classical single-scale SVR has in modeling such signals. Using too large a scale σ, the detailed behavior of the data set is not captured [20]; such a model may, however, be useful for a coarse-scale extrapolation outside the training set. Models employing a very small scale σ can capture the training set in detail. However, this makes them very sensitive to noise in the training set and severely limits the model's ability to generalize outside the training set [20]. In [20], the authors introduced a multiscale variant of Support Vector Regression which they termed Hierarchical Support Vector Regression (HSVR). The idea behind HSVR is to train multiple SVR models, organized as a hierarchy of layers, each with a different scale σ. The HSVR model is then the sum of those individual models, which we write as

S(x) = ∑_ℓ a_ℓ(x; σ_ℓ, ε),     (19)

where the sum runs over the L layers and a_ℓ(x; σ_ℓ, ε) is the SVR model on layer ℓ with a Gaussian kernel of scale σ_ℓ. Each SVR layer realizes a reconstruction of the target function at a certain scale. Training the HSVR model proceeds from coarser scales to finer scales as follows.
The coarsest-scale model a_0(x; σ_0, ε) is trained on the signal itself, giving the first residual r_1(x) = y(x) − a_0(x; σ_0, ε). We then proceed inductively for ℓ ≥ 1: given the residual r_ℓ(x), train a model a_ℓ(x; σ_ℓ, ε) to approximate it and compute the new residual

r_{ℓ+1}(x) = r_ℓ(x) − a_ℓ(x; σ_ℓ, ε).

Graphically, the process looks like Figure 3. The HSVR model contains a number of hyperparameters that need to be specified, namely ε, C, the number of layers L, and the specific scales σ_ℓ for those layers. In [20], the authors chose an exponential decay relationship between the scales, σ_{ℓ+1} = σ_ℓ / decay. This, however, still leaves the critical choices of σ_0 and L unspecified. In all of our experiments that follow, we choose ε to be 1 percent of the variation of the signal.
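The hierarchical training loop above can be sketched as follows. The target function, the number of layers, and the initial scale are illustrative choices; C and ε follow the settings described later in the paper (5 times the residual's range and 1% of the signal's range, respectively):

```python
import numpy as np
from sklearn.svm import SVR

x = np.linspace(0, 2, 501).reshape(-1, 1)
y = x.ravel() + np.sin(2 * np.pi * x.ravel() ** 2)   # a multiscale signal

eps = 0.01 * (y.max() - y.min())   # tolerance: 1% of the signal's range
sigma, decay, n_layers = 2.0, 2.0, 8
residual, model = y.copy(), []

for _ in range(n_layers):
    C = 5 * (residual.max() - residual.min())
    svr = SVR(kernel="rbf", gamma=1.0 / sigma**2, C=C, epsilon=eps)
    svr.fit(x, residual)
    residual = residual - svr.predict(x)   # pass the residual to the next layer
    model.append(svr)
    sigma /= decay                         # refine the scale

prediction = sum(svr.predict(x) for svr in model)
print("max error:", np.max(np.abs(y - prediction)))
```

Each layer only sees what the coarser layers failed to explain, so the fine-scale layers spend their capacity on the rapidly oscillating part of the signal.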

Determining the Number of Layers and Scales
For each layer ℓ, C_ℓ was specified as

C_ℓ = 5 (max r_ℓ − min r_ℓ),

which was the choice given in [20]. Additionally, the decay variable was chosen to be 2, so that σ_{ℓ+1} = σ_ℓ / decay. Python's scikit-learn library [21] was used throughout this paper. Its implementation of SVR requires the input parameter γ, rather than σ from (1). These two parameters are related as γ = 1/σ². Thus, the equivalent decay rate of the input parameter is

γ_{ℓ+1} = decay² · γ_ℓ = 4 γ_ℓ.

In the next sections, we take up the task of efficiently determining the hyperparameters L and σ_0 without the expensive step of performing grid search or training any models.
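The σ-to-γ conversion is a one-liner but a common source of off-by-a-square errors; the short check below (with illustrative values) confirms that halving σ quadruples scikit-learn's γ:

```python
sigma0, decay, layers = 1.0, 2.0, 5
sigmas = [sigma0 / decay**l for l in range(layers)]
gammas = [1.0 / s**2 for s in sigmas]

# Halving sigma multiplies gamma by decay**2 = 4 at every layer
for g, g_next in zip(gammas, gammas[1:]):
    assert g_next == decay**2 * g
print(gammas)   # [1.0, 4.0, 16.0, 64.0, 256.0]
```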

Predicting the Depth of Models
In this section, we describe how the error changes while training an HSVR model, and we provide methods for estimating the number of layers of such a hierarchical model.

Phase Transition of the Training Error
For the rest of the paper, we have a training set {(x_train, y_train)}, a testing set {(x_test, y_test)}, and the model S(x) = ∑_ℓ a_ℓ(x; σ_ℓ), as in (19). Let S_i(x) denote the model truncated to its first i layers, for i = 1, . . . , L. For each layer, the HSVR (prediction) error is calculated as

r_i = max |y_test − y_i,pred|,

where we have denoted y_i,pred = S_i(x_test). Therefore, r_i denotes the maximum error between the true signal at the test points and an HSVR model with i layers. By examining the change in the values r_i, it can be seen that there is a sudden drop in error, down almost to the tolerance ε, after which adding additional layers cannot reduce the error any more. We will call the scale at which this drop occurs the critical sigma and denote it by σ_c. The drop in error can be seen in Figure 4. Figure 4. Residuals while training the HSVR model with decreasing σ, as shown on the x-axis. Both HSVR models exhibit a phase transition in their approximation error.
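Locating the phase transition from a sequence of per-layer errors is straightforward. The error values below are synthetic, shaped like the curves in Figure 4 (roughly flat, then a sudden drop toward the tolerance); the factor 2 is an illustrative threshold:

```python
import numpy as np

# Synthetic per-layer errors r_1..r_7: flat, then a sudden drop toward eps
eps = 0.01
errors = np.array([0.9, 0.85, 0.8, 0.78, 0.019, 0.015, 0.012])

# First layer whose error is within a small factor of eps (1-based index)
critical = int(np.argmax(errors < 2 * eps)) + 1
print("phase transition at layer", critical)   # layer 5
```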

Critical Scales: Intuition and the Fourier Transform
Since the SVR models fit the data with Gaussians, a heuristic for choosing the scale will first be shown on a basic example of a periodic function and then extended to more general cases in the next sections.
Consider the signal sin(2π f x), where x is in meters and f is the frequency in cycles per meter. The frequency will be related to the scale σ of the Gaussian. To relate the frequency f to the scale σ, we use the heuristic that three standard deviations of the Gaussian should cover half of the period. That is, we want

3σ = T/2.

Since T = 1/f, then

σ = 1/(6 f).
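The heuristic is small enough to state as a function; the frequency value passed in below is an illustrative choice:

```python
def scale_from_frequency(f):
    """Heuristic scale: 3*sigma = T/2 with T = 1/f, so sigma = 1/(6*f)."""
    return 1.0 / (6.0 * f)

print(scale_from_frequency(1.0))   # 0.1666... for a 1 cycle-per-meter signal
```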

Determining Scales with FFT
We assume that an HSVR model can be learned with scales σ corresponding to the important frequencies in the FFT of the signal; these frequencies tell us how to train the model. If there are many frequencies in the FFT of the signal, we assume that more HSVR layers are needed, with each layer learning the most dominant scale at its level.
Our assumption is that the required number of scales in the HSVR model can be determined from the data: if the FFT of the signal contains many frequencies with relatively large coefficients, the signal will be more challenging for a single SVR to model. As in [20], we refine the scales using exponential decay, and we use only frequencies whose FFT coefficients are large enough, in order to avoid numerical problems and the addition of insignificant frequencies. The procedure is summarized in Algorithm 1, the filtering of scales in Algorithm 2, and the model building in Algorithm 3, whose core loop over layers i is: set C_i = 5(max(r_i) − min(r_i)); fit svr_i on (x, r_i) with parameters σ_i, C_i and tolerance ε; set predictions = svr_i.predict(x) and r_{i+1} = r_i − predictions; append svr_i to the model; finally, return the model.
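A minimal sketch of the FFT-based scale selection, in the spirit of Algorithms 1-2: keep the frequencies whose coefficients carry a noticeable fraction of the spectrum, then map each to a scale via σ = 1/(6 f). The two-tone test signal and the 5% threshold are illustrative choices:

```python
import numpy as np

dx = 0.01
x = np.arange(0, 2, dx)
y = np.sin(2 * np.pi * 3 * x) + 0.3 * np.sin(2 * np.pi * 20 * x)

coeffs = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=dx)                 # cycles per unit of x

keep = (coeffs > 0.05 * coeffs.max()) & (freqs > 0)   # drop weak/DC content
support = freqs[keep]
scales = sorted((1.0 / (6.0 * f) for f in support), reverse=True)
print("frequency support:", support)
print("scales:", scales)
```

For this signal the support is exactly the two tones (3 and 20 cycles per unit), yielding one coarse and one fine scale, ordered coarse-to-fine as the hierarchy expects.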

Determining Scales with Dynamic Mode Decomposition
Let (x_i, y_i), i = 0, . . . , n, be the training set. The output data y_i are organized into a Hankel matrix Y with M rows and N columns. We extract the relevant frequencies with Hankel DMD, described in [22], using the DMD_RRR algorithm described in [23]. DMD_RRR returns the residuals (rez), which indicate how accurately the eigenvalues are computed, the eigenvalues (λ), and the eigenvectors (Vtn). As before, suppose we have a signal f(x_n) = f(nδx), n = 0, . . . , N − 1.
We map this scalar-valued function into a higher-dimensional space by delay-embedding. We choose M < N. The delay-embedding of the signal is the matrix H whose j-th column is

h_j = (f(x_j), f(x_{j+1}), . . . , f(x_{j+M−1}))ᵀ,

for j = 0, . . . , N − M.
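The delay-embedding is easy to build with numpy; the toy signal length and embedding depth below are illustrative choices:

```python
import numpy as np

signal = np.arange(10.0)    # stand-in for f(x_0), ..., f(x_9), so N = 10
M = 4                       # delay-embedding depth, M < N
N = len(signal)
H = np.column_stack([signal[j:j + M] for j in range(N - M + 1)])

# H is constant along anti-diagonals: H[i+1, j-1] == H[i, j]
assert H[1, 2] == H[0, 3]
print(H.shape)   # (4, 7)
```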
We define a generalized Hankel matrix as an m × n rectangular matrix whose entries satisfy

H_{i+k, j−k} = H_{i, j}

for all indices such that 0 ≤ i, i + k ≤ m − 1 and 0 ≤ j, j − k ≤ n − 1, where k ∈ Z.
In simpler terms, a generalized Hankel matrix is a rectangular matrix that is constant on anti-diagonals. A generalized Hankel matrix can also be thought of as a submatrix of a larger, regular Hankel matrix. Clearly, the delay-embedding of the scalar signal, Equation (29), is an example of a generalized Hankel matrix, which is why it was denoted by H.
The input matrices for the DMD algorithm are X and Y, where X consists of the first N − M columns of H and Y of the last N − M columns. Frequencies are calculated by scaling the DMD eigenvalues so that they lie on the unit circle and then extracting the frequency of the resulting complex exponential:

exp(i 2π f_λ ∆x) = λ/|λ|,  i.e.,  ω_λ := f_λ ∆x = arg(λ)/(2π).

Let Ω contain the values ω_λ. These are directly analogous to the frequencies computed using the FFT. We replace the support of the FFT with the support of the DMD frequencies as follows: we take all the values of Ω whose corresponding residual is less than some specified tolerance, tol, and whose corresponding mode's norm is greater than some percentage, η, of the total power of the modes. In other words, we only consider the frequencies which were calculated accurately enough and whose modes give a significant contribution to the signal; the mode energy is analogous to the modulus of an FFT coefficient.
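The frequency extraction step can be sketched directly; the two eigenvalues below are illustrative choices (a damped and a neutral oscillation):

```python
import numpy as np

# Two illustrative DMD eigenvalues: a damped and a neutral oscillation
eigvals = np.array([0.95 * np.exp(2j * np.pi * 0.10),
                    1.00 * np.exp(2j * np.pi * 0.25)])

# Normalize to the unit circle, then read off the frequency in cycles/sample
omega = np.angle(eigvals / np.abs(eigvals)) / (2 * np.pi)
print(omega)   # approximately [0.10, 0.25]
```

Note that normalizing to the unit circle discards the growth/decay information in |λ|, which is exactly the limitation discussed later for non-oscillatory signals.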
For each mode Vtn[:, i], the energy is the coefficient c_i for which ‖Y[:, 0] − c_i Vtn[:, i]‖₂ is minimal. The total power is defined as the sum of the mode energies |c_i| (Equation (33)). As in determining σ_c using the FFT, the frequency support S_DMD is defined as

S_DMD = {ω_λ ∈ Ω : rez[i] < tol and |c_i| > η · total power},

where rez[i] is the residual corresponding to the i-th mode. These considerations are summarized in Algorithm 4.

Results
In this section, the results of our methods are provided. We demonstrate our methods on explicitly defined functions, a system of ODEs, and finally on vorticity data from fluid mechanics simulations. In all cases, ε is given by (35), i.e., 1% of the range of the training data.

Explicitly Defined Functions
We model explicitly defined functions f(x) on [0, 2]. The dataset consists of 2001 equidistant points in that interval, with every other point used for the training set. The error is calculated as the maximum absolute difference between the prediction and the actual value (Equation (24)). Results are summarized in Table 1. The parameter ε is specified as above. Each method predicts a certain number of layers required to push the model error close to ε. As seen in the table, both the FFT and DMD approaches produce models with similar errors (near ε). Although the DMD approach results in models with fewer layers in general, it fails for polynomials and the exponential function. This is because we normalize the eigenvalues to the unit circle. An extension of this could use the modulus |λ| as well as ω in e^{λ+iω}.
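The experimental setup just described can be sketched as follows, using the multiscale example of Figure 1 as the target function:

```python
import numpy as np

x = np.linspace(0, 2, 2001)
y = x + np.sin(2 * np.pi * x**4)      # the multiscale example of Figure 1

x_train, y_train = x[::2], y[::2]     # every other point for training
x_test, y_test = x[1::2], y[1::2]     # the rest for testing

eps = 0.01 * (y_train.max() - y_train.min())   # tolerance: 1% of the range
print(len(x_train), len(x_test))      # 1001 1000
```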

ODEs
We demonstrate the methods on modeling solutions of the Lorenz system of ODEs with initial conditions x_1(0) = 1.0, x_2(0) = 1.0, x_3(0) = 1.0, on [0, 10]. The system was solved at 500 equidistant time steps, which served as the training set; a solution at 2000 equidistant time steps was then used as the test set. Each of the coordinates x(t), y(t), z(t) was regarded as a separate signal to model. All hyperparameters are determined as before. Results are summarized in Table 2. For each coordinate, we see that the error drops nearly to ε, but the DMD approach results in models with fewer layers. Table 1. Results for explicitly defined functions, using scales determined from FFT with decay 2 and ε given by (35); for e^x, DMD did not output frequencies different from 0 (entries denoted by * in the corresponding row).
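The ODE setup can be sketched with scipy. The text does not list the Lorenz parameters, so the standard values (σ = 10, ρ = 28, β = 8/3) are assumed here:

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = s
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t_train = np.linspace(0, 10, 500)
sol = solve_ivp(lorenz, (0, 10), [1.0, 1.0, 1.0], t_eval=t_train, rtol=1e-8)

# Each coordinate x(t), y(t), z(t) is treated as a separate signal to model
signals = sol.y
print(signals.shape)   # (3, 500)
```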

Vorticity Data
This data was provided by the Georgia Institute of Technology [24,25]. It contains the vorticity of a fluid field in a computational box. For each point in space, we want to model the vorticity at that point as time evolves. The vorticity at every other time step is used for training, and we test our models on the rest of the time steps. We analyzed doubly periodic and non-periodic data. For each of these signals, every other point in time is used for training the HSVR model, which results in a total of 128 × 128 models. Results are summarized in Table 3. For each model, the error is calculated as in (24) with i = L, where L is the number of layers. Since ε is given by (35), i.e., 1% of the range of the training data, the ratio error/ε measures the error as a percentage of the range of the signal. A ratio of 2 would imply that the maximum error of the model over the test set was only 2% of the range, i.e.,

error = 2ε = 0.02 (max{y_train} − min{y_train}).

Histograms of these ratios for both FFT and DMD are shown in Figure 7. Most models produced an error close to the threshold (a perfect match would give a ratio of 1).

Non-Periodic Data
The data contain the vorticity on a 359 × 279 equidistant spatial grid with step dx = dy = 0.05. No assumption of periodicity of the boundary conditions was made. There are in total 1000 snapshots with time step dt = 1 ms, resulting in a 359 × 279 × 1000 tensor. As with the dataset with periodic boundary conditions, an HSVR model is trained for each fixed point in space, which results in 359 × 279 models. The errors and ε's are calculated as before. A few examples of such signals are shown in Figure 8. Results are summarized in Table 4, and histograms of the ratios error/ε for FFT and DMD are shown in Figure 9. For both the doubly periodic and non-periodic data, we can see that a large majority of the models have a ratio between 1 and 2, which means the maximum error of a majority of the models satisfies

error ≤ 2ε = 0.02 (max{y_train} − min{y_train}),

which is less than 2% of the range of the training data. For the DMD approach, the majority of models have ratios less than 10, which corresponds to a maximum error of

error ≤ 10ε = 0.1 (max{y_train} − min{y_train}),

which is less than 10% of the range of the training data. From Tables 3 and 4, we can see that DMD estimates fewer layers, but with a larger error after training. (a) Scales determined by FFT. (b) Scales determined by DMD. Figure 9. Histograms of error/ε for models trained on vorticity data with non-periodic boundary conditions; error is the model error given by (24) (with i = L) and ε is given by (35). There were 359 × 279 = 100,161 total models trained. The count on the vertical axis is the number of models that fell into the corresponding bin.

Discussion
Both approaches (FFT and DMD) to estimating the number of layers required by HSVR to push the modeling error close to the threshold show promise and allow computation of the HSVR depth a priori. While the DMD approach often predicted a smaller number of layers, usually with comparable error, our choice of unit circle normalization prevented it from performing well on functions without oscillation. Furthermore, the DMD algorithm itself comes from the dynamical systems community and requires that the domain of the signal (the x variables) be strictly ordered. For the explicitly defined functions we considered, there was a strict spatial ordering since the domains of the functions were intervals on the real line. For the vorticity data, the time signals at each spatial point were to be modeled and therefore the data points could be strictly ordered in time. For functions whose domain is multi-dimensional, say R 2 , there is no strict ordering and the DMD approach, as formulated here, would break down. The method based on the Fourier transform seems to have more promise in analyzing multivariable, multiscale signals, as it can compute multidimensional wave vectors.
A recent paper [26] also deals with the number of layers of a model and "scales". However, in that paper, the authors are concerned with how far a signal or its gradients will propagate through a network before dying. The scales they compute control how many layers the gradient or signal can propagate before they die: if the network is too deep, the gradients go to zero before fully backpropagating through the network, resulting in an untrainable network. The quantity they compute is a fundamental characteristic of the network and is independent of the dataset. It does not matter what the input data are, constant, single-scale, multiscale, etc.; the scale parameters they compute are not affected.
Conversely, we are most focused on multiscale signals and tailoring the architecture to best represent such signals. The scales we compute are inherent properties of the dataset, not inherent properties of the network model. The length of the network adapts to the scales contained in the dataset.
It would be interesting in the future to combine the two methodologies, with the methods in [26] giving bounds on the number of layers and our techniques adapted to analyze the inherent scales in the dataset. This could tell us whether a simple, fully connected, feedforward network could adequately represent the signal or whether something more complex was needed.

Conclusions
In this work, we presented a method for choosing hyperparameters for HSVR from time-series data only. We described two approaches, using the FFT or DMD. We saw that estimating hyperparameters with the FFT results in models with more layers and smaller error, whereas the DMD approach gave models with fewer layers (and thus more efficient models), at the cost of somewhat larger error.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.