Directional Support Vector Machines

Several phenomena are represented by directional—angular or periodic—data; from time references on the calendar to geographical coordinates. These values are usually represented as real values restricted to a given range (e.g., [0, 2π)), hiding the real nature of this information. In order to handle these variables properly in supervised classification tasks, alternatives to the naive Bayes classifier and logistic regression were proposed in the past. In this work, we propose directional-aware support vector machines. We address several realizations of the proposed models, studying their kernelized counterparts and their expressiveness. Finally, we validate the performance of the proposed Support Vector Machines (SVMs) against the directional naive Bayes and directional logistic regression with real data, obtaining competitive results.


Introduction
Several phenomena and concepts in real-life applications are represented by angular data or, as they are referred to in the literature, directional data. Examples of data that may be regarded as directional include temporal periods (e.g., time of day, week, month, year, etc.), compass directions, dihedral angles in molecules, orientations, rotations, and so on. The application fields include the study of wind direction as analyzed by meteorologists and magnetic fields in rocks studied by geologists.
The fact that zero degrees and 360 degrees are identical angles, so that for example 180 degrees is not a sensible mean of two degrees and 358 degrees, provides one illustration that special methods are required for the analysis of directional data.
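The example in the previous paragraph can be checked numerically. The sketch below is only an illustration: it computes the usual mean direction by averaging unit vectors, and the helper name `circular_mean_deg` is hypothetical rather than taken from this work.

```python
import math

def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, computed on the unit circle."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    # atan2 returns a value in (-180, 180]; map it to [0, 360).
    return math.degrees(math.atan2(s, c)) % 360.0

angles = [2.0, 358.0]
arithmetic = sum(angles) / len(angles)  # 180.0, which points the wrong way
circular = circular_mean_deg(angles)    # numerically close to 0 (equivalently 360)
```

The arithmetic mean lands on the opposite side of the circle, whereas the mean direction stays next to both observations.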
Directional data have been traditionally modeled with a wrapped probability density function, like a wrapped normal distribution, wrapped Cauchy distribution, or von Mises circular distribution. Measures of location and spread, like mean and variance, have been conveniently adapted to circular data.
The design of pattern recognition systems fed with directional data has either relied completely on these probabilistic models or just ignored the circular nature of the data.
In this work, we formulate for the first time a non-probabilistic model for directional data classification. We adopt the max-margin principle and the hinge loss, yielding a variant of the support vector machine model.
The theoretical properties of the model analyzed in the paper, together with the robust behavior shown experimentally, reveal the potential of the proposed method.

State-of-the-Art
Classical methods for classifier design include probabilistic and non-probabilistic approaches. Probabilistic approaches come in two flavors: generative modeling of the joint distribution p(x, y) and discriminative modeling of the conditional probabilities of the classes given the input. Non-probabilistic approaches directly model the boundaries in the input space or, equivalently, model the partition of the input space into decision regions.
Directional data classifiers have typically been approached [1][2][3] with generative models based on the von Mises distribution. The von Mises probability density function for the angle $x \in [0, 2\pi)$ is given by:

$$p(x \mid \mu, \kappa) = \frac{e^{\kappa \cos(x - \mu)}}{2\pi I_0(\kappa)}, \qquad (1)$$

where $I_0$ is the modified Bessel function of order zero, $\kappa > 0$ is the concentration parameter, and $\mu$ is the mean angle.
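As a small illustration of the von Mises density, the sketch below evaluates it with the standard library only. The power-series helper `bessel_i0` and the number of series terms are choices made for this example, not part of the original work.

```python
import math

def bessel_i0(kappa, terms=40):
    """Modified Bessel function of order zero, via its power series."""
    return sum((kappa / 2.0) ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))

def von_mises_pdf(x, mu, kappa):
    """Von Mises density with mean angle mu and concentration kappa > 0."""
    return math.exp(kappa * math.cos(x - mu)) / (2.0 * math.pi * bessel_i0(kappa))

# Riemann sum over one period; it should be close to 1 for a valid density.
M = 2000
total = sum(von_mises_pdf(2.0 * math.pi * i / M, mu=1.0, kappa=2.0) for i in range(M))
total *= 2.0 * math.pi / M
```

The density is periodic in x, so shifting the angle by 2π leaves its value unchanged, which is exactly the property that a density on the real line lacks.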
Analyzing the posterior probability of the classes,

$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)\,p(y = 1)}{p(x \mid y = 1)\,p(y = 1) + p(x \mid y = 0)\,p(y = 0)}, \qquad (2)$$

under the von Mises model for the likelihood, it is straightforward to conclude that:

$$p(y = 1 \mid x) = \frac{1}{1 + e^{w_0 + w_1 \sin(x + \theta)}}, \qquad (3)$$

where $w_0$, $w_1$, and $\theta$ are functions of the mean and concentration parameters. Recently [4], a directional logistic regression has been proposed that fits the model in Equation (3) directly from data. There, the model was naturally extended to the multidimensional setting:

$$p(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{w_0 + \sum_i w_i \sin(x_i + \theta_i)}}, \qquad (4)$$

where $x_i \in [0, 2\pi)$ is the ith element of the vector $\mathbf{x}$.
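Noting that:

$$w_i \sin(x_i + \theta_i) = w_{i1} \cos(x_i) + w_{i2} \sin(x_i), \qquad (5)$$

where $w_{i1} = w_i \sin(\theta_i)$ and $w_{i2} = w_i \cos(\theta_i)$ are obtained from $w_i$ and $\theta_i$, the directional logistic regression model is conveniently written as:

$$p(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{w_0 + \sum_i \left( w_{i1} \cos(x_i) + w_{i2} \sin(x_i) \right)}}, \qquad (6)$$

enabling the learning task to be solved with conventional logistic regression, by first applying a feature transformation where each input feature $x_i$ yields two features, $\cos(x_i)$ and $\sin(x_i)$.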
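The feature transformation and the reparameterization it relies on can be sketched as follows. The function names are hypothetical; the weight mapping follows from the angle-addition formula for the sine.

```python
import math

def directional_features(x):
    """Map each angle x_i to the pair (cos(x_i), sin(x_i))."""
    feats = []
    for xi in x:
        feats.extend([math.cos(xi), math.sin(xi)])
    return feats

def to_linear_weights(w, theta):
    """(w_i, theta_i) -> (w_i1, w_i2) with w_i sin(x + theta_i) = w_i1 cos(x) + w_i2 sin(x)."""
    return w * math.sin(theta), w * math.cos(theta)

# Check the identity for one arbitrary angle and parameter pair.
x, w, theta = 1.3, 0.7, 2.1
w1, w2 = to_linear_weights(w, theta)
lhs = w * math.sin(x + theta)
rhs = w1 * math.cos(x) + w2 * math.sin(x)
```

Once the features are expanded in this way, any conventional linear classifier can be trained on them without knowing that the inputs were angles.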

Support Vector Machine
For an intended output $y \in \{-1, 1\}$ and a classifier score $f(x|w)$, the hinge loss of the prediction $f(x|w)$ is defined as $L(w|x, y) = \max(1 - f(x|w) \cdot y, 0)$, where x is the input and w is the vector of parameters of the model. The Support Vector Machine (SVM) model solves the problem:

$$\min_{w} \sum_{n=1}^{N} L(w|x_n, y_n) + \lambda \|w\|^2, \qquad (7)$$

where $\lambda \geq 0$ is a regularization parameter. In standard $\mathbb{R}^D$ spaces, the model $f(x|w)$ is set to the affine form, $f(x|w) = w_0 + \sum_{i=1}^{D} w_i x_i$, and the previous equation can be equivalently written as:

$$\min_{w} \sum_{n=1}^{N} \max\Big(1 - y_n \Big(w_0 + \sum_{i=1}^{D} w_i x_{n,i}\Big), 0\Big) + \lambda \|w\|^2. \qquad (8)$$

In the trivial unidimensional space, the model boils down to $f(x_1|w_0, w_1) = w_0 + w_1 x_1$, and the partition of the input space is defined by a single threshold; see Figure 1a. In the following, to avoid unnecessarily cluttering the presentation, we will stay in the unidimensional space, returning to the multi-dimensional problem only at the end. We will also assume the period 2π for the directional data.
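A minimal sketch of the regularized hinge-loss objective for the affine model is given below. It assumes that the bias $w_0$ is left out of the regularizer, which is a common convention but an assumption here; all function names are illustrative.

```python
def hinge_loss(score, y):
    """Hinge loss max(1 - score * y, 0) for a label y in {-1, +1}."""
    return max(1.0 - score * y, 0.0)

def affine_score(x, w0, w):
    """Affine model f(x|w) = w0 + sum_i w_i x_i."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def svm_objective(X, Y, w0, w, lam):
    """Sum of hinge losses plus lam * ||w||^2 (bias not regularized: an assumption)."""
    data_term = sum(hinge_loss(affine_score(x, w0, w), y) for x, y in zip(X, Y))
    return data_term + lam * sum(wi * wi for wi in w)

# Two one-dimensional points separated with margin exactly 1 by the threshold x = 1.
X, Y = [[0.0], [2.0]], [-1, 1]
value = svm_objective(X, Y, w0=-1.0, w=[1.0], lam=0.5)
```

In this toy case both points sit exactly on the margin, so the data term vanishes and only the regularization term remains.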

Symmetric Directional SVM
For directional data, the model f (x 1 |w 0 , w 1 ) has to be adapted, as it should be periodic, continuous, and naturally take positive and negative values in the circular domain (so it can aim to label positive and negative examples correctly). Arguably, the most natural extension of the linear model in R is the piecewise linear model in [0, 2π); see Figure 1b,c. Note that, now, the partition of the input space requires two thresholds.
Motivated by this observation, we explore models of the form:

$$f(x|w_0, w_1, \theta) = w_0 + w_1\, g(x + \theta), \qquad (9)$$

where we start by investigating the following specific realizations for g(x):
1. $g(x) = g_t(x)$, where $g_t(x) = \frac{2}{\pi}\,\big|(x \bmod 2\pi) - \pi\big| - 1$ is the triangle wave with unitary amplitude, period 2π, and maxima at $x = 2k\pi$, $k \in \mathbb{Z}$. This function is piecewise linear, and so, it is close to the linear version in the standard domain.
2. $g(x) = g_s(x)$, where $g_s(x) = \cos(x)$. This option can be seen as a rough approximation to the intuitive choice $g_t(x)$ but, as we will see, it is analytically more tractable.
While in the standard domain the linear model is able to express (learn) an arbitrary threshold in the input domain, in the directional domain we need the ability to express any two thresholds. It is easy to conclude that, when instantiated with $g_t$ or $g_s$, the model is able to express two thresholds in the circular domain, whatever their positions are, as formally stated and proven in Proposition 1.
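The two candidate waves and the resulting pair of thresholds can be illustrated with the sketch below. The closed form used for the triangle wave is one expression matching the stated properties (unit amplitude, period 2π, maxima at multiples of 2π), and the grid-based threshold search is only an illustrative device.

```python
import math

TWO_PI = 2.0 * math.pi

def g_t(x):
    """Triangle wave: amplitude 1, period 2*pi, maxima at x = 2*k*pi."""
    return (2.0 / math.pi) * abs((x % TWO_PI) - math.pi) - 1.0

def g_s(x):
    """Cosine wave."""
    return math.cos(x)

def f(x, w0, w1, theta, g=g_t):
    """Directional model f(x) = w0 + w1 * g(x + theta)."""
    return w0 + w1 * g(x + theta)

def thresholds(w0, w1, theta, g=g_t, n=3600):
    """Grid angles in [0, 2*pi) at which the sign of f changes (circularly)."""
    xs = [TWO_PI * i / n for i in range(n)]
    signs = [f(x, w0, w1, theta, g) > 0 for x in xs]
    # signs[i - 1] wraps around for i = 0, closing the circle.
    return [xs[i] for i in range(n) if signs[i] != signs[i - 1]]

cuts_triangle = thresholds(0.3, 1.0, 0.5, g=g_t)
cuts_cosine = thresholds(0.3, 1.0, 0.5, g=g_s)
```

With $|w_0| < |w_1|$, both waves cross zero exactly twice per period; changing $\theta$ rotates the two thresholds, and changing $w_0 / w_1$ changes the arc length between them.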
For the $g_s$ option, using Equation (5), $f(x|w_0, w_1, \theta)$ can be equivalently written as $w_0 + w_1 \cos(\theta)\, g_s(x) - w_1 \sin(\theta)\, g_s^{\perp}(x)$, where $g_s^{\perp}(x) = g_s(x - \pi/2) = \sin(x)$. Therefore, we consider the following equivalent model:

$$f(x|w_0, w_{11}, w_{12}) = w_0 + w_{11}\, g_s(x) + w_{12}\, g_s^{\perp}(x). \qquad (10)$$

Similar to the result obtained with the directional logistic regression, the optimization problem in Equation (7) can be efficiently solved by first transforming each directional feature x into two new features, $g_s(x)$ and $g_s^{\perp}(x)$, and then relying on efficient methods for the conventional primal SVM, such as Pegasos [5].
Unfortunately, the analogous equivalence does not hold for the triangle wave $g_t$. For $g_t$, $f(x|w_0, w_1, \theta)$ cannot be written as $w_0 + w_{11}\, g_t(x) + w_{12}\, g_t^{\perp}(x)$, where $g_t^{\perp}(x) = g_t(x - \pi/2)$. Still, we could be led to assume the decomposition in Equation (10) as a good approximation when instantiated with $g_t$ and use it in practice, with the benefit of using standard SVM toolboxes on pre-processed data. However, the expressiveness of this model is quite limited. For instance, the model $w_0 + w_{11}\, g_t(x) + w_{12}\, g_t^{\perp}(x)$ is unable to learn two thresholds in $[0, \pi/2]$: since this model is linear in this interval, it can change sign there at most once, and the result follows.
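The linearity argument can be verified numerically: on a uniform grid, a linear function has vanishing second differences. The sketch below is illustrative only; the helper names and the weights are arbitrary choices.

```python
import math

TWO_PI = 2.0 * math.pi

def g_t(x):
    """Triangle wave: amplitude 1, period 2*pi, maxima at x = 2*k*pi."""
    return (2.0 / math.pi) * abs((x % TWO_PI) - math.pi) - 1.0

def g_t_perp(x):
    """Triangle wave shifted by a quarter period."""
    return g_t(x - math.pi / 2.0)

def decomposed(x, w0, w11, w12):
    """Model w0 + w11 * g_t(x) + w12 * g_t_perp(x)."""
    return w0 + w11 * g_t(x) + w12 * g_t_perp(x)

def max_second_difference(w0, w11, w12, lo, hi, n=200):
    """Largest |f(a) - 2 f(b) + f(c)| over consecutive grid points in [lo, hi]."""
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    ys = [decomposed(x, w0, w11, w12) for x in xs]
    return max(abs(ys[i - 1] - 2.0 * ys[i] + ys[i + 1]) for i in range(1, n))

inside = max_second_difference(0.2, 0.8, 1.0, 0.0, math.pi / 2.0)  # linear piece
wider = max_second_difference(0.2, 0.8, 1.0, 0.0, math.pi)         # contains a kink
```

On $[0, \pi/2]$ the second differences vanish up to rounding, while on $[0, \pi]$ the kink of the shifted wave at $\pi/2$ shows up as a non-zero second difference.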
As such, for $g_t$, we solve the learning task defined by Equation (7) using sub-gradient methods.

Kernelized Symmetric Directional SVM
By the representer theorem, the optimal f(x) in Equation (8) has the form:

$$f(x) = w_0 + \sum_{n=1}^{N} \alpha_n\, k(x, x_n), \qquad (11)$$

where k is a positive-definite real-valued kernel and $\alpha_n \in \mathbb{R}$. Benefiting from the decomposition of each directional feature in two, this formulation is directly applicable to the primal, fixed-margin, directional SVM when using $g_s$. As such, all the conventional kernels can be applied in this extended space. When the model is instantiated with $g_s$, x is mapped into a two-dimensional feature vector, $\varphi(x) = (\cos(x), \sin(x))^{\top}$, and the inner product between $\varphi(x)$ and $\varphi(z)$ becomes $\langle \varphi(x), \varphi(z) \rangle = \cos(x)\cos(z) + \sin(x)\sin(z) = \cos(x - z)$. As such, the feature transformation can be avoided by setting as the kernel the cosine of the angular difference, $k(x, z) = \cos(x - z)$.
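The equivalence between the explicit feature map and the cosine kernel can be checked with a few lines. This is a minimal sketch with illustrative names; the Gram matrix helper is included because many SVM toolboxes accept a precomputed kernel matrix.

```python
import math

def phi(x):
    """Explicit feature map for the cosine wave."""
    return (math.cos(x), math.sin(x))

def cosine_kernel(x, z):
    """Kernel given by the cosine of the angular difference."""
    return math.cos(x - z)

def gram(points, kernel):
    """Gram matrix K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(a, b) for b in points] for a in points]

points = [0.1, 1.7, 3.0, 5.9]
K = gram(points, cosine_kernel)
# The same matrix computed from explicit inner products.
K_explicit = [[sum(u * v for u, v in zip(phi(a), phi(b))) for b in points] for a in points]
max_gap = max(abs(K[i][j] - K_explicit[i][j]) for i in range(4) for j in range(4))
```

The kernel depends only on the angular difference, so it is automatically invariant to shifting either argument by a full period.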
As seen before, in the case of $g_t$, a similar conclusion does not hold. However, the result for $g_s$ suggests also investigating the interest of using the function $g_t(x - z)$ as a kernel. We start by presenting Theorem 1, where we show that a broad family of functions, which includes both $g_s$ and $g_t$, may be used to construct formally-valid kernels.

Theorem 1. Let $h : \mathbb{R} \to \mathbb{R}$ be a periodic function with period T and absolutely integrable over one period. Define $g : \mathbb{R} \to \mathbb{R}$ as the autocorrelation of h, i.e.:

$$g(\tau) = \int_0^T h(t)\, h(t + \tau)\, dt. \qquad (12)$$

Then, $k(x, z) = g(x - z)$ is a kernel function, i.e., there exists a mapping $\varphi$ from $\mathbb{R}$ to a feature space $\mathcal{X}$ such that $\langle \varphi(x), \varphi(z) \rangle = g(x - z)$.
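As a numerical illustration of Theorem 1, the sketch below computes the autocorrelation of a square wave with amplitude $1/\sqrt{2\pi}$ and period 2π and compares it with the triangle wave. It assumes the autocorrelation is the unnormalized integral over one period, which is the convention consistent with that amplitude; the grid resolution is an arbitrary choice.

```python
import math

M = 720                       # samples per period (assumed grid resolution)
DT = 2.0 * math.pi / M
AMP = 1.0 / math.sqrt(2.0 * math.pi)

# Square wave with amplitude 1/sqrt(2*pi) and period 2*pi, sampled over one period.
h = [AMP if j < M // 2 else -AMP for j in range(M)]

def autocorrelation(k):
    """Riemann-sum approximation of g(tau_k) = integral of h(t) h(t + tau_k) dt."""
    return DT * sum(h[j] * h[(j + k) % M] for j in range(M))

def g_t(x):
    """Triangle wave: amplitude 1, period 2*pi, maxima at x = 2*k*pi."""
    return (2.0 / math.pi) * abs((x % (2.0 * math.pi)) - math.pi) - 1.0

# Largest deviation between the autocorrelation and the triangle wave on the grid.
max_gap = max(abs(autocorrelation(k) - g_t(k * DT)) for k in range(M))
```

On this grid the two functions agree up to rounding, because each shift by one sample flips the sign of exactly two products in the sum.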

Remark 1.
The triangle wave $g_t$ is the autocorrelation, as defined in this paper, of a square wave with amplitude $1/\sqrt{2\pi}$ and period 2π.

Figure 2. (a) The SVM with the triangle wave in the primal form cannot learn more than two thresholds; therefore, some training points in this toy dataset are misclassified. (b) The SVM with the triangle wave kernel can learn an arbitrary number of thresholds, classifying every training point in this toy dataset correctly.

Having proven the validity of $g_t$ as a kernel, we now focus on investigating the expressiveness of the resulting SVM. The note made in Section 4.1 supports that the sum of triangle functions centered in fixed positions is not expressive enough, since it cannot place the decision boundaries in arbitrary positions. However, the kernelized version in Equation (11) can still be appealing, since now the models are centered in the training observations and, as such, adapted in number and phase to the training data. For the purpose of this analysis, the notion of the Vapnik-Chervonenkis (VC) dimension [6], given in Definition 1, will be useful.

Definition 1.
A parametric binary classifier $l(x|w)$, with parameters w, is said to shatter a dataset $\mathbf{x} = (x_1, ..., x_N)$ if, for any label assignment $\mathbf{y} = (y_1, ..., y_N)$, there exist parameters w such that $l(x|w)$ classifies correctly every data point in $\mathbf{x}$. The VC dimension of $l(x|w)$ is the size N of the largest dataset that is shattered by such a classifier.
Thus, the VC dimension of a classifier provides a measure of its expressive power. In Theorem 2, we establish a result that determines the VC dimension of the kernelized SVMs we are considering.
Theorem 2. Let $k(x, z) = g(x - z)$ be a kernel function where g is defined as in Theorem 1. Furthermore, suppose that g has zero mean value and its Fourier series has exactly N non-zero coefficients. Then, the VC dimension of the classifier:

$$l(x|w_0, w_d) = \mathrm{sign}\big(f(x|w_0, w_d)\big), \qquad f(x|w_0, w_d) = w_0 + w_d^{\top} \varphi(x), \qquad (13)$$

where $\varphi$ is the feature mapping induced by k, equals 2N + 1.

The Fourier series of the triangle wave $g_t$ has an infinite number of non-zero coefficients, and therefore, the classifier instantiated with the triangle wave kernel $k(x, z) = g_t(x - z)$ has infinite VC dimension. On the other hand, the VC dimension of the classifier instantiated with the cosine kernel $k(x, z) = \cos(x - z)$ equals three. Consequently, the SVM with the triangle wave kernel is able to express an arbitrary number of thresholds in the circular domain, unlike the SVM with the cosine kernel or with the triangle wave in the primal form, which, as proven before, can only express two thresholds in [0, 2π). Figure 2 illustrates these differences.
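The difference between the two kernels can be made concrete by estimating their Fourier cosine coefficients numerically. The sketch below is an illustration under assumptions: it uses a plain Riemann sum, a tolerance chosen for this example, and only inspects the first nine harmonics.

```python
import math

TWO_PI = 2.0 * math.pi

def g_t(x):
    """Triangle wave: amplitude 1, period 2*pi, maxima at x = 2*k*pi."""
    return (2.0 / math.pi) * abs((x % TWO_PI) - math.pi) - 1.0

def cosine_coefficient(g, n, samples=4096):
    """Riemann-sum estimate of a_n = (1/pi) * integral over one period of g(t) cos(n t) dt."""
    dt = TWO_PI / samples
    return sum(g(i * dt) * math.cos(n * i * dt) for i in range(samples)) * dt / math.pi

def nonzero_harmonics(g, max_n, tol=1e-6):
    """Indices n in 1..max_n whose cosine coefficient exceeds tol in magnitude."""
    return [n for n in range(1, max_n + 1) if abs(cosine_coefficient(g, n)) > tol]

cos_harmonics = nonzero_harmonics(math.cos, 9)  # a single harmonic
tri_harmonics = nonzero_harmonics(g_t, 9)       # every odd harmonic in the range
```

The cosine has a single non-zero coefficient, which gives a VC dimension of 2 · 1 + 1 = 3, whereas the triangle wave keeps contributing non-zero odd harmonics (decaying as $1/n^2$) however far the range is extended.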
However, depending on the relative position of the data points, even the SVM with the triangle wave kernel may fail to assign the correct label to all of them. In order to overcome this limitation, composite kernels, constructed from this baseline, can be explored. Typical cases include the polynomial directional kernel, $k(x, z) = (g(x - z) + 1)^d$, where d is the polynomial degree, and the directional RBF kernel, $k(x, z) = \exp(\kappa\, g(x - z))$ (while the standard RBF kernel relies on the Gaussian expression, the directional RBF kernel relies on the expression of the von Mises distribution). Figure 3a shows a simple training set that is correctly learned both with the primal and kernel triangle wave formulations. On the other hand, it should be clear that the setting in Figure 3b cannot be correctly learned by these same models. In this case, setting the SVM with the kernel $k(x, z) = (g_t(x - z) + 1)^2$ achieves the correct labeling.
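The two composite kernels can be written as plain functions of a pair of angles, as sketched below. The default arguments (degree 2, κ = 1, and the choice of base wave for each kernel) are illustrative assumptions.

```python
import math

TWO_PI = 2.0 * math.pi

def g_t(x):
    """Triangle wave: amplitude 1, period 2*pi, maxima at x = 2*k*pi."""
    return (2.0 / math.pi) * abs((x % TWO_PI) - math.pi) - 1.0

def poly_directional_kernel(x, z, d=2, g=g_t):
    """Polynomial directional kernel (g(x - z) + 1)^d."""
    return (g(x - z) + 1.0) ** d

def rbf_directional_kernel(x, z, kappa=1.0, g=math.cos):
    """Directional RBF kernel exp(kappa * g(x - z))."""
    return math.exp(kappa * g(x - z))

k_same = poly_directional_kernel(0.4, 0.4)            # g(0) = 1, so (1 + 1)^2
k_rbf_same = rbf_directional_kernel(0.4, 0.4, kappa=2.0)
```

Functions of this kind can be passed to any toolbox that accepts a user-defined kernel or a precomputed Gram matrix.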
It is important to note that standard, off-the-shelf, toolboxes can be used to solve the kernelized directional SVM directly. One just needs to properly define the kernel as discussed before.

Asymmetric Directional SVM
In Figure 4, we portray a toy dataset together with the model that optimizes Equation (7) using $g_t$ in the model f(x). As observed, the margin is determined by the "worst case" transition between positive and negative examples. It is reasonable to assume that a model placing the second threshold centered in the gap between positive and negative examples would generalize better. Shashua and Levin [7] faced a problem with similar characteristics when addressing ordinal data classification in $\mathbb{R}^D$. Similar to them, we propose to maximize the sum of the margins around the two threshold points. Towards that goal, we only need to generalize the model to allow independent slopes in the two parts of the triangle wave, setting $f(x|w_0, w_1, \theta, \zeta) = w_0 + w_1\, g_{asy}(x + \theta, \zeta)$, where $g_{asy}(\cdot, \zeta)$ is an asymmetric triangle wave with period 2π and $\zeta \in (0, 1)$ controls the asymmetry of the wave: if ζ → 0, the wave has infinite ascending slope; if ζ → 1, the wave has infinite descending slope; and for ζ = 0.5, it coincides with the symmetric case, $g_t$. The wave is depicted in Figure 5 for some values of ζ.
It should be clear that the model instantiated with $g_{asy}$ retains the same expressiveness as before, being able to express two thresholds (and not more than two) in any position in [0, 2π).
As before with $g_t$, it is not possible to solve the optimization problem as a conventional setting, and we again directly optimize the goal function using sub-gradient methods.
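One possible realization of the asymmetric wave is sketched below. The exact parameterization is an assumption: the function is built only to satisfy the properties stated in the text (period 2π, unit amplitude, infinite ascending slope as ζ → 0, infinite descending slope as ζ → 1, and agreement with $g_t$ at ζ = 0.5), and the phase convention may differ from the one used by the authors.

```python
import math

TWO_PI = 2.0 * math.pi

def g_t(x):
    """Symmetric triangle wave: amplitude 1, period 2*pi, maxima at x = 2*k*pi."""
    return (2.0 / math.pi) * abs((x % TWO_PI) - math.pi) - 1.0

def g_asy(x, zeta):
    """Asymmetric triangle wave (assumed form), for 0 < zeta < 1.

    The wave falls from +1 to -1 over a fraction (1 - zeta) of the period
    and rises back to +1 over the remaining fraction zeta.
    """
    u = (x % TWO_PI) / TWO_PI  # position within the period, in [0, 1)
    if u < 1.0 - zeta:
        return 1.0 - 2.0 * u / (1.0 - zeta)
    return -1.0 + 2.0 * (u - (1.0 - zeta)) / zeta

# With zeta = 0.5 the asymmetric wave should reproduce the symmetric one.
grid = [TWO_PI * i / 500 for i in range(500)]
max_gap = max(abs(g_asy(x, 0.5) - g_t(x)) for x in grid)
```

Because the model also includes the phase θ, any other phase convention for the wave leads to the same family of decision functions.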
Appl. Sci. 2019, xx, 5 x

Kernelized Asymmetric Directional SVM
Before, motivated by the behavior with $g_s$ and the representer theorem, we explored models of the form $f(x) = w_0 + \sum_{n=1}^{N} \alpha_n\, g_t(x - x_n)$. Using the decomposition depicted in Figure 6, $g_t(x) = g_{t_1}(x) + g_{t_2}(x)$, we can rewrite the model as $w_0 + \sum_{n=1}^{N} \alpha_n \big(g_{t_1}(x - x_n) + g_{t_2}(x - x_n)\big)$. We can now gain independence in the two slopes of the model by extending it to $w_0 + \sum_{n=1}^{N} \big(\alpha_{n_1}\, g_{t_1}(x - x_n) + \alpha_{n_2}\, g_{t_2}(x - x_n)\big)$, where $\alpha_{n_1}$ and $\alpha_{n_2}$ are two independent parameters to be optimized from the training set.
Since $g_{t_2}(x) = g_{t_1}(-x)$, the model equals:

$$f(x) = w_0 + \sum_{n=1}^{N} \big(\alpha_{n_1}\, g_{t_1}(x - x_n) + \alpha_{n_2}\, g_{t_1}(x_n - x)\big). \qquad (15)$$

Figure 6. Decomposition of the triangle wave.

The Multi-Dimensional Setting
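The extension of the ideas presented before to the multi-dimensional setting is straightforward. For this purpose, assume our data consist of both directional and non-directional components. This allows each data example to be represented as a vector $\mathbf{x} = (\mathbf{x}^{(d)}, \mathbf{x}^{(l)})$, where $\mathbf{x}^{(d)} \in \mathbb{R}^D$ represents the directional components and $\mathbf{x}^{(l)} \in \mathbb{R}^L$ represents the non-directional ones. Suppose we wish to represent the ith directional component $x^{(d)}_i$ in a feature space $\mathcal{X}^{(d)}_i$, through a mapping $\varphi^{(d)}_i : \mathbb{R} \to \mathcal{X}^{(d)}_i$, and the non-directional ones in a feature space $\mathcal{X}^{(l)}$, through a mapping $\varphi^{(l)} : \mathbb{R}^L \to \mathcal{X}^{(l)}$. Then, our model $f(\mathbf{x}|w)$ becomes:

$$f(\mathbf{x}|w) = w_0 + \sum_{i=1}^{D} \big\langle w^{(d)}_i, \varphi^{(d)}_i(x^{(d)}_i) \big\rangle + \big\langle w^{(l)}, \varphi^{(l)}(\mathbf{x}^{(l)}) \big\rangle. \qquad (16)$$

Therefore, in the standard setting where the feature spaces are fixed and possibly infinite dimensional, but the respective inner products have a closed form, we may use the kernel trick to solve the optimization problem. Such a kernel is an inner product in the joint feature space $\mathcal{X}^{(d)}_1 \times \cdots \times \mathcal{X}^{(d)}_D \times \mathcal{X}^{(l)}$ and equals the sum of the individual kernels:

$$k(\mathbf{x}, \mathbf{z}) = \sum_{i=1}^{D} k_i\big(x^{(d)}_i, z^{(d)}_i\big) + k^{(l)}\big(\mathbf{x}^{(l)}, \mathbf{z}^{(l)}\big),$$

where $k_i(\cdot, \cdot) = \langle \varphi^{(d)}_i(\cdot), \varphi^{(d)}_i(\cdot) \rangle$ and $k^{(l)}(\cdot, \cdot) = \langle \varphi^{(l)}(\cdot), \varphi^{(l)}(\cdot) \rangle$. If the feature mappings $\varphi^{(d)}_i$ are finite dimensional functions that depend also on parameters to be optimized, like, for instance, in the case of $\varphi^{(d)}_i(x^{(d)}_i) = g(x^{(d)}_i + \theta_i)$, the kernel itself becomes dependent on such parameters. In this setting, we opted to plug Equation (16) directly into Equation (7), solving the problem directly in its primal form using gradient-based optimization.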
For simplicity, we set φ (l) to the identity in our experiments, inducing the usage of the linear kernel for all the non-directional components.
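For the configuration just described (identity mapping on the non-directional components), a joint kernel can be sketched as a sum of per-component kernels. The example assumes the cosine kernel on every directional component, which is only one of the options discussed; function names are illustrative.

```python
import math

def joint_kernel(x_dir, x_lin, z_dir, z_lin):
    """Sum of cosine kernels over directional components plus a linear kernel on the rest."""
    k_dir = sum(math.cos(a - b) for a, b in zip(x_dir, z_dir))
    k_lin = sum(a * b for a, b in zip(x_lin, z_lin))
    return k_dir + k_lin

def joint_features(x_dir, x_lin):
    """Explicit joint feature vector: (cos, sin) per angle, then the linear features as they are."""
    feats = []
    for a in x_dir:
        feats.extend([math.cos(a), math.sin(a)])
    return feats + list(x_lin)

x_dir, x_lin = [0.2, 4.0], [0.5, 0.1, 0.9]
z_dir, z_lin = [1.1, 5.5], [0.3, 0.7, 0.2]
k_value = joint_kernel(x_dir, x_lin, z_dir, z_lin)
inner = sum(u * v for u, v in zip(joint_features(x_dir, x_lin), joint_features(z_dir, z_lin)))
```

The kernel value matches the inner product of the concatenated feature vectors, which is why the sum of the individual kernels is itself a valid kernel on the joint space.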

Experiments
In this section, we detail the experimental evaluation of the proposed directional support vector machines against two state-of-the-art directional classifiers: the von Mises naive Bayes [8] and the directional logistic regression [4]. Following [4], the κ parameter of the von Mises distribution was approximated by 100 iterations (a much larger number of iterations than required to have good convergence values) of Newton's method proposed by Sra [9].
The SVM regularization constant $C = \lambda^{-1}$ was chosen using a stratified 3-fold cross-validation strategy. The range of explored values was $10^{-3}, \ldots, 10^{3}$. The concentration parameter κ of the directional RBF kernel was also selected through 3-fold cross-validation in the range $10^{-3}, \ldots, 10^{1}$. The primal directional SVM with fixed margin was randomly initialized and optimized using Adam [10] for 500 iterations. On the other hand, the asymmetric primal SVM was initialized with the parameters of the fixed-margin model after 400 iterations. Then, all the parameters were fine-tuned for an additional 100 iterations. Using pre-trained parameters for the SVM with an asymmetric margin facilitates convergence, given the coupled effect of the θ and ζ parameters. The kernelized directional version with cosine and triangular kernels was optimized using the standard libsvm [11], which implements an SMO-type algorithm proposed in [12]. For the asymmetric kernel, we used the aforementioned approach to fine-tune the α coefficients obtained by the standard toolkit.
We validated the advantages of the proposed approach using 12 publicly-available datasets: Arrhythmia [13], Behavior [14], Characters [15], Colposcopy, Continents, eBay [16], MAGIC [17], Megaspores [18], OnlineNews [19], Temperature1, Temperature2, and Wall [20]. Relevant properties about these datasets (e.g., number of directional and non-directional features, number of classes, dimensionality) are presented in Appendix B. Experiments in previous works [4,8] have shown that directional classifiers outperform traditional ones in these datasets, proving that directionality is an important attribute to exploit. Further details about the datasets, including their acquisition and preprocessing, were presented in [4]. Additionally, in order to facilitate the convergence of the SVM-based models, all the non-directional features were scaled to the range 0-1.
Multiclass instances were handled using a one-versus-one approach for all the binary models (i.e., logistic regression and support vector machines). All the experiments detailed below were executed with a 3-fold stratified cross-validation technique (i.e., by preserving the percentage of samples for each class), selecting the best model in terms of accuracy, and the results of 30 different runs were averaged. Specifically, for each model and dataset, we have evaluated the accuracy and the macro F1-score, which corresponds to the unweighted mean value of the individual F1-scores of each class. Results of these experiments are summarized in Tables 1 and 2, exhibiting average accuracy and macro F1-score, respectively, for 30 independent runs. The best results for each dataset are marked in bold. For reproducibility purposes, the source code and the training-testing partitions are made available (https://github.com/dpernes/dirsvm). The results achieved by the von Mises naive Bayes (vMNB) and directional Logistic Regression (dLR) align with the results reported in the literature [4].

Table 2. Average macro F1-score and standard deviation in the same setting as described in Table 1. Best results are given in bold.
Dataset vMNB dLR dRBF-SVM cos-SVM t-SVM T-SVM a-SVM
Arrhythmia [13] 62.40 ± 6.3 77.80 ± 6.6 77.10 ± 5.3 77.12 ± 5.7 77.14 ± 6.5 75.21 ± 6.5 77.38 ± 6.2 75.02 ± 6.3
eBay [16] 83.64 ± 6.0 85. 13

Hereafter, we will denote by non-Kernelized directional SVMs (nK-dSVM) the subset of proposed SVM variations with VC dimension equivalent to the one induced by the directional logistic regression; namely, the primal fixed-margin directional SVM with triangle (symmetric and asymmetric) and cosine waves. The remaining models (i.e., directional RBF, symmetric, and asymmetric kernels) will be referred to as Kernelized directional SVMs (K-dSVM).
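The macro F1-score used in the evaluation, i.e., the unweighted mean of the per-class F1-scores, can be sketched as follows. The handling of classes with no true or predicted samples (score set to zero) is an assumption of this example, and the function name is illustrative.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2 TP / (2 TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = macro_f1(y_true, y_pred)  # per-class F1-scores are 0.8 and 2/3
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

Because every class contributes equally regardless of its size, this metric is more sensitive than accuracy to errors on minority classes.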
Although some datasets used here are considerably imbalanced, accuracy and macro F1-score values were fairly consistent with each other, in the sense that the best model in terms of accuracy was the top-1 model in terms of macro F1-score in 10 of 12 datasets and was among the top-2 models in all datasets. While the dLR achieved a competitive general performance, it was surpassed by at least one of the proposed SVM alternatives in most cases. nK-dSVM performed better than dLR on small datasets, given the margin regularization imposed by the SVM loss function. For larger datasets, dLR performed better, since the generalization induced by the nK-dSVM margin became less relevant. However, for large datasets, K-dSVM surpassed dLR and their non-kernelized counterparts in most cases. In general, dSVM with asymmetric margins (kernelized and non-kernelized) attained the best results, obtaining the best average performance on half of the datasets.
As shown in Section 5, kernels involving the triangle wave correspond to inner products in an infinite-dimensional feature space. The same is also true for the directional RBF kernel. Non-kernelized methods, on the other hand, are constructed by explicitly defining the feature transformation, having a necessarily finite VC dimension. Therefore, the former produce models with higher capacity, which may lead to overfitting in small datasets, but better accuracy for large ones. This is confirmed by our experiments: the non-kernelized models achieved the best results in small datasets, while kernelized models built on top of the triangle wave and directional RBF kernel attained the best results in large datasets. The performance gains of kernelized models on the larger datasets were small, however, which may be explained by the unimodal distribution of the angular variables. On datasets with a multi-modal distribution of the directional variables, it is expected to observe higher gains by K-dSVM.

Towards Deep Directional Classifiers
Deep neural networks have achieved remarkable results in multiple machine learning problems and, particularly, in supervised classification. SVMs, on the one hand, typically decouple the data representation problem from the learning problem, by first projecting the data into a prespecified feature space and then learning a hyperplane that separates the two classes. Deep networks, on the other hand, jointly learn the data representation and the decision function, exhibiting superior performance mostly when trained on large datasets. In the context of directional data, we argue that significant performance improvements might be attained by combining the angular awareness of directional feature transformations or kernels with the representation learning provided by deep neural networks.
In order to evaluate the potential of deep classifiers for directional data, we present two further experiments in this section. Specifically, we have trained two Multilayer Perceptrons (MLPs), which were essentially identical, except for one important difference: one of them (denoted by rMLP) was trained on top of raw angle values (normalized to lie in a single period); the other one (denoted by dMLP) was trained on top of the feature transformation $\varphi(x) = (\cos(x), \sin(x))^{\top}$, which defines the cosine kernel, applied to all angular components. The latter was a first attempt towards deep directional classifiers, while the former was completely unaware of the directionality of the data. Each hidden layer in the MLPs had the following structure: fully-connected transformation (dense layer) with 256 output neurons + batch normalization [21] + ReLU + dropout [22]. The output layer was a standard fully-connected transformation followed by a sigmoid, in the case of binary classification, or a softmax, when there were more than two classes. The models were trained to minimize the usual cross-entropy loss with $\ell_2$ regularization. The total number of layers was chosen between 4 and 5 using 3-fold cross-validation, together with the remaining hyperparameters (dropout rate, $\ell_2$ regularization weight, and learning rate). Training was performed for 200 epochs or until the loss plateaued. The training protocol, including the evaluated datasets, the number of runs for each dataset, and the evaluated metrics, was exactly the same as in the previous set of experiments.
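The difference between the two input encodings can be illustrated without training any network. The sketch below compares how far apart two nearly identical angles, on either side of the 0/2π boundary, appear under each encoding; the helper names are hypothetical.

```python
import math

def raw_angle_input(x):
    """Input seen by a network fed with raw angles, normalized to one period."""
    return (x % (2.0 * math.pi),)

def directional_input(x):
    """Input seen by a network fed with the (cos, sin) feature transformation."""
    return (math.cos(x), math.sin(x))

def euclidean(u, v):
    """Euclidean distance between two equally sized tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two directions only 0.02 rad apart, on opposite sides of the wrap-around point.
a, b = 0.01, 2.0 * math.pi - 0.01
raw_gap = euclidean(raw_angle_input(a), raw_angle_input(b))
dir_gap = euclidean(directional_input(a), directional_input(b))
```

Under the raw encoding the two inputs sit at opposite ends of the range, so the network has to learn the wrap-around from data, whereas the transformed inputs are already close.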
Results are in Table 3a,b, where we show again the values of our most accurate SVM (denoted by best-SVM) in each dataset for easier comparison. Like before, we observed high consistency between the accuracy and macro F1-score values. As expected, rMLP had the worst overall performance, and this effect was mostly apparent in small datasets where the number of directional features was in the same order of magnitude as the number of non-directional ones, like Colposcopy and eBay (see Appendix B). In larger datasets and in those where the number of non-directional features was much larger than the number of directional ones (e.g., Behavior, OnlineNews), rMLP achieved more competitive results. The exception was the MAGIC dataset, where the single directional feature seemed to have a high discriminative power, and so, rMLP achieved the lowest performance among the three models. Contrary to what we just observed for rMLP, the gains of dMLP were highly encouraging. This model, built on top of a directional feature transformation, generally outperformed best-SVM in larger datasets and achieved competitive results even in smaller ones. This observation reinforces the role of directionality in these datasets and, more importantly, motivates further research to merge directional feature transformations and/or kernels with deep neural networks, which we plan to develop as future work.

Conclusions
Several concepts in real-life applications are represented by directional variables; from periodic time representation on calendars to compass directions. Traditional classifiers, which are unaware of the angular nature of these variables, might not properly model the data. Thereby, the study of directional classifiers is relevant for the machine learning community. Previous attempts to address classification tasks with directional variables focused on generative models [8] and discriminative linear models (logistic regression) [4].
In this work, we proposed several instantiations of directional-aware support vector machines. First, we modified the SVM decision function by considering parametric periodic mappings of the directional variables using cosine and triangle waves. Then, we proposed an extension of the model with triangular waves in order to allow asymmetric margins on the circle. The kernelized versions of these models were proposed as well. Furthermore, we analyzed and demonstrated the expressiveness of each proposed alternative.
In the experimental assessment, the relevance of the proposed models was evaluated, being able to achieve competitive results in most datasets. As expected, when compared to other shallow directional classifiers, kernelized models built on top of the triangle wave attained the best results in larger datasets, due to their large expressive power, which we have proven theoretically. One extra experiment combining a directional feature transformation and a deep neural network showed very promising results and clearly motivates further research.
Since the additional parameters involved in our asymmetric SVMs (in both kernelized and non-kernelized versions) have a periodic impact on the decision boundary or are constrained to a specific domain, using gradient-based optimization techniques may result in sub-optimal models. While this problem was circumvented by using fine-tuning from simpler models, there is research room for the design and exploration of optimization techniques specific for these models. Furthermore, deep multiple kernel learning [23] is an unexplored research line in directional data settings that may lead to a unified framework combining directional kernel machines and deep neural networks.
Author Contributions: K.F. and J.S.C. motivated the problem and designed the proposed models; D.P. conceived the mathematical proofs in the Appendix and wrote most parts of the paper; D.P. and K.F. conducted the experiments; J.S.C. supervised the work.
Funding: This work was partially financed by the ERDF (European Regional Development Fund) through the Operational Programme for Competitiveness and Internationalisation (COMPETE) 2020 Programme and by National Funds through the Portuguese funding agency, FCT (Fundação para a Ciência e a Tecnologia) within project "POCI-01-0145-FEDER-028857" and also by Fundação para a Ciência e a Tecnologia within Ph.D. Grant Numbers SFRH/BD/93012/2013 and SFRH/BD/129600/2017.

Conflicts of Interest:
The authors declare no conflict of interest.

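Abbreviations
The following abbreviations are used in this manuscript:
SVM: Support Vector Machine
VC: Vapnik-Chervonenkis
vMNB: von Mises naive Bayes
dLR: directional Logistic Regression
nK-dSVM: non-Kernelized directional SVM
K-dSVM: Kernelized directional SVM
MLP: Multilayer Perceptron

Appendix A.1. Proof of Theorem 1
From the definition of g, we may verify that it is an even function:

$$g(-\tau) = \int_0^T h(t)\, h(t - \tau)\, dt = \int_0^T h(u + \tau)\, h(u)\, du = g(\tau),$$

where the change of variable $u = t - \tau$ and the periodicity of h were used. Because g is also periodic with period T, it may be expressed in a Fourier series of the form:

$$g(\tau) = a_0 + \sum_{n=1}^{\infty} a_n \cos\!\left(\frac{2n\pi\tau}{T}\right).$$

Thus,

$$g(x - z) = a_0 + \sum_{n=1}^{\infty} a_n \left[\cos\!\left(\frac{2n\pi x}{T}\right)\cos\!\left(\frac{2n\pi z}{T}\right) + \sin\!\left(\frac{2n\pi x}{T}\right)\sin\!\left(\frac{2n\pi z}{T}\right)\right] = \varphi(x)^{\top} \varphi(z),$$

where:

$$\varphi(\cdot) = \left(\sqrt{a_0}, \ldots, \sqrt{a_n} \cos\!\left(\frac{2n\pi\,\cdot}{T}\right), \sqrt{a_n} \sin\!\left(\frac{2n\pi\,\cdot}{T}\right), \ldots\right)^{\top}. \qquad (A4)$$

Therefore, if $\varphi(x)$ and $\varphi(z)$ are real vectors, the product $\varphi(x)^{\top}\varphi(z)$ is an inner product, and so, $g(x - z)$ is a kernel function. Clearly, $\varphi(x)$ and $\varphi(z)$ are real vectors if and only if $a_n \geq 0$, ∀ n, which can be proven to be true, concluding the proof:

$$a_0 = \frac{1}{T} \int_0^T g(\tau)\, d\tau = \frac{1}{T} \left(\int_0^T h(t)\, dt\right)^2 \geq 0,$$

and, for n ≥ 1,

$$a_n = \frac{2}{T} \int_0^T g(\tau) \cos\!\left(\frac{2n\pi\tau}{T}\right) d\tau = \frac{2}{T} \left|\int_0^T h(t)\, e^{-j 2n\pi t / T}\, dt\right|^2 \geq 0.$$

Appendix A.2. VC Dimension: Proof of Theorem 2
Before going into the details of the proof of Theorem 2, we need the result presented in Lemma A1.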
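Lemma A1. Let $F = \{\varphi_1, \cdots, \varphi_N\}$ be a set of square integrable and non-zero functions $\varphi_i : I \subseteq \mathbb{R} \to \mathbb{R}$, where I is an interval, that satisfy:

$$\int_I \varphi_i(x)\, \varphi_j(x)\, dx = 0, \quad \forall\, i \neq j.$$

There exists a vector $\mathbf{x} = (x_1, ..., x_N) \in I^N$ such that the matrix:

$$\Phi(\mathbf{x}) = \begin{bmatrix} \varphi_1(x_1) & \cdots & \varphi_N(x_1) \\ \vdots & \ddots & \vdots \\ \varphi_1(x_N) & \cdots & \varphi_N(x_N) \end{bmatrix}$$

has full rank.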
Proof. We shall prove by contradiction that such $\mathbf{x}$ actually exists. Suppose that $\Phi(\mathbf{x})$ is rank deficient for all $\mathbf{x}$. Then, there exists a function $v : I^N \to \mathbb{R}^N$ such that: $\Phi(\mathbf{x})\, v(\mathbf{x}) = 0$ and $v(\mathbf{x}) \neq 0$, ∀ $\mathbf{x}$.
Due to the orthogonality of the functions in F, no non-trivial linear combination of them vanishes identically in I, and consequently, v may not be a constant function. However, if v is not constant, there exist distinct vectors $\mathbf{x}^{(1)}, ..., \mathbf{x}^{(k)}$ with dimension N such that no non-zero vector belongs to the null spaces of all the matrices $\Phi(\mathbf{x}^{(1)}), ..., \Phi(\mathbf{x}^{(k)})$. This means that the space generated by the rows of these matrices stacked altogether has dimension N, so we may choose N linearly-independent rows from this stacked matrix. Since each row is defined by a single element in one of the vectors $\mathbf{x}^{(1)}, ..., \mathbf{x}^{(k)}$, choosing N linearly-independent rows corresponds to finding a dataset $\mathbf{x}^{*}$ with size N such that $\Phi(\mathbf{x}^{*})$ has full rank, contradicting the initial hypothesis.

Now, we may proceed to the proof of Theorem 2. Let VC(l) denote the VC dimension of the class of classifiers l. Firstly, we are going to prove that VC(l) ≤ 2N + 1. If we proceed as in the proof of Theorem 1, φ may be defined as a feature map into $\mathbb{R}^{2N}$, by suppressing all components whose coefficients are zero. Therefore, f becomes a hyperplane in $\mathbb{R}^{2N}$, and so, it cannot shatter more than 2N + 1 data points.

Now, it suffices to prove that VC(l) ≥ 2N + 1. Let us denote the period of g by T and consider a dataset with $\tilde{N} = 2N$ examples, namely $\mathbf{x} = (x_1, ..., x_{\tilde{N}})^{\top}$, where $0 \leq x_i < T$, ∀i. Like before, we obtain φ as in the proof of Theorem 1, and we build the $\tilde{N} \times \tilde{N}$ matrix Φ, defined as:

$$\Phi = \begin{bmatrix} \varphi(x_1)^{\top} \\ \vdots \\ \varphi(x_{\tilde{N}})^{\top} \end{bmatrix}.$$

By further defining $f(\mathbf{x}|w_0, w_d) = \big(f(x_1|w_0, w_d), ..., f(x_{\tilde{N}}|w_0, w_d)\big)^{\top}$, the following equality is straightforward to check:

$$f(\mathbf{x}|w_0, w_d) = w_0 \oplus \Phi\, w_d,$$

where ⊕ denotes the operation of summing the scalar on the left-hand side to every element of the vector on the right-hand side. Let us denote the ground truth label of each $x_i$ as $y_i \in \{-1, 1\}$. Because the elements of φ(x) form a set of orthogonal functions, we know from Lemma A1 that there exists a dataset $\mathbf{x}$ such that $\Phi(\mathbf{x})$ has full rank. Thus, from now on, assume this condition holds. This assumption ensures that, for all possible combinations of $y_i$ values, there exists a $w_d \in \mathbb{R}^{\tilde{N}}$ that satisfies:

$$\Phi\, w_d = (y_1, ..., y_{\tilde{N}})^{\top}.$$

Clearly, using this $w_d$, the classifier labels all $\tilde{N}$ data points correctly provided that $|w_0| < 1$. Now, suppose we have one more data point $x_{\tilde{N}+1} \in [0, T)$, with ground truth label $y_{\tilde{N}+1} \in \{-1, 1\}$. Assume, furthermore, that $x_{\tilde{N}+1}$ is such that $|w_d^{\top} \varphi(x_{\tilde{N}+1})| < 1$. The existence of such $x_{\tilde{N}+1}$ is guaranteed, since the elements of φ are continuous functions with zero mean value. By setting $w_0 = y_{\tilde{N}+1}\, \delta - w_d^{\top} \varphi(x_{\tilde{N}+1})$, for any arbitrary $0 < \delta < 1 - |w_d^{\top} \varphi(x_{\tilde{N}+1})|$, the classifier labels all $\tilde{N} + 1$ data points correctly. Thus, VC(l) ≥ $\tilde{N} + 1$ = 2N + 1.
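The steps of the lower-bound argument can be traced numerically for the cosine kernel (N = 1, so two initial points and one additional point). The sketch below is illustrative: the two initial points, the search grid, and the slightly stricter bound of 0.9 (instead of 1, to stay away from rounding effects) are all assumptions of this example.

```python
import itertools
import math

def phi(x):
    """Feature map of the cosine kernel."""
    return (math.cos(x), math.sin(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_2x2(rows, y):
    """Solve the 2x2 linear system rows @ w = y with Cramer's rule."""
    (a, b), (c, d) = rows
    det = a * d - b * c
    return ((y[0] * d - b * y[1]) / det, (a * y[1] - c * y[0]) / det)

X2 = (0.0, math.pi / 2.0)  # initial points; their feature matrix has full rank
GRID = [2.0 * math.pi * i / 360 for i in range(1, 360)]

def labels_realized(y, y3):
    """Follow the proof: fit w_d on two points, then pick a third point and w0."""
    w_d = solve_2x2([phi(x) for x in X2], y)
    # Third point with |w_d . phi(x3)| safely below 1.
    x3 = next(x for x in GRID if abs(dot(w_d, phi(x))) < 0.9)
    s3 = dot(w_d, phi(x3))
    delta = 0.5 * (1.0 - abs(s3))
    w0 = y3 * delta - s3
    points, labels = X2 + (x3,), tuple(y) + (y3,)
    return all((w0 + dot(w_d, phi(x))) * t > 0 for x, t in zip(points, labels))

all_ok = all(labels_realized(y, y3)
             for y in itertools.product((-1, 1), repeat=2)
             for y3 in (-1, 1))
```

As in the proof, the third point is selected after $w_d$ has been fixed, so the sketch checks each step of the construction rather than exhibiting a single fixed set of three shattered points.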

Appendix B. Summary of the Datasets
We give a summary of the main characteristics of the datasets used in this work, including number of features per type (i.e., Directional (Dir), Linear (Lin), Discrete (Disc)) and the number of samples per dataset (#).