Generalized Eigenvalue Proximal Support Vector Machine for Functional Data Classification

Abstract: Functional data analysis has become a research hotspot in the field of data mining. Traditional data mining methods regard functional data as discrete and limited observation sequences, ignoring their continuity. In this paper, functional data classification is addressed by proposing a functional generalized eigenvalue proximal support vector machine (FGEPSVM). Specifically, we find two nonparallel hyperplanes in function space: a positive functional hyperplane and a negative functional hyperplane. The former is closest to the positive functional data and furthest from the negative functional data, while the latter has the opposite properties. By introducing an orthonormal basis, the problems in function space are transformed into ones in vector space. It should be pointed out that higher-order derivative information is applied in two ways: we use the derivatives alone, or a weighted linear combination of the original function and the derivatives. It can be expected that using more data information improves the classification accuracy. Experiments on artificial datasets and benchmark datasets show the effectiveness of our FGEPSVM for functional data classification.


Introduction
The concept of functional data was first put forward by Ramsay, who showed that functional data can be obtained through advanced equipment [1]. Such data can be regarded as dynamic and continuous, rather than static and discrete as in traditional data analysis. Ramsay and Dalzell further proposed a systematic tool for functional data analysis (FDA) and extended some traditional analysis methods [2]. Ramsay and Silverman popularized functional data analysis, which has attracted widespread attention over the last two decades [3]. In recent years, with the improvement of data acquisition technology and data storage capacity, more and more complex data are constantly appearing, presented in the form of functions. These data appear in every aspect of human production and life, such as speech recognition, spectral analysis, meteorology, etc. Functional data contain more complete and sufficient information, which can avoid the information loss caused by discretization. Therefore, they have become a research hotspot [4][5][6][7][8].
We deal with the problem of classifying functional data. There are numerous functional data classification problems [9][10][11][12][13]. For example, a doctor can determine whether the arteries of a patient are narrowed based on Doppler ultrasonography, and marble can be automatically graded using its spectral curve. Many scholars have researched this area. Rossi and Villa showed how to use support vector machines (SVMs) for functional data classification [14]. Li and Yu suggested a classification and feature extraction method for functional data where the predictor variables are curves; it combines classical linear discriminant analysis and SVM [15]. Muñoz proposed a functional analysis technique to obtain finite-dimensional representations of functional data; the key idea is to consider each functional datum as a point in a common function space and then project these points onto a Reproducing Kernel Hilbert Space with SVM [16]. Chang proposed a new semi-metric based on wavelet thresholding for functional data classification [17]. Martin-Barragan faced the problem of obtaining an SVM-based classifier for functional data with good classification ability and provided a classifier that is easy to interpret [18]. Blanquero proposed a new approach to optimally select the most informative time instants in multivariate functional data so as to obtain reasonable classification rates [19]. Moreover, other researchers have applied different approaches to functional data classification problems [20][21][22].
SVMs are among the most common methods for classification. They are generalized linear classifiers trained by supervised learning, whose decision boundary is the maximum-margin hyperplane of the training set [23]. The pioneering work can be traced to a study by Vapnik and Lerner in 1963 [24]. Vapnik and Chervonenkis established the linear SVM with a hard margin in 1964 [25]. Boser, Guyon, and Vapnik obtained the nonlinear SVM through the kernel method in 1992 [26]. Cortes and Vapnik proposed the soft-margin nonlinear SVM in 1995 [27], and this study received wide attention and citation after publication. SVM seeks a single hyperplane and ultimately solves a quadratic programming problem in the dual space. As an extension of the SVM, the nonparallel support vector machine (NPSVM) constructs a support hyperplane for each class to describe the difference in data distribution between the classes [28][29][30][31]. The idea was first proposed by Mangasarian in 2006 in the form of the generalized eigenvalue proximal support vector machine (GEPSVM) [32]. The model obtains two nonparallel hyperplanes by solving two generalized eigenvalue problems [33]. GEPSVM has two advantages over SVM. The first is computation time: the former solves a pair of generalized eigenvalue problems, while the latter solves a quadratic programming problem, so the former has lower computational complexity. Second, GEPSVM performs better than SVM on crossed datasets.
This paper considers binary functional data classification. Inspired by the success of GEPSVM, our proposal finds two nonparallel hyperplanes in function space. Each functional hyperplane is generated such that it is close to the functional data in one class while being far away from the other class. This yields an optimization problem in function space, which is then transformed into vector space through an orthonormal basis. It is worthwhile to mention that our methodology is not restricted to the original function: the approach proposed here can be directly applied to higher-order derivatives. Only functional data can employ higher-order derivative information; this is impossible for the discrete data of traditional data analysis, which lack the properties of functions. We mainly consider higher-order derivative information from two aspects. First, using a derivative alone can improve the classification accuracy when the differences between the original functions in the training set are not distinct. Second, we employ a weighted linear combination of derivatives, which contains more information to improve the classification accuracy. The former is a particular case of the latter. Our model is based on the weighted linear combination of derivatives, and this paper considers the original function, the first derivative, and the second derivative in the experiments. The numerical experiments show that higher-order derivative information can be crucial to classifier performance. Moreover, when we apply the weighted linear combination of the original function and the derivatives, the performance of our method is excellent on some datasets as well, which indicates that employing the weighted linear combination is feasible. Two main advantages are obtained from our approach. One is that the classification accuracy of FGEPSVM on some datasets, such as crossed datasets, is better than that of SVM. The other is that the computational complexity of GEPSVM is much lower than that of SVM, especially when dealing with large-scale datasets.
The overall structure of this paper is as follows. Section 1 introduces the research background and significance. Section 2 reviews the related work. Section 3 introduces functional data and proposes our method, FGEPSVM, to solve functional data classification. Section 4 presents the numerical experiments, which are divided into three parts. The first part is an experiment on artificial datasets, which demonstrates the feasibility of the proposed method. In the second part, experiments on classical functional datasets are carried out to further illustrate the effectiveness of our proposal. The last part is the parameter analysis, which discusses the influence of the key parameter on classification accuracy. Finally, the summary briefly describes the content of this paper and the proposed methods.

Related Work
Here, we introduce the basic notation. Throughout this paper, uppercase and lowercase characters are used to denote matrices and vectors, respectively. R^{m×n} and R^n denote the set of m × n matrices and n-dimensional vectors, respectively. The inner product of two vectors x and y in the n-dimensional real space R^n is defined by ⟨x, y⟩ = x^T y. Furthermore, the inner product of two functions x(t) and y(t) in the square-integrable function space L²(T) is denoted by ⟨x(t), y(t)⟩ = ∫_T x(t)y(t) dt. The norm in this paper refers to the 2-norm of a vector, defined by ‖x‖ = (x^T x)^{1/2} = (Σ_{i=1}^{n} x_i²)^{1/2}, where x = (x_1, x_2, ..., x_n)^T.
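As a concrete illustration of the two inner products, the following sketch (using hypothetical vectors and functions, not data from the paper) computes ⟨x, y⟩ = x^T y in R^n and approximates ⟨x(t), y(t)⟩ = ∫_T x(t)y(t) dt by trapezoidal quadrature:

```python
import numpy as np

# Vector inner product <x, y> = x^T y in R^n.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
vec_ip = float(x @ y)  # 1*4 + 2*5 + 3*6 = 32

def trapezoid(y_vals, t_vals):
    """Trapezoidal quadrature of samples y_vals on the grid t_vals."""
    return float(np.sum((y_vals[1:] + y_vals[:-1]) * np.diff(t_vals)) / 2.0)

# Function inner product <x(t), y(t)> = integral of x(t)*y(t) over T,
# approximated on a fine grid of T = [0, pi]; here the exact value is 0.
t = np.linspace(0.0, np.pi, 2001)
func_ip = trapezoid(np.sin(t) * np.cos(t), t)
```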
Our work is closely related to the generalized eigenvalue proximal support vector machine (GEPSVM) [32]. Given a training set

T = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)},  (1)

where x_i ∈ R^n and y_i ∈ {+1, −1}, GEPSVM seeks two nonparallel hyperplanes

f_+(x) = w_+^T x + b_+ = 0  (2)

and

f_−(x) = w_−^T x + b_− = 0,  (3)

where w_± ∈ R^n and b_± ∈ R. The positive hyperplane should be closest to the data of class +1 and furthest from the data of class −1, and the negative hyperplane has the opposite properties. This leads to the optimization problems

min_{(w_+, b_+) ≠ 0}  ( Σ_{y_i = +1} (w_+^T x_i + b_+)² + Ω(f_+) ) / Σ_{y_i = −1} (w_+^T x_i + b_+)²  (4)

and

min_{(w_−, b_−) ≠ 0}  ( Σ_{y_i = −1} (w_−^T x_i + b_−)² + Ω(f_−) ) / Σ_{y_i = +1} (w_−^T x_i + b_−)²,  (5)

where Ω(f_+) and Ω(f_−) are regularization terms.
We organize the m_+ data points of class +1 as X_+ = (x_1, ..., x_{m_+})^T ∈ R^{m_+ × n} and the m_− data points of class −1 as X_− = (x_{m_+ + 1}, ..., x_{m_+ + m_−})^T ∈ R^{m_− × n}. Selecting the Tikhonov regularization terms [34] to be Ω(f_±) = δ‖z_±‖² with a nonnegative parameter δ, where z_± = (w_±^T, b_±)^T, problems (4) and (5) can be written as

min_{z_+ ≠ 0} ( ‖X_+ w_+ + e_+ b_+‖² + δ‖z_+‖² ) / ‖X_− w_+ + e_− b_+‖²  (6)

and

min_{z_− ≠ 0} ( ‖X_− w_− + e_− b_−‖² + δ‖z_−‖² ) / ‖X_+ w_− + e_+ b_−‖²,  (7)

where e_+ and e_− are all-ones vectors with length m_+ and m_−, respectively. Starting from the identity matrix I of appropriate dimensions, define the symmetric matrices G, H, R, S ∈ R^{(n+1)×(n+1)}

G = [X_+  e_+]^T [X_+  e_+] + δI,   H = [X_−  e_−]^T [X_−  e_−],  (8)

R = [X_−  e_−]^T [X_−  e_−] + δI,   S = [X_+  e_+]^T [X_+  e_+],  (9)

and the vectors z_+ = (w_+^T, b_+)^T, z_− = (w_−^T, b_−)^T ∈ R^{n+1}; then the optimization problems become

min_{z_+ ≠ 0} (z_+^T G z_+) / (z_+^T H z_+)  (10)

and

min_{z_− ≠ 0} (z_−^T R z_−) / (z_−^T S z_−).  (11)

Note that the objective functions in (10) and (11) are known as Rayleigh quotients, which can be minimized through a pair of generalized eigenvalue problems

G z = λ H z  (12)

and

R z = λ S z.  (13)

Obviously, the global minimum of (10) is achieved at the eigenvector z*_+ of the generalized eigenvalue problem (12) corresponding to the smallest eigenvalue. Similarly, the global minimum z*_− of (11) can be obtained from (13). At last, the solutions z*_± = ((w*_±)^T, b*_±)^T determine the two nonparallel hyperplanes. A new sample x ∈ R^n is assigned to class +1 or class −1 according to which of the nonparallel hyperplanes it is closer to, i.e.,

class(x) = arg min_{±} |w*_±^T x + b*_±| / ‖w*_±‖.  (14)
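To make the computation concrete, here is a minimal sketch with a toy two-dimensional dataset of our own invention (not from the paper). It builds the Tikhonov-regularized matrices and solves the generalized eigenvalue problem; since H is invertible in this example, G z = λH z is reduced to an ordinary eigenvalue problem for H⁻¹G:

```python
import numpy as np

def gepsvm_plane(A, B, delta):
    """Hyperplane z = (w, b) close to the rows of A and far from the rows
    of B: minimize z'Gz / z'Hz, i.e. solve G z = lambda H z and take the
    eigenvector of the smallest eigenvalue."""
    E = np.hstack([A, np.ones((A.shape[0], 1))])       # [A  e]
    F = np.hstack([B, np.ones((B.shape[0], 1))])       # [B  e]
    G = E.T @ E + delta * np.eye(E.shape[1])           # Tikhonov-regularized
    H = F.T @ F
    vals, vecs = np.linalg.eig(np.linalg.solve(H, G))  # H^{-1} G z = lambda z
    return np.real(vecs[:, np.argmin(np.real(vals))])

def dist_to_plane(z, x):
    """Distance |w'x + b| / ||w|| from the point x to the plane z = (w, b)."""
    w, b = z[:-1], z[-1]
    return abs(w @ x + b) / np.linalg.norm(w)

# Toy data: class +1 near the line x2 = x1, class -1 near x2 = -x1.
Xp = np.array([[1.0, 1.0], [2.0, 2.1], [-1.0, -0.9]])
Xm = np.array([[1.0, -1.0], [2.0, -2.1], [-1.0, 1.1]])
z_pos = gepsvm_plane(Xp, Xm, delta=1e-4)  # proximal plane for class +1
z_neg = gepsvm_plane(Xm, Xp, delta=1e-4)  # proximal plane for class -1
```

A new sample is then labeled by the nearer plane; for instance, the point (3, 3) lies much closer to the positive plane than to the negative one. When H may be singular, a generalized eigensolver such as scipy.linalg.eig(G, H) can be used instead of the explicit inverse.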

Functional GEPSVM
Now we are in a position to address the functional data classification problem. Given a training set

T = {(x_1(t), y_1), ..., (x_m(t), y_m)},  (15)

where x_i(t) is functional data from the square-integrable space L²(T) and y_i ∈ {+1, −1}; T ⊂ R is the interval which is the domain of definition of x_i(t). Corresponding to (2) and (3), we seek two nonparallel hyperplanes in the function space L²(T),

f_+(x(t)) = ⟨w_+(t), x(t)⟩ + b_+ = 0  (16)

and

f_−(x(t)) = ⟨w_−(t), x(t)⟩ + b_− = 0,  (17)

where w_±(t) ∈ L²(T) and b_± ∈ R. It is required that each functional hyperplane is generated such that it is close to the functional data in one class while far away from the functional data in the other class. Corresponding to (4) and (5), the optimization problems are formulated as

min_{(w_+(t), b_+) ≠ 0} ( Σ_{y_i = +1} (⟨w_+(t), x_i(t)⟩ + b_+)² + Ω(f_+) ) / Σ_{y_i = −1} (⟨w_+(t), x_i(t)⟩ + b_+)²  (18)

and

min_{(w_−(t), b_−) ≠ 0} ( Σ_{y_i = −1} (⟨w_−(t), x_i(t)⟩ + b_−)² + Ω(f_−) ) / Σ_{y_i = +1} (⟨w_−(t), x_i(t)⟩ + b_−)²,  (19)

where Ω(f_+) and Ω(f_−) are regularization terms. Our methodology is not restricted to the original functional data: its j-th derivative x_i^{(j)}(t) can be used as well, where x_i^{(0)}(t) is defined as the original functional data x_i(t). More generally, it is more flexible to replace the original functional data x_i(t) by the weighted linear combination

x̃_i(t) = Σ_{j=0}^{d} c_j x_i^{(j)}(t),   with Σ_{j=0}^{d} c_j = 1 and c_j ≥ 0.  (20)

Correspondingly, the original functional data x_i(t) in (18) and (19) is replaced by the weighted linear combination x̃_i(t) as well. This leads to the following optimization problems

min_{(w_+(t), b_+) ≠ 0} ( Σ_{y_i = +1} (⟨w_+(t), x̃_i(t)⟩ + b_+)² + Ω(f_+) ) / Σ_{y_i = −1} (⟨w_+(t), x̃_i(t)⟩ + b_+)²  (21)

and

min_{(w_−(t), b_−) ≠ 0} ( Σ_{y_i = −1} (⟨w_−(t), x̃_i(t)⟩ + b_−)² + Ω(f_−) ) / Σ_{y_i = +1} (⟨w_−(t), x̃_i(t)⟩ + b_−)²,  (22)

where Ω(f_+) and Ω(f_−) are regularization terms. Next, we transfer the optimization problems in function space into ones in vector space. For this, we need the following proposition. Let {φ_k(t)}_{k=1}^{∞} be a complete orthonormal basis in L²(T), and suppose this basis, truncated after l terms, is used to expand two functions u(t), v(t) ∈ L²(T):

u(t) ≈ Σ_{k=1}^{l} u_k φ_k(t) = u^T φ(t),   v(t) ≈ Σ_{k=1}^{l} v_k φ_k(t) = v^T φ(t),  (23)

where φ(t) = (φ_1(t), φ_2(t), ..., φ_l(t))^T, u = (u_1, u_2, ..., u_l)^T and v = (v_1, v_2, ..., v_l)^T. Then, according to the property of the orthonormal basis, we get

⟨u(t), v(t)⟩ = ∫_T u(t)v(t) dt ≈ u^T v.  (24)

From the above proposition, the inner products between x̃_i(t) and w_+(t), w_−(t) can be approximated by the inner products between their coefficient vectors. More precisely, for the orthonormal basis φ(t) = (φ_1(t), φ_2(t), ..., φ_l(t))^T in L²(T), we have

⟨x̃_i(t), w_+(t)⟩ ≈ Σ_{j=0}^{d} c_j Σ_{k=1}^{l} x_{ik}^{(j)} w_{+k}  (25)

and, in the same way,

⟨x̃_i(t), w_−(t)⟩ ≈ Σ_{j=0}^{d} c_j Σ_{k=1}^{l} x_{ik}^{(j)} w_{−k},  (26)

where c = (c_0, c_1, ..., c_d)^T is the coefficient vector of the weighted linear combination, x_{ik}^{(j)} is the coefficient of the j-th derivative of the i-th functional datum corresponding to the k-th basis function, w_+ = (w_{+1}, w_{+2}, ..., w_{+l})^T and w_− = (w_{−1}, w_{−2}, ..., w_{−l})^T. Therefore, (21) and (22) in function space can be transformed into ones in vector space,

min_{(w_+, b_+) ≠ 0} ( Σ_{y_i = +1} (Σ_j c_j Σ_k x_{ik}^{(j)} w_{+k} + b_+)² + Ω(f_+) ) / Σ_{y_i = −1} (Σ_j c_j Σ_k x_{ik}^{(j)} w_{+k} + b_+)²  (27)

and

min_{(w_−, b_−) ≠ 0} ( Σ_{y_i = −1} (Σ_j c_j Σ_k x_{ik}^{(j)} w_{−k} + b_−)² + Ω(f_−) ) / Σ_{y_i = +1} (Σ_j c_j Σ_k x_{ik}^{(j)} w_{−k} + b_−)².  (28)

By defining the coefficient matrices

C_+ = (c^T ⊗ I_+) (X_+^{(0)T}, X_+^{(1)T}, ..., X_+^{(d)T})^T ∈ R^{m_+ × l}  (29)

and

C_− = (c^T ⊗ I_−) (X_−^{(0)T}, X_−^{(1)T}, ..., X_−^{(d)T})^T ∈ R^{m_− × l},  (30)

where X_±^{(j)} ∈ R^{m_± × l} collects the basis coefficients x_{ik}^{(j)} of the j-th derivatives, ⊗ is the Kronecker product, and I_+ and I_− are identity matrices of appropriate dimensions, Equations (27) and (28) are transformed into the following form:

min_{z_+ ≠ 0} ( ‖C_+ w_+ + e_+ b_+‖² + δ‖z_+‖² ) / ‖C_− w_+ + e_− b_+‖²  (31)

and

min_{z_− ≠ 0} ( ‖C_− w_− + e_− b_−‖² + δ‖z_−‖² ) / ‖C_+ w_− + e_+ b_−‖²,  (32)

where e_+ and e_− are all-ones vectors with length m_+ and m_−, respectively, and the Tikhonov regularization terms are Ω(f_±) = δ‖z_±‖² with z_± = (w_±^T, b_±)^T. To date, we have transferred the optimization problems from function space into vector space, and we can solve (31) and (32) according to the method in vector space.
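The orthonormal-basis proposition can be checked numerically. The sketch below uses a hypothetical three-element orthonormal Fourier-type basis on T = [0, 2π] (not a basis used in the paper's experiments) and verifies that the L² inner product of two basis expansions agrees with the inner product of their coefficient vectors:

```python
import numpy as np

# Orthonormal basis on T = [0, 2*pi]:
# phi_1 = 1/sqrt(2*pi), phi_2 = sin(t)/sqrt(pi), phi_3 = cos(t)/sqrt(pi).
t = np.linspace(0.0, 2.0 * np.pi, 4001)
Phi = np.vstack([np.full_like(t, 1.0 / np.sqrt(2.0 * np.pi)),
                 np.sin(t) / np.sqrt(np.pi),
                 np.cos(t) / np.sqrt(np.pi)])

u_coef = np.array([1.0, 2.0, -1.0])   # u(t) = sum_k u_k phi_k(t)
v_coef = np.array([0.5, -1.0, 3.0])   # v(t) = sum_k v_k phi_k(t)
u_vals = u_coef @ Phi
v_vals = v_coef @ Phi

def trapezoid(y_vals, t_vals):
    """Trapezoidal quadrature of samples y_vals on the grid t_vals."""
    return float(np.sum((y_vals[1:] + y_vals[:-1]) * np.diff(t_vals)) / 2.0)

l2_ip = trapezoid(u_vals * v_vals, t)   # <u(t), v(t)> in L^2(T)
vec_ip = float(u_coef @ v_coef)         # u^T v = 0.5 - 2 - 3 = -4.5
```

The two values agree up to quadrature error, which is what licenses replacing functional inner products by coefficient-vector inner products in (25) and (26).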
Starting from the identity matrix I of appropriate dimensions, define the symmetric matrices P, Q, U, V ∈ R^{(l+1)×(l+1)}

P = [C_+  e_+]^T [C_+  e_+] + δI,   Q = [C_−  e_−]^T [C_−  e_−],  (33)

U = [C_−  e_−]^T [C_−  e_−] + δI,   V = [C_+  e_+]^T [C_+  e_+],  (34)

and the vectors z_+, z_− ∈ R^{l+1}; then we can reformulate (31) and (32) as

min_{z_+ ≠ 0} (z_+^T P z_+) / (z_+^T Q z_+)  (35)

and

min_{z_− ≠ 0} (z_−^T U z_−) / (z_−^T V z_−).  (36)

The above are exactly Rayleigh quotients, and the global optimal solutions to the minimization problems can be readily computed by solving the following two generalized eigenvalue problems

P z = λ Q z  (37)

and

U z = λ V z.  (38)

More precisely, the minimum of (35) is achieved at the eigenvector z*_+ of (37) corresponding to the smallest generalized eigenvalue. The same is true for the optimal solution z*_− of (36), obtained from (38).
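The coefficient matrices C_± built from the weighted linear combination can be assembled with a Kronecker product. The following sketch is one plausible realization under our reading of the definition, with made-up coefficients for m_+ = 2, d = 1, l = 2; it forms C_+ as (c^T ⊗ I) applied to the vertically stacked coefficient matrices and checks that this equals the entrywise weighted sum Σ_j c_j x_ik^(j):

```python
import numpy as np

m_plus, d, l = 2, 1, 2
c = np.array([0.7, 0.3])        # weights c_0, c_1; they sum to 1

# Basis coefficients of the original functions (X0) and of their first
# derivatives (X1): one row per functional datum, one column per basis.
X0 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
X1 = np.array([[0.1, 0.2],
               [0.3, 0.4]])

# (c^T kron I_{m_plus}) applied to the stacked coefficient matrices:
# row i of C_plus holds the basis coefficients of c_0*x_i(t) + c_1*x_i'(t).
stacked = np.vstack([X0, X1])                  # shape ((d+1)*m_plus, l)
C_plus = np.kron(c, np.eye(m_plus)) @ stacked  # shape (m_plus, l)
```

The result is simply 0.7·X0 + 0.3·X1; the Kronecker form is a compact way to express that weighted sum for arbitrary d.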
At last, the solutions z*_± = ((w*_±)^T, b*_±)^T determine the two functional hyperplanes, with w_±(t) = Σ_{k=1}^{l} w*_{±k} φ_k(t). A new functional datum x(t) ∈ L²(T) can be represented by the orthonormal basis through its coefficient vector x ∈ R^l and is assigned to class +1 or class −1 according to which functional hyperplane it is closer to, i.e.,

class(x(t)) = arg min_{±} |w*_±^T x + b*_±| / ‖w*_±‖.  (39)
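The decision rule can be sketched as follows, with hypothetical solutions z*_± in R^{l+1} (l = 2) rather than ones fitted to real data:

```python
import numpy as np

def classify(x_coef, z_plus, z_minus):
    """Assign a new functional datum, represented by its basis coefficient
    vector x_coef, to the class of the nearer functional hyperplane."""
    def dist(z):
        w, b = z[:-1], z[-1]
        return abs(w @ x_coef + b) / np.linalg.norm(w)
    return +1 if dist(z_plus) <= dist(z_minus) else -1

# Hypothetical solutions: w_+ = (1, -1), b_+ = 0 and w_- = (1, 1), b_- = 0.
z_plus = np.array([1.0, -1.0, 0.0])
z_minus = np.array([1.0, 1.0, 0.0])
label = classify(np.array([2.0, 1.9]), z_plus, z_minus)  # lies near the + plane
```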

Numerical Experiments
This section illustrates the performance of our proposal on three artificial datasets and six benchmark datasets. Comparing the performance of FGEPSVM against FSVM [14], we analyze the improvements obtained when, instead of the functional data alone, up to d derivatives of the functional data are also included in the input. For this reason, we experiment with three different values of d, namely d = 0, 1, 2; this is a special case of our proposed weighted linear combination. Many scholars have used higher-order derivative information of functional data to improve classification accuracy. We further consider the weighted linear combination of the original function, the first derivative, and the second derivative. There are three corresponding parameters, c_0, c_1, and c_2, and we adjust these three parameters to select the optimal parameter combination.
To obtain stable results, k-fold cross-validation is performed. The number of folds, k, depends on the dataset considered. In particular, if a dataset is small, k coincides with the number of individuals; that is to say, leave-one-out is applied. For a large-scale dataset, k = 10 has been chosen. This paper considers a dataset to be small if its cardinality is smaller than 100 individuals. We also apply k-fold cross-validation to select the optimal parameters in the training process. There is an essential parameter in our method, the regularization parameter δ; at the end of the numerical experiments, we analyze its effect on the classification accuracy. The experiments are run in MATLAB on a computer with an 11th Gen Intel(R) Core(TM) i5-1135G7 processor.
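The fold-selection rule described above can be written down directly; the helper names below are our own, not from the paper:

```python
def choose_k(n_samples):
    """Leave-one-out for small datasets (fewer than 100 samples),
    10-fold cross-validation otherwise."""
    return n_samples if n_samples < 100 else 10

def fold_indices(n_samples, k):
    """Partition the indices 0..n_samples-1 into k disjoint folds."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds

k_small = choose_k(35)    # e.g. 35 curves -> leave-one-out (k = 35)
k_large = choose_k(215)   # e.g. 215 curves -> 10-fold
```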

Artificial Datasets
In order to illustrate the performance of our method, we compare the two classification methods on artificial datasets. We describe the following three artificial datasets.
Here U_+ is a uniform random variable on the interval (0, 1), U_− is a uniform random variable on the interval (0, 1), and the noise term is Gaussian. The positive and negative functions in Example 1 and Example 2 are evaluated at 50 points between −1 and 1, respectively. In the same way, the positive and negative functions in Example 3 are evaluated at 100 points between 0 and 2π. Figure 1 displays the samples of the three artificial numerical examples. We perform 10-fold cross-validation on these three artificial numerical examples to obtain the final experimental results, presented as boxplots in Figure 2. At the same time, we record the CPU time of the two methods on the three datasets, as shown in Figure 3. It should be noted that the CPU times of the two methods differ greatly in Example 3, so that the time of FGEPSVM can hardly be seen when drawing the bar chart; we therefore transform the time, plotting e^t instead of t, where t is the CPU time. We can see from Figure 2 that the results of the method in this paper are superior to FSVM on all three artificial datasets. The accuracies of FGEPSVM on the three artificial datasets are 0.7520, 0.7130, and 0.8985, respectively, while the corresponding results of FSVM are 0.6280, 0.6160, and 0.8780. Furthermore, our method is more stable on these three datasets. Figure 3 shows that our method has great advantages in terms of computation time. It is worth noting that nonparallel hyperplanes have certain advantages in dealing with crossed datasets; therefore, we would like to explore whether the method proposed in this paper has the same advantages for crossed functional data in function space, and the result of Example 1 illustrates this point. The above artificial examples demonstrate the feasibility and effectiveness of the proposed method.

Benchmark Datasets
We start with a detailed look at the six benchmark datasets. The Coffee dataset is a two-class problem to distinguish between Robusta and Arabica coffee beans. The ECG dataset contains the electrical activity recorded during one heartbeat, and the two classes are a normal heartbeat and a myocardial infarction. The Ham dataset includes measurements from Spanish and French dry-cured hams. The above three datasets are from the UCR time series classification and clustering website at http://www.cs.ucr.edu/~eamonn/time_series_data/ (accessed on 7 July 2020). The Weather dataset consists of one year of daily temperature measurements from 35 Canadian weather stations, and the two classes are defined depending on whether the total yearly amount of precipitation is above or below 600 [3]. The Growth dataset consists of 93 growth curves for a sample of 39 boys and 54 girls, with observations measured at a set of thirty-one ages from one to eighteen years old [3]. The Tecator dataset consists of 215 near-infrared absorbance spectra of meat samples. Each observation is a 100-channel absorbance spectrum in the wavelength range 850–1050 nm. The goal here is to predict whether the fat percentage is greater than 20% from the spectra. The dataset is available at http://lib.stat.cmu.edu/datasets/tecator (accessed on 1 January 2020). Table 1 shows, for each dataset, the number of samples, the dimension, and the number of functions in each class. We conduct experiments to evaluate the performance of FGEPSVM on the six functional datasets. We record the averaged accuracy and CPU time on the testing set using the information given by the data (d = 0), the first derivative (d = 1), and the second derivative (d = 2). Leave-one-out is performed on the Weather, Coffee, and Growth datasets, whereas 10-fold cross-validation is implemented on the ECG, Ham, and Tecator datasets. The classification results of the two methods are given in Table 2, with the best classification accuracy highlighted in bold type. In Table 2, FGEPSVM
performed better than FSVM on most datasets. It is important to note that when the two methods produce the same accuracy on a dataset, we consider our method more advantageous. This is based on computation time: FGEPSVM solves a pair of generalized eigenvalue problems, while FSVM solves a quadratic programming problem, and on the same dataset the former is significantly faster. Therefore, when the classification results are the same, we believe that FGEPSVM is more advantageous than FSVM. Furthermore, FSVM takes much more time than FGEPSVM on large-scale datasets, so our approach has a clear advantage when dealing with large-scale datasets.
Moreover, using higher-order derivative information also has a particular impact on classification performance. That higher-order derivative information can improve classification accuracy is reflected in the Weather, Growth, and Tecator datasets. The classification accuracy of the second derivative on the Weather dataset is much higher than that of the original function and the first derivative. Similarly, the classification results of the first derivative on the Growth dataset are much higher than those of the original function and the second derivative. Compared with the classification accuracy of the original function, using the first derivative and the second derivative improves the results on the Tecator dataset. However, not all datasets benefit from higher-order derivative information. Therefore, when solving functional data classification, we can appropriately consider using higher-order derivative information to improve the classification accuracy. In this paper, the first derivative and second derivative are considered in the numerical experiments; any derivative can be used, depending on the characteristics of the datasets.
We further apply the weighted linear combination of derivatives. The dimension of the weight coefficient vector c depends on how many derivatives are used. This article uses the original function, the first derivative, and the second derivative, so c is a three-dimensional vector with weight coefficients c_0, c_1, c_2. Since their sum is one, only two parameters need to be selected. Parameters c_0 and c_1 are selected from the values {0, 1/12, ..., 11/12, 1}. The optimal weight vector is selected through k-fold cross-validation so as to obtain the highest classification accuracy. Table 3 shows the classification results corresponding to the optimal parameters. In Table 3, the results in bold indicate that the classification results obtained by the weighted linear combination are better than those obtained by any single derivative. The classification result of the original function on the Coffee dataset is 0.9821, while the corresponding classification accuracy of the weighted linear combination is 1.0000. The classification accuracy corresponding to the first derivative on the Growth dataset is 0.9570, while the accuracy improves by 0.0322 after applying the weighted linear combination. Similarly, using the weighted linear combination on the Ham and ECG datasets also improves the accuracy compared to the previous results. This indicates that employing the weighted combination of derivatives is feasible and practical.
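Enumerating the admissible weight triples on this grid is straightforward; a sketch (variable names are ours):

```python
import itertools

# c0 and c1 range over {0, 1/12, ..., 11/12, 1}; c2 = 1 - c0 - c1 is
# determined by the constraint c0 + c1 + c2 = 1 and must be nonnegative.
grid = [i / 12 for i in range(13)]
combos = [(c0, c1, 1.0 - c0 - c1)
          for c0, c1 in itertools.product(grid, grid)
          if c0 + c1 <= 1.0 + 1e-12]
```

Each triple would then be scored by k-fold cross-validation and the best-scoring combination kept.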

Parameter Sensitivity Analysis
There is a key parameter δ in FGEPSVM. To analyze its sensitivity, we study the effect of the parameter on classification accuracy. The parameter δ is selected from the values {10^i | i = −12, −11, ..., 12}. For this study, we conduct experiments on the datasets, as seen in Figure 7. We find that FGEPSVM is sensitive to δ, which indicates that δ has a considerable impact on the performance of our proposal and needs to be adjusted carefully. It can also be seen that the classification accuracy on most datasets is best when δ is relatively small.
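The sweep over δ can be sketched as below. The evaluate callback stands in for the cross-validated accuracy of FGEPSVM at a given δ; the toy score here is entirely made up (peaking at δ = 10⁻⁶) purely to exercise the selection loop:

```python
import math

def select_delta(deltas, evaluate):
    """Return the delta with the highest (e.g. cross-validated) score."""
    return max(deltas, key=evaluate)

deltas = [10.0 ** i for i in range(-12, 13)]   # 10^-12, ..., 10^12

def toy_score(delta):
    # Stand-in for CV accuracy; real code would train/test FGEPSVM here.
    return -abs(math.log10(delta) + 6.0)

best_delta = select_delta(deltas, toy_score)
```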

Conclusions
This paper proposes a functional generalized eigenvalue proximal support vector machine, which looks for two nonparallel hyperplanes in function space, making each functional hyperplane close to the functional data in one class and far away from the functional data in the other class. The corresponding optimization problem involves the inner product between two functional data, which can be replaced by the inner product between two coefficient vectors by introducing an orthonormal basis. Thus, the optimization problem in function space is transformed into vector space through this orthonormal basis. Furthermore, our method is not restricted to the original functional data; it can also be applied to higher-order derivatives, whose use is only possible when dealing with functional data. We either use the derivatives alone or employ a weighted linear combination of derivatives, which contains more information; the former is a particular case of the latter. We consider the weighted linear combination of the original function, the first derivative, and the second derivative. The results of the numerical experiments show that higher-order derivative information is critical to the performance of the classifier.

In (15), x_i(t) is replaced by the weighted linear combination of derivatives x̃_i(t) = Σ_{j=0}^{d} c_j x_i^{(j)}(t), where c_j ≥ 0 are the weight coefficients with Σ_{j=0}^{d} c_j = 1. This process yields data of the form {(x̃_i(t), y_i)}, together with Tikhonov regularization terms with a nonnegative parameter δ. For example, supposing m_+ = 2, m_− = 3, d = 1, and l = 2, C_+ ∈ R^{2×2} has entries c_0 x_{ik}^{(0)} + c_1 x_{ik}^{(1)} for i = 1, 2 and k = 1, 2, and C_− ∈ R^{3×2} is defined analogously.

Step 4: Solve the generalized eigenvalue problems (37) and (38) to obtain z*_+ and z*_−.

Figure 1. Samples of the three artificial numerical examples.

Figure 2. Boxplots of the results on artificial numerical examples.

Figure 3. CPU time for two methods on artificial numerical examples.
Samples of ten individuals of each dataset are plotted in Figure 4 (d = 0); Figures 5 (d = 1) and 6 (d = 2) show the first and second derivatives of the curves for the functional datasets, respectively. The functional data in class −1 are plotted with solid blue lines, whereas the functional data in class +1 are depicted with red dashed lines.

Figure 7. The classification performance of FGEPSVM with δ on the datasets.

Table 1. Data description summary.

Table 2. Accuracy results on functional datasets.

Table 3. Accuracy results on the weighted linear combination.