A Novel Twin Support Vector Machine with Generalized Pinball Loss Function for Pattern Classification

Abstract: We introduce a novel twin support vector machine with the generalized pinball loss function (GPin-TSVM) for solving data classification problems that is less sensitive to noise and preserves the sparsity of the solution. In addition, we use a symmetric kernel trick to extend GPin-TSVM to nonlinear classification problems. The developed approach is tested on numerous UCI benchmark datasets, as well as on synthetic datasets, in the experiments. The comparisons demonstrate that our proposed algorithm outperforms existing classifiers in terms of accuracy. Furthermore, we examine the proposed approach in handwritten digit recognition applications, where a convolutional neural network is employed as the automatic feature extractor.


Introduction
Support vector machines (SVMs) have evolved into a potent paradigm for pattern classification and regression over the last decade [1][2][3][4][5]. The SVM received substantial attention within a few years of its inception due to its vast application in a wide variety of fields [6][7][8][9][10][11][12]. The standard SVM determines the parallel hyperplanes with the maximum margin between two classes of samples by minimizing the structural and empirical risks determined by the labeled training data. SVM solves a quadratic programming problem (QPP) through its dual problem to achieve an optimal solution. The standard SVM also faces a major challenge: its computational complexity is approximately of order O(m^3), where m is the number of training samples. As a result, SVM is quite slow when dealing with large-scale problems [13,14].
For large-scale data learning, Catak [15] combined the ELM algorithm and the AdaBoost method to handle large-scale datasets. Moreover, to address the high computational complexity of SVM, Jayadeva et al. [16] suggested a novel machine learning method known as the twin SVM (TSVM). The main idea of the standard TSVM is to find two nonparallel proximal hyperplanes such that each is closer to one of the two classes and lies at a distance of at least one from the other class. TSVM solves two smaller QPPs instead of a single large one as in the classical SVM; in theory, this makes the computational time of TSVM approximately four times faster than that of the standard SVM. In binary classification problems, TSVM not only trains a classifier faster than a standard SVM, but also deals with exemplar imbalance. As a result of its excellent performance, TSVM has become one of the most widely used procedures. TSVM has received increasing attention due to its wide application in various fields, such as text categorization [17], text recognition [18], software defects [19,20], scene classification [21], image recognition [22], speaker recognition [23,24], human action recognition [25], pancreatic cancer early detection [26], and so on. Moreover, TSVM has been widely researched and developed in recent years, and numerous variations have been proposed, such as the twin parametric margin SVM (TPMSVM) [27], twin bounded SVM (TBSVM) [28], weighted Lagrangian TSVM (WLTSVM) [29], least squares TSVM (LSTSVM) [30][31][32], large-scale TSVM [33], sparse pinball TSVM [34], and so on. Furthermore, TSVM is very useful when dealing with datasets that include a large number of data samples, for which the standard SVM is ineffective.
Designing a robust machine learning approach requires, on the one hand, the employment of an appropriate loss function. Different margin-based loss functions have recently been employed in classification and regression problems, such as the 0-1 loss, hinge loss, squared loss, and so on. The hinge loss controls the penalty on the training data points in the standard SVM and TSVM. However, the resulting model has some inherent problems: the objective function in the primal problem is non-differentiable, and the model suffers from imbalanced class information, sensitivity to outliers, and sensitivity to feature noise. To address the non-differentiability of the objective function, a smooth SVM (SSVM) [35] has been proposed, which creates and solves an unconstrained smooth support vector machine reformulation. To handle outliers, Wu and Liu [36] proposed the robust truncated hinge loss SVM (RSVM). For the imbalanced classification problem, Cao and Shen [37] demonstrated a re-sampling strategy that balances training data by combining over-sampling and under-sampling, and in [38], a powerful weighted multi-class least squares TSVM (WMLSTSVM) method was proposed for dealing with imbalance in multi-class data categorization. Regarding noise sensitivity, Huang et al. [39] proposed an SVM model with the pinball loss function (Pin-SVM) to deal with noise sensitivity and resampling instability. The resulting classifier has good properties, such as being less sensitive to noise and being related to the quantile distance. However, sparsity is impossible to attain using Pin-SVM. In order to maintain sparsity, they also proposed an ε-insensitive zone for Pin-SVM. Although this approach improves the sparsity of Pin-SVM, its formulation necessitates specifying the value of ε in advance, and hence a poor choice may have an impact on its performance. Building on these advances, Rastogi et al. [40] recently proposed the modified (ε_1, ε_2)-insensitive zone SVM, which is called the generalized pinball loss SVM. The generalized pinball loss for the SVM model subsumes previous loss functions and provides noise insensitivity, sparsity, and resampling stability. Nevertheless, compared with TSVM, the generalized pinball SVM still requires solving a single large QPP, resulting in higher computational complexity and an inability to handle large-scale problems. As far as we are aware, no articles dealing with the generalized pinball loss function in relation to the standard TSVM for classification problems have been published. The addition of the generalized pinball loss function to the standard TSVM for classification problems is therefore worth investigating.
Motivated by the above-mentioned models, we introduce the standard TSVM with the generalized pinball loss function. The proposed objective function is optimized using the Lagrangian multiplier approach and the Karush-Kuhn-Tucker (KKT) optimality conditions [41]. Two smaller quadratic programming problems (QPPs) are solved, producing two nonparallel classification hyperplanes. Finally, thorough experiments were carried out to evaluate the performance of the proposed GPin-TSVM model.
The following are the main contributions of this paper:
• For pattern classification, we add a generalized pinball loss function to the standard TSVM, resulting in a better classifier model called the generalized pinball loss function-based TSVM (GPin-TSVM);
• We demonstrate in numerical experiments that the proposed GPin-TSVM surpasses existing classifiers in terms of accuracy. We also examine its characteristics, such as noise insensitivity and within-class scatter;
• We examine the applicability of the main techniques of GPin-TSVM to handwritten digit recognition problems, compared with the standard TSVM, Pin-TSVM, and the ε-insensitive zone pinball loss TSVM (IPin-TSVM). Moreover, we use a convolutional neural network (CNN) as an automatic feature extractor, with TSVM, Pin-TSVM, IPin-TSVM, and GPin-TSVM working as binary classifiers that replace the softmax layer of the CNN;
• We perform numerical testing on a synthetic dataset and on numerous UCI benchmark datasets with noise of various variances to illustrate the validity of our proposed GPin-TSVM. The results also show the robustness of the proposed approach, which is less sensitive to noise and retains the sparsity of the solution.
In Section 2, we briefly discuss loss functions, the SVM, the generalized pinball SVM, and the TSVM. In Section 3, we present the new approach, called GPin-TSVM. In Section 4, the properties of the proposed GPin-TSVM are discussed. In Section 5, the efficiency of our proposed GPin-TSVM is compared with the standard TSVM, Pin-TSVM, and IPin-TSVM on synthetic datasets and the UCI machine learning repository, and applications of the proposed GPin-TSVM algorithm to handwritten digit recognition are shown. Conclusions and future recommendations are presented in Section 6.

Related Work and Background
In this section, the standard SVM, TSVM, loss functions, and generalized pinball SVM formulations are briefly described. Interested readers are referred to [28,39,40,42] for a more detailed description.

Support Vector Machine
The difficulty of the SVM model lies in determining the optimal separating hyperplane, or maximal margin hyperplane, that best separates the two classes, so that the model generalizes to new data with accurate classification predictions. Consider a two-class dataset of m data samples S = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)}, where x_i ∈ R^n is a sample with label y_i ∈ {1, −1} for i = 1, 2, . . . , m. The SVM obtains a separating decision function w^T x + b = 0, where w ∈ R^n and b ∈ R, from the following problem:

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.} \quad y_i(w^\top x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1,\dots,m, \tag{1}$$

where the ξ_i are slack variables and C > 0 is the trade-off parameter. Using the Lagrangian multipliers α_i, we obtain its dual QPP as follows:

$$\max_{\alpha} \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j x_i^\top x_j \quad \text{s.t.} \quad \sum_{i=1}^{m}\alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C. \tag{2}$$

After optimizing this dual QPP, and using the N_SV support vectors that satisfy 0 < α_i^* < C to compute the bias b^*, we obtain the following decision function:

$$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{m}\alpha_i^* y_i x_i^\top x + b^*\Big), \tag{3}$$

where α^* denotes the solution of the dual problem (2).
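To make the workflow above concrete, the following minimal sketch (ours, not from the paper) fits a linear soft-margin SVM with scikit-learn and inspects its support vectors; the data and parameter values are illustrative assumptions.

```python
# Minimal illustration (ours) of the standard soft-margin SVM:
# fit a linear SVC and inspect the support vectors that define (3).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)),   # class -1 samples
               rng.normal(+1.0, 0.5, (50, 2))])  # class +1 samples
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0)  # C is the trade-off parameter in (1)
clf.fit(X, y)

# Only the support vectors (with nonzero dual variables) enter the
# decision function (3); the rest of the training set can be discarded.
print("number of support vectors:", clf.n_support_.sum())
print("prediction:", clf.predict([[0.2, -0.1]]))
```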

Twin Support Vector Machine
Consider the dataset S, in which the matrix A ∈ R^{m_1×n} represents the m_1 data samples of class +1 and the matrix B ∈ R^{m_2×n} represents the m_2 data samples of class −1. The TSVM [28] determines two nonparallel hyperplanes defined by

$$x^\top w^{(1)} + b^{(1)} = 0 \quad \text{and} \quad x^\top w^{(2)} + b^{(2)} = 0, \tag{4}$$

where w^(1), w^(2) ∈ R^n and b^(1), b^(2) ∈ R. To obtain the pair of nonparallel hyperplanes, the hinge loss function-based TSVM yields the following pair of QPPs:

$$\min_{w^{(1)},b^{(1)},\xi} \frac{1}{2}\|Aw^{(1)} + e_1 b^{(1)}\|^2 + c_1 e_2^\top\xi \quad \text{s.t.} \quad -(Bw^{(1)} + e_2 b^{(1)}) + \xi \geq e_2, \quad \xi \geq 0, \tag{5}$$

and

$$\min_{w^{(2)},b^{(2)},\xi} \frac{1}{2}\|Bw^{(2)} + e_2 b^{(2)}\|^2 + c_2 e_1^\top\xi \quad \text{s.t.} \quad (Aw^{(2)} + e_1 b^{(2)}) + \xi \geq e_1, \quad \xi \geq 0, \tag{6}$$

where c_1 and c_2 are positive penalty parameters, ξ is a slack vector, and e_1 and e_2 are vectors of ones of appropriate size. The duals of QPPs (5) and (6) can be represented, respectively, as

$$\max_{\alpha}\ e_2^\top\alpha - \frac{1}{2}\alpha^\top Q(P^\top P)^{-1}Q^\top\alpha \quad \text{s.t.} \quad 0 \leq \alpha \leq c_1 e_2, \tag{7}$$

and

$$\max_{\beta}\ e_1^\top\beta - \frac{1}{2}\beta^\top P(Q^\top Q)^{-1}P^\top\beta \quad \text{s.t.} \quad 0 \leq \beta \leq c_2 e_1, \tag{8}$$

where P = [A e_1] and Q = [B e_2]. From the solutions α and β of (7) and (8), respectively, the best separating hyperplanes are given by

$$\begin{bmatrix} w^{(1)} \\ b^{(1)} \end{bmatrix} = -(P^\top P)^{-1}Q^\top\alpha \quad \text{and} \quad \begin{bmatrix} w^{(2)} \\ b^{(2)} \end{bmatrix} = (Q^\top Q)^{-1}P^\top\beta. \tag{9}$$

Depending on which of the two hyperplanes (4) a new sample point x ∈ R^n lies closest to, it is assigned to class i (i = +1 or −1) by

$$\text{class } i = \arg\min_{k=1,2} \frac{|x^\top w^{(k)} + b^{(k)}|}{\|w^{(k)}\|},$$

where |·| denotes the absolute value. In fact, the QPP (5) of the TSVM may be reformulated as the following unconstrained optimization problem [43]:

$$\min_{w^{(1)},b^{(1)}} \frac{1}{2}\|Aw^{(1)} + e_1 b^{(1)}\|^2 + c_1\sum_{j=1}^{m_2} L_{\text{hinge}}\big(1 + (x_j^\top w^{(1)} + b^{(1)})\big), \tag{10}$$

and, likewise for (6),

$$\min_{w^{(2)},b^{(2)}} \frac{1}{2}\|Bw^{(2)} + e_2 b^{(2)}\|^2 + c_2\sum_{j=1}^{m_1} L_{\text{hinge}}\big(1 - (x_j^\top w^{(2)} + b^{(2)})\big), \tag{11}$$

where L_hinge(u) = max{0, u} is known as the hinge loss function, the sum in (10) runs over the samples x_j of class −1, and the sum in (11) runs over the samples x_j of class +1. The hinge loss is a loss function that is commonly used to train classifiers. However, it strives to optimize the shortest distance between the two classes, which causes resampling instability and noise sensitivity in the related classifier [42]. To deal with the problem of noise sensitivity, Huang et al. [39] presented the combination of the SVM classifier with the pinball loss function, which also penalizes correctly classified data:

$$L_\tau(u) = \begin{cases} u, & u \geq 0, \\ -\tau u, & u < 0, \end{cases} \tag{12}$$

where τ ≥ 0 is a user-defined parameter. The so-called pinball loss is a well-known method in statistics and machine learning for calculating conditional quantiles. Despite achieving noise insensitivity, the pinball loss function is unable to achieve sparsity.
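As a compact sketch (ours, not from the paper) of how the linear TSVM duals (7) and (8) can be solved with general-purpose tools, the snippet below uses a box-constrained quasi-Newton solver; the small ridge term delta anticipates the ill-conditioning discussion later, and all names are our own.

```python
# Sketch (ours): linear TSVM training via the dual QPPs (7)-(8).
# A small ridge term delta*I guards against ill-conditioned P^T P, Q^T Q.
import numpy as np
from scipy.optimize import minimize

def tsvm_fit(A, B, c1=1.0, c2=1.0, delta=1e-6):
    e1, e2 = np.ones((A.shape[0], 1)), np.ones((B.shape[0], 1))
    P, Q = np.hstack([A, e1]), np.hstack([B, e2])    # P = [A e1], Q = [B e2]
    G1 = Q @ np.linalg.solve(P.T @ P + delta * np.eye(P.shape[1]), Q.T)
    G2 = P @ np.linalg.solve(Q.T @ Q + delta * np.eye(Q.shape[1]), P.T)

    def solve_box_qp(G, c):     # max e^T a - 0.5 a^T G a  s.t.  0 <= a <= c
        m = G.shape[0]
        obj = lambda a: 0.5 * a @ G @ a - a.sum()
        grad = lambda a: G @ a - np.ones(m)
        res = minimize(obj, np.zeros(m), jac=grad, method="L-BFGS-B",
                       bounds=[(0.0, c)] * m)
        return res.x

    alpha = solve_box_qp(G1, c1)
    beta = solve_box_qp(G2, c2)
    # Recover the hyperplanes as in (9).
    u1 = -np.linalg.solve(P.T @ P + delta * np.eye(P.shape[1]), Q.T @ alpha)
    u2 = np.linalg.solve(Q.T @ Q + delta * np.eye(Q.shape[1]), P.T @ beta)
    return (u1[:-1], u1[-1]), (u2[:-1], u2[-1])      # (w1, b1), (w2, b2)
```

A new point is then assigned to the class whose hyperplane is nearer, exactly as in the unnumbered assignment rule above.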
In their work, Huang et al. also examined a similar type of pinball loss function to ensure sparsity, namely the ε-insensitive pinball loss. The use of the ε-insensitive pinball loss function increases the prediction performance of the SVM model significantly while maintaining sparsity in the SVM model. This function is defined as follows:

$$L^{\epsilon}_{\tau}(u) = \begin{cases} u - \epsilon, & u > \epsilon, \\ 0, & -\frac{\epsilon}{\tau} \leq u \leq \epsilon, \\ -\tau\big(u + \frac{\epsilon}{\tau}\big), & u < -\frac{\epsilon}{\tau}, \end{cases} \tag{13}$$

where τ ≥ 0 and ε ≥ 0 are user-defined parameters. In an SVM model, sparsity is well known to be a highly desirable property: a sparse SVM model constructs the decision function from a small number of training data points and predicts the responses of test data points in a very short amount of time. The width of the ε-insensitive zone fluctuates with the τ values, but it should ideally change with the variation in the training data response values; in practice, this also makes choosing a good ε value difficult. As a result, Rastogi et al. [40] saw the necessity of creating a loss function that can be used to enhance the ε-insensitive method in SVM. They proposed the (ε_1, ε_2)-insensitive zone pinball loss function and used this loss in combination with the SVM model, which is then called the generalized pinball SVM, with the following loss function:

$$L^{\epsilon_1,\epsilon_2}_{\tau_1,\tau_2}(u) = \begin{cases} \tau_1\big(u - \frac{\epsilon_1}{\tau_1}\big), & u > \frac{\epsilon_1}{\tau_1}, \\ 0, & -\frac{\epsilon_2}{\tau_2} \leq u \leq \frac{\epsilon_1}{\tau_1}, \\ -\tau_2\big(u + \frac{\epsilon_2}{\tau_2}\big), & u < -\frac{\epsilon_2}{\tau_2}, \end{cases} \tag{14}$$

where τ_1, τ_2, ε_1, and ε_2 are non-negative parameters. In the next subsection, we briefly describe the generalized pinball SVM model proposed by Rastogi et al. This approach is a modification of previous loss functions that takes noise sensitivity, resampling stability, and data scatter minimization into account.
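To make the relationships between these losses concrete, the following short sketch (ours, not from the paper) implements the loss family directly from the formulas above.

```python
# Sketch (ours): the loss family discussed above as plain functions.
import numpy as np

def hinge(u):
    return np.maximum(0.0, u)

def pinball(u, tau):                                  # Equation (12)
    return np.where(u >= 0, u, -tau * u)

def generalized_pinball(u, tau1, tau2, eps1, eps2):   # Equation (14)
    # max{tau1*u - eps1, 0, -(tau2*u + eps2)} reproduces the three cases.
    return np.maximum.reduce([tau1 * u - eps1,
                              np.zeros_like(u),
                              -(tau2 * u + eps2)])

# Special cases: eps1 = eps2 = 0 with tau1 = 1, tau2 = tau recovers the
# pinball loss (12); tau1 = 1, tau2 = tau, eps1 = eps2 = eps recovers the
# eps-insensitive pinball loss (13).
u = np.linspace(-3, 3, 7)
print(generalized_pinball(u, tau1=1.0, tau2=0.5, eps1=0.5, eps2=0.5))
```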

Support Vector Machine with Generalized Pinball Loss
With this generalized pinball loss function, the resulting formulation, termed the generalized pinball support vector machine, was proposed by Rastogi et al. [40] as the unconstrained optimization problem

$$\min_{w,b} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} L^{\epsilon_1,\epsilon_2}_{\tau_1,\tau_2}\big(1 - y_i(w^\top x_i + b)\big). \tag{15}$$

The problem (15) can then be reformulated as the following QPP:

$$\begin{aligned} \min_{w,b,\xi}\ & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \\ \text{s.t.}\ & y_i(w^\top x_i + b) \geq 1 - \frac{1}{\tau_1}(\xi_i + \epsilon_1), \\ & y_i(w^\top x_i + b) \leq 1 + \frac{1}{\tau_2}(\xi_i + \epsilon_2), \quad \xi_i \geq 0, \quad i = 1,\dots,m. \end{aligned} \tag{16}$$

Its dual QPP is generated as follows by introducing the Lagrangian multipliers α_i and β_i:

$$\begin{aligned} \max_{\alpha,\beta}\ & \sum_{i=1}^{m}(\alpha_i - \beta_i) - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(\alpha_i - \beta_i)(\alpha_j - \beta_j)y_i y_j x_i^\top x_j - \frac{\epsilon_1}{\tau_1}\sum_{i=1}^{m}\alpha_i - \frac{\epsilon_2}{\tau_2}\sum_{i=1}^{m}\beta_i \\ \text{s.t.}\ & \sum_{i=1}^{m}(\alpha_i - \beta_i)y_i = 0, \quad \frac{\alpha_i}{\tau_1} + \frac{\beta_i}{\tau_2} \leq C, \quad \alpha_i \geq 0, \quad \beta_i \geq 0. \end{aligned} \tag{17}$$

By solving the dual problem (17), we can obtain the decision function

$$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{m}(\alpha_i^* - \beta_i^*)y_i x_i^\top x + b^*\Big). \tag{18}$$

However, for large-scale applications, the generalized pinball SVM has a high computational complexity and is quite slow. In the next section, we present the heart of our proposed technique, which reduces this high computational complexity by proposing a TSVM with a generalized pinball loss function aimed at the binary classification problem, covering both the linear and nonlinear cases, as shown in Figure 1.

Proposed Twin Support Vector Machine with Generalized Pinball Loss (GPin-TSVM)
In this section, we employ the Lagrange multiplier approach to derive the solution of our GPin-TSVM model, which is based on the generalized pinball loss function. Our GPin-TSVM can be employed in both linear and nonlinear scenarios.

Linear Case
Introducing the generalized pinball loss into the standard TSVM, we obtain the following pair of unconstrained problems:

$$\min_{w^{(1)},b^{(1)}} \frac{1}{2}\|Aw^{(1)} + e_1 b^{(1)}\|^2 + c_1\sum_{j=1}^{m_2} L^{\epsilon_1,\epsilon_2}_{\tau_1,\tau_2}\big(1 + (x_j^\top w^{(1)} + b^{(1)})\big) \tag{19}$$

and

$$\min_{w^{(2)},b^{(2)}} \frac{1}{2}\|Bw^{(2)} + e_2 b^{(2)}\|^2 + c_2\sum_{j=1}^{m_1} L^{\epsilon_3,\epsilon_4}_{\tau_3,\tau_4}\big(1 - (x_j^\top w^{(2)} + b^{(2)})\big), \tag{20}$$

where the sums in (19) and (20) run over the samples of class −1 and class +1, respectively. The problems (19) and (20) are translated further into formulations analogous to (5) and (6) by adding a slack vector ξ, yielding the following QPPs:

$$\begin{aligned} \min_{w^{(1)},b^{(1)},\xi}\ & \frac{1}{2}\|Aw^{(1)} + e_1 b^{(1)}\|^2 + c_1 e_2^\top\xi \\ \text{s.t.}\ & e_2 + (Bw^{(1)} + e_2 b^{(1)}) \leq \frac{1}{\tau_1}(\xi + \epsilon_1 e_2), \\ & e_2 + (Bw^{(1)} + e_2 b^{(1)}) \geq -\frac{1}{\tau_2}(\xi + \epsilon_2 e_2), \quad \xi \geq 0, \end{aligned} \tag{21}$$

and

$$\begin{aligned} \min_{w^{(2)},b^{(2)},\xi}\ & \frac{1}{2}\|Bw^{(2)} + e_2 b^{(2)}\|^2 + c_2 e_1^\top\xi \\ \text{s.t.}\ & e_1 - (Aw^{(2)} + e_1 b^{(2)}) \leq \frac{1}{\tau_3}(\xi + \epsilon_3 e_1), \\ & e_1 - (Aw^{(2)} + e_1 b^{(2)}) \geq -\frac{1}{\tau_4}(\xi + \epsilon_4 e_1), \quad \xi \geq 0, \end{aligned} \tag{22}$$

where τ_1, τ_2, τ_3, τ_4, ε_1, ε_2, ε_3, and ε_4 are non-negative parameters. We transform (21) and (22) into their dual forms to arrive at the solution. For (21), we introduce the Lagrange multipliers α ≥ 0, β ≥ 0, and γ ≥ 0 and obtain the Lagrange function

$$\begin{aligned} L &= \frac{1}{2}\|Aw^{(1)} + e_1 b^{(1)}\|^2 + c_1 e_2^\top\xi + \alpha^\top\Big(e_2 + Bw^{(1)} + e_2 b^{(1)} - \frac{1}{\tau_1}(\xi + \epsilon_1 e_2)\Big) \\ &\quad - \gamma^\top\Big(e_2 + Bw^{(1)} + e_2 b^{(1)} + \frac{1}{\tau_2}(\xi + \epsilon_2 e_2)\Big) - \beta^\top\xi. \end{aligned} \tag{23}$$

We use the KKT optimality conditions to find the following results:

$$A^\top(Aw^{(1)} + e_1 b^{(1)}) + B^\top(\alpha - \gamma) = 0, \tag{24}$$

$$e_1^\top(Aw^{(1)} + e_1 b^{(1)}) + e_2^\top(\alpha - \gamma) = 0, \tag{25}$$

$$c_1 e_2 - \frac{\alpha}{\tau_1} - \frac{\gamma}{\tau_2} - \beta = 0, \tag{26}$$

$$\alpha^\top\Big(e_2 + Bw^{(1)} + e_2 b^{(1)} - \frac{1}{\tau_1}(\xi + \epsilon_1 e_2)\Big) = 0, \tag{27}$$

$$\gamma^\top\Big(e_2 + Bw^{(1)} + e_2 b^{(1)} + \frac{1}{\tau_2}(\xi + \epsilon_2 e_2)\Big) = 0, \tag{28}$$

$$\beta^\top\xi = 0. \tag{29}$$

By using (26) and β ≥ 0, we obtain

$$\frac{\alpha}{\tau_1} + \frac{\gamma}{\tau_2} \leq c_1 e_2. \tag{30}$$

Combining (24) and (25) yields

$$\begin{bmatrix} A^\top \\ e_1^\top \end{bmatrix}(Aw^{(1)} + e_1 b^{(1)}) + \begin{bmatrix} B^\top \\ e_2^\top \end{bmatrix}(\alpha - \gamma) = 0. \tag{31}$$

Define λ = α − γ, P = [A e_1], and Q = [B e_2]. Equation (31) can be recast using these notations as follows:

$$\begin{bmatrix} w^{(1)} \\ b^{(1)} \end{bmatrix} = -(P^\top P)^{-1}Q^\top\lambda. \tag{32}$$

We can obtain the dual of (21) using Equation (23) and the given KKT conditions as follows:

$$\begin{aligned} \max_{\alpha,\gamma}\ & e_2^\top(\alpha - \gamma) - \frac{1}{2}(\alpha - \gamma)^\top Q(P^\top P)^{-1}Q^\top(\alpha - \gamma) - \frac{\epsilon_1}{\tau_1}e_2^\top\alpha - \frac{\epsilon_2}{\tau_2}e_2^\top\gamma \\ \text{s.t.}\ & \frac{\alpha}{\tau_1} + \frac{\gamma}{\tau_2} \leq c_1 e_2, \quad \alpha \geq 0, \quad \gamma \geq 0. \end{aligned} \tag{33}$$

The dual problem of (22) can be derived similarly:

$$\begin{aligned} \max_{\omega,\mu}\ & e_1^\top(\omega - \mu) - \frac{1}{2}(\omega - \mu)^\top P(Q^\top Q)^{-1}P^\top(\omega - \mu) - \frac{\epsilon_3}{\tau_3}e_1^\top\omega - \frac{\epsilon_4}{\tau_4}e_1^\top\mu \\ \text{s.t.}\ & \frac{\omega}{\tau_3} + \frac{\mu}{\tau_4} \leq c_2 e_1, \quad \omega \geq 0, \quad \mu \geq 0, \end{aligned} \tag{34}$$

where ω ≥ 0 and µ ≥ 0 are Lagrange multipliers. Finally, the best separating hyperplanes are given by

$$\begin{bmatrix} w^{(1)} \\ b^{(1)} \end{bmatrix} = -(P^\top P)^{-1}Q^\top(\alpha - \gamma) \quad \text{and} \quad \begin{bmatrix} w^{(2)} \\ b^{(2)} \end{bmatrix} = (Q^\top Q)^{-1}P^\top(\omega - \mu). \tag{35}$$

Although P^T P and Q^T Q are always positive semidefinite, we cannot ensure that they are invertible, and in some circumstances they may not be well conditioned. To account for the possible ill-conditioning of P^T P and Q^T Q, a regularization term δI (δ > 0) must be used [44], so that (P^T P + δI)^{-1} and (Q^T Q + δI)^{-1} are employed in practice. Depending on which of the two hyperplanes (4) a new sample point x ∈ R^n lies closest to, it is assigned to class i (i = +1 or −1) by

$$\text{class } i = \arg\min_{k=1,2} \frac{|x^\top w^{(k)} + b^{(k)}|}{\|w^{(k)}\|}. \tag{36}$$
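As a small sketch (ours, not from the paper), the snippet below shows the hyperplane recovery (35) with the δI regularizer and the assignment rule (36), assuming the dual solutions of (33) and (34) are already available; all names are our own.

```python
# Sketch (ours): recover the GPin-TSVM hyperplanes from dual solutions via
# (35) with the delta*I regularizer, then assign a class by (36).
import numpy as np

def hyperplanes(P, Q, lam, nu, delta=1e-6):
    # lam = alpha - gamma from (33); nu = omega - mu from (34)
    u1 = -np.linalg.solve(P.T @ P + delta * np.eye(P.shape[1]), Q.T @ lam)
    u2 = np.linalg.solve(Q.T @ Q + delta * np.eye(Q.shape[1]), P.T @ nu)
    return (u1[:-1], u1[-1]), (u2[:-1], u2[-1])   # (w1, b1), (w2, b2)

def predict(x, w1, b1, w2, b2):
    # Class of the nearer hyperplane, as in (36).
    d1 = abs(x @ w1 + b1) / np.linalg.norm(w1)
    d2 = abs(x @ w2 + b2) / np.linalg.norm(w2)
    return +1 if d1 <= d2 else -1
```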

Nonlinear Case
In higher dimensions, support vector machines are even more difficult to interpret: it is considerably more difficult to view how the data can be separated linearly and what the decision boundary will look like. In practice, however, data are rarely linearly separable; therefore, we must map them into a higher-dimensional space before developing a support vector classifier. This problem can be solved using the symmetric kernel trick. We now use a symmetric kernel method to extend our linear GPin-TSVM to the nonlinear case [28,45]; the symmetric kernel used has a significant impact on how well GPin-TSVM functions. If the defined kernel function is K(·, ·), the nonparallel hyperplanes in the kernel-generated space are

$$K(x^\top, X^\top)w^{(1)} + b^{(1)} = 0 \quad \text{and} \quad K(x^\top, X^\top)w^{(2)} + b^{(2)} = 0,$$

where w^(1), w^(2) ∈ R^m and X = [A; B] ∈ R^{m×n} stacks the m = m_1 + m_2 training samples. For the nonlinear case, the problems corresponding to (21) and (22) are

$$\begin{aligned} \min_{w^{(1)},b^{(1)},\xi}\ & \frac{1}{2}\|K(A, X^\top)w^{(1)} + e_1 b^{(1)}\|^2 + c_1 e_2^\top\xi \\ \text{s.t.}\ & e_2 + (K(B, X^\top)w^{(1)} + e_2 b^{(1)}) \leq \frac{1}{\tau_1}(\xi + \epsilon_1 e_2), \\ & e_2 + (K(B, X^\top)w^{(1)} + e_2 b^{(1)}) \geq -\frac{1}{\tau_2}(\xi + \epsilon_2 e_2), \quad \xi \geq 0, \end{aligned} \tag{37}$$

and

$$\begin{aligned} \min_{w^{(2)},b^{(2)},\xi}\ & \frac{1}{2}\|K(B, X^\top)w^{(2)} + e_2 b^{(2)}\|^2 + c_2 e_1^\top\xi \\ \text{s.t.}\ & e_1 - (K(A, X^\top)w^{(2)} + e_1 b^{(2)}) \leq \frac{1}{\tau_3}(\xi + \epsilon_3 e_1), \\ & e_1 - (K(A, X^\top)w^{(2)} + e_1 b^{(2)}) \geq -\frac{1}{\tau_4}(\xi + \epsilon_4 e_1), \quad \xi \geq 0. \end{aligned} \tag{38}$$

The Lagrange function is applied, and the KKT optimality requirements are used to produce the dual of (37):

$$\begin{aligned} \max_{\alpha,\gamma}\ & e_2^\top(\alpha - \gamma) - \frac{1}{2}(\alpha - \gamma)^\top Q(P^\top P)^{-1}Q^\top(\alpha - \gamma) - \frac{\epsilon_1}{\tau_1}e_2^\top\alpha - \frac{\epsilon_2}{\tau_2}e_2^\top\gamma \\ \text{s.t.}\ & \frac{\alpha}{\tau_1} + \frac{\gamma}{\tau_2} \leq c_1 e_2, \quad \alpha \geq 0, \quad \gamma \geq 0. \end{aligned} \tag{39}$$

Similarly, the dual of Equation (38) can be obtained as follows:

$$\begin{aligned} \max_{\omega,\mu}\ & e_1^\top(\omega - \mu) - \frac{1}{2}(\omega - \mu)^\top P(Q^\top Q)^{-1}P^\top(\omega - \mu) - \frac{\epsilon_3}{\tau_3}e_1^\top\omega - \frac{\epsilon_4}{\tau_4}e_1^\top\mu \\ \text{s.t.}\ & \frac{\omega}{\tau_3} + \frac{\mu}{\tau_4} \leq c_2 e_1, \quad \omega \geq 0, \quad \mu \geq 0, \end{aligned} \tag{40}$$

where now P = [K(A, X^T) e_1], Q = [K(B, X^T) e_2], and α, γ, ω, and µ are Lagrange multipliers. Finally, the best separating hyperplanes are given by

$$\begin{bmatrix} w^{(1)} \\ b^{(1)} \end{bmatrix} = -(P^\top P)^{-1}Q^\top(\alpha - \gamma) \quad \text{and} \quad \begin{bmatrix} w^{(2)} \\ b^{(2)} \end{bmatrix} = (Q^\top Q)^{-1}P^\top(\omega - \mu). \tag{41}$$

Thus, a new sample point x ∈ R^n is assigned to class i (i = +1 or −1) by

$$\text{class } i = \arg\min_{k=1,2} \frac{|K(x^\top, X^\top)w^{(k)} + b^{(k)}|}{\sqrt{w^{(k)\top} K(X, X^\top)\, w^{(k)}}}.$$
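As a brief sketch (ours, not from the paper), the kernel blocks K(A, X^T) and K(B, X^T) used above can be formed as follows for an RBF kernel; the data and σ value are illustrative assumptions.

```python
# Sketch (ours): RBF kernel blocks K(A, X^T) and K(B, X^T) for the
# nonlinear GPin-TSVM, built with scipy's pairwise squared distances.
import numpy as np
from scipy.spatial.distance import cdist

def rbf_block(U, X, sigma=1.0):
    # K[i, j] = exp(-||u_i - x_j||^2 / (2 * sigma^2))
    return np.exp(-cdist(U, X, "sqeuclidean") / (2.0 * sigma ** 2))

A = np.random.randn(40, 3)          # class +1 samples
B = np.random.randn(60, 3)          # class -1 samples
X = np.vstack([A, B])               # all m = m1 + m2 training samples

K_A = rbf_block(A, X)               # shape (40, 100)
K_B = rbf_block(B, X)               # shape (60, 100)
P = np.hstack([K_A, np.ones((40, 1))])   # P = [K(A, X^T) e1]
Q = np.hstack([K_B, np.ones((60, 1))])   # Q = [K(B, X^T) e2]
```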

Properties of the GPin-TSVM
We examine the noise insensitivity and within-class scatter properties of the GPin-TSVM in this section.

Noise Insensitivity
The principal advantage of our proposed algorithm GPin-TSVM is that it is insensitive to noise while maintaining sparsity. In this subsection, we explain the advantage of giving a penalty to correctly classified points while simultaneously conserving sparsity up to a certain scale. Consider the generalized sign function sgn^{ε_1,ε_2}_{τ_1,τ_2}(1 − y(w^T x + b)), where sgn^{ε_1,ε_2}_{τ_1,τ_2}(u) is the subgradient of (14):

$$\operatorname{sgn}^{\epsilon_1,\epsilon_2}_{\tau_1,\tau_2}(u) = \begin{cases} \tau_1, & u > \frac{\epsilon_1}{\tau_1}, \\ [0, \tau_1], & u = \frac{\epsilon_1}{\tau_1}, \\ 0, & -\frac{\epsilon_2}{\tau_2} < u < \frac{\epsilon_1}{\tau_1}, \\ [-\tau_2, 0], & u = -\frac{\epsilon_2}{\tau_2}, \\ -\tau_2, & u < -\frac{\epsilon_2}{\tau_2}. \end{cases} \tag{42}$$

In the linear case, we concentrate on the first model of the GPin-TSVM for clarity. Using the KKT optimality condition, Equation (19) can be written as

$$\begin{bmatrix} A^\top \\ e_1^\top \end{bmatrix}(Aw^{(1)} + e_1 b^{(1)}) + c_1\sum_{j=1}^{m_2}\operatorname{sgn}^{\epsilon_1,\epsilon_2}_{\tau_1,\tau_2}\big(1 + (x_j^\top w^{(1)} + b^{(1)})\big)\begin{bmatrix} x_j \\ 1 \end{bmatrix} \ni \mathbf{0}, \tag{43}$$

where **0** is a zero vector. For the given w^(1) and b^(1), writing u_j = 1 + (x_j^T w^(1) + b^(1)) for the samples x_j of class −1, the entire index set can be divided into five different subsets:

E_1^+ = {j : u_j > ε_1/τ_1}, E_2^+ = {j : u_j = ε_1/τ_1}, E_3^+ = {j : −ε_2/τ_2 < u_j < ε_1/τ_1}, E_4^+ = {j : u_j = −ε_2/τ_2}, and E_5^+ = {j : u_j < −ε_2/τ_2}.

The data samples in E_3^+ do not contribute to w^(1), because the subgradient at all of these samples is zero, as shown in Equation (42). As a result, E_3^+ has a direct impact on the sparsity of the model. We perceive that ε_1 and ε_2 control the number of samples in E_3^+: as ε_1 and ε_2 approach 0, sparsity is lost, whereas if ε_1 → ∞ and ε_2 → ∞, we increase the sparsity as a consequence of having more samples in E_3^+. Using the notation E_1^+, E_2^+, E_3^+, E_4^+, and E_5^+, Equation (43) can be rewritten as the existence of ψ_j ∈ [0, τ_1] and θ_j ∈ [−τ_2, 0] such that

$$\begin{bmatrix} A^\top \\ e_1^\top \end{bmatrix}(Aw^{(1)} + e_1 b^{(1)}) + c_1\Bigg(\tau_1\sum_{j\in E_1^+}\begin{bmatrix} x_j \\ 1 \end{bmatrix} + \sum_{j\in E_2^+}\psi_j\begin{bmatrix} x_j \\ 1 \end{bmatrix} + \sum_{j\in E_4^+}\theta_j\begin{bmatrix} x_j \\ 1 \end{bmatrix} - \tau_2\sum_{j\in E_5^+}\begin{bmatrix} x_j \\ 1 \end{bmatrix}\Bigg) = \mathbf{0}. \tag{44}$$

Theorem 1. Let p_1 be the number of samples x_i^- in E_1^+. If the optimization problem (33) or (39) has a solution, then the following inequality must hold:

$$p_1 \leq \frac{m_2\tau_2}{\tau_1 + \tau_2} - \frac{e_1^\top(Aw^{(1)} + e_1 b^{(1)})}{c_1(\tau_1 + \tau_2)}.$$

Proof. Let x_{i_0}^- be an arbitrary sample in E_1^+. We have β_{i_0} = γ_{i_0} = 0 by using the KKT conditions (28) and (29). We obtain α_{i_0} = c_1τ_1 by using the KKT condition (26), which implies λ_{i_0} = c_1τ_1 in (25). Moreover, we can obtain −c_1τ_2 ≤ λ_i ≤ c_1τ_1 for every i, because α_i ≥ 0 and γ_i ≥ 0 together with (30). As a result, from (25) we obtain −e_1^T(Aw^(1) + e_1 b^(1)) = e_2^T λ ≥ p_1 c_1τ_1 − (m_2 − p_1)c_1τ_2. Finally, rearranging gives the stated bound.

Theorem 1 implies that m_2τ_2/(τ_1 + τ_2) − e_1^T(Aw^(1) + e_1 b^(1))/(c_1(τ_1 + τ_2)) is an upper bound on the number of samples in E_1^+. The parameters τ_1 and τ_2 control the numbers of samples in E_1^+, E_3^+, and E_5^+. When τ_1 and τ_2 decrease, the number of elements in E_1^+ becomes smaller and the classification result is sensitive to feature noise around the decision boundary, which will have a considerable impact. When τ_1 and τ_2 are both large, all three sets contain a large number of samples, making the outcome less sensitive to feature noise.
Briefly, the parameters ε_1, ε_2, τ_1, and τ_2 control the tradeoff between sparsity and noise insensitivity. Similarly, writing v_j = 1 − (x_j^T w^(2) + b^(2)) for the samples x_j of class +1, where j = 1, . . . , m_1, we can separate the index set into five sets for the second model of the GPin-TSVM:

E_1^- = {j : v_j > ε_3/τ_3}, E_2^- = {j : v_j = ε_3/τ_3}, E_3^- = {j : −ε_4/τ_4 < v_j < ε_3/τ_3}, E_4^- = {j : v_j = −ε_4/τ_4}, and E_5^- = {j : v_j < −ε_4/τ_4}.

Similar properties of the parameters τ_3 and τ_4 can be obtained as follows.

Theorem 2. Let p_2 be the number of samples x_i^+ in E_1^-. If the optimization problem (34) or (40) has a solution, then the following inequality must hold:

$$p_2 \leq \frac{m_1\tau_4}{\tau_3 + \tau_4} + \frac{e_2^\top(Bw^{(2)} + e_2 b^{(2)})}{c_2(\tau_3 + \tau_4)}.$$

It likewise indicates that this quantity is an upper bound on the number of samples in E_1^-.

Scatter Minimization
Scatter minimization offers another way to understand the GPin-TSVM. For simplicity, consider only the first QPP (19) of the GPin-TSVM; the conclusions for the other QPP (20) can be obtained in the same manner. For given x_i^- ∈ B and x_j^+ ∈ A, the positive hyperplane x^T w^(1) + b^(1) = 0 can be established by the data samples in a subset Y_2^+ ⊆ A, together with the two bounding hyperplanes H^+ shifted around it. The scatter of the negative class is calculated by adding the distances between each point x_i^- and one supplied negative sample; using w^(1), an analogous expression is obtained for the positive class: with a specific data sample x_{j_2}^+ ∈ Y_2^+, the scatter over every sample x_j^+ ∈ A is computed from the distances to the hyperplane. Due to the fact that the scatter is a positive value, we can take it as a sum of squares. Consider the resulting formula (45), in which c_11 is a constant. The first term guarantees that the scatter of the samples x_j^+ ∈ A about the hyperplane x^T w^(1) + b^(1) = 0 is minimized, whereas the second term seeks to lower the error values arising from how close the B samples must be to H^+ by minimizing the scatter of x_i^- ∈ B around the hyperplane H^+. In the GPin-TSVM (19), the first term of (45) appears in its mathematically equivalent form, whereas the absolute value employed in (45) is extended to L^{ε_1,ε_2}_{τ_1,τ_2}. Concretely, we introduce the misclassification term of (45) with positive parameters c_12, c_13, c_14, and c_15. The GPin-TSVM (19) can then be obtained under the following conditions: c_11 + c_12 + c_13 = c_1τ_1, c_11 + c_13 = 0, c_11 + c_14 = 0, and c_11 + c_14 + c_15 = c_1τ_2. From the first and last conditions we have τ_1 = c_12/c_1 and τ_2 = c_15/c_1, which shows that the reasonable ranges are τ_1 ≥ 0 and τ_2 ≥ 0. In the generalized pinball loss minimization, we examine the misclassification error and the within-class scatter of one class simultaneously. The GPin-TSVM (19) is thus regarded as a trade-off between low misclassification and reduced scatter.

Numerical Experiments
In this section, the classification performance of the proposed approach in terms of accuracy is compared to that of other relevant approaches, namely the hinge loss twin support vector machine (TSVM), the pinball loss TSVM (Pin-TSVM), and the ε-insensitive zone pinball loss TSVM (IPin-TSVM), on synthetic datasets, on the UCI machine learning repository [46], and on handwritten digit recognition applications. We employed 10-fold cross validation for all of our experiments. The average accuracy and standard deviation for each experiment are displayed in all tables, with the best result highlighted.

Synthetic Dataset
We test our approach on a two-dimensional case in which equal numbers of samples are drawn from two Gaussian distributions. In Figure 2, the bar chart demonstrates the classification accuracies of GPin-TSVM, IPin-TSVM, Pin-TSVM, and TSVM as the noise level r increases from 0% to 30%. In the large majority of cases, the GPin-TSVM produces the greatest outcome. This implies that the GPin-TSVM was the strongest candidate for classifying the noise-corrupted data. In the next result, Figure 3 shows the slopes of the hyperplanes obtained by SVM, TSVM, and the proposed GPin-TSVM over four noisy synthetic datasets.
In this result, we show that, when the level of noise increases from 0 to 20%, the slope of the SVM hyperplane diverges from 0.6575 to 0.1592 and the slopes of the TSVM hyperplanes diverge from 1.7988 and 1.2688 to 0.2879 and 0.2562, whereas the hyperplanes of our GPin-TSVM change only slightly. This suggests that our proposed GPin-TSVM model is largely unaffected by noise near the boundary.
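For context, here is a sketch (ours) of this kind of noise-corrupted two-Gaussian setup; the means, covariances, and label-flipping mechanism below are illustrative assumptions, since the paper's exact distribution parameters are not reproduced above.

```python
# Sketch (ours): two-Gaussian synthetic data with a fraction r of labels
# flipped to simulate noise. Means and covariances are assumptions.
import numpy as np

def make_noisy_gaussians(n_per_class=100, r=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X_pos = rng.multivariate_normal([1.0, 1.0], 0.5 * np.eye(2), n_per_class)
    X_neg = rng.multivariate_normal([-1.0, -1.0], 0.5 * np.eye(2), n_per_class)
    X = np.vstack([X_pos, X_neg])
    y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
    flip = rng.random(len(y)) < r          # corrupt a fraction r of labels
    y[flip] *= -1
    return X, y

X, y = make_noisy_gaussians(r=0.2)         # 20% label noise
```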

UCI Datasets
Additionally, we perform testing on 10 benchmark datasets from the UCI machine learning database [46]. Imbalanced datasets lead to incorrect classification in classification problems. The imbalance ratio (IR) [47] is defined as

$$\text{IR} = \frac{\text{number of data points in the majority class}}{\text{number of data points in the minority class}}.$$
The dataset descriptions can be found in Table 1. To tune the trade-off parameters and the kernel parameter σ for the UCI benchmark datasets, we used the grid search method [48]. A validation set of 10% randomly selected data points was used for each dataset. In our tests, we chose the values of the parameters c_1 and c_2 from the set {10^i | i = −2, −1, 0, 1, 2}, and the remaining parameters were tuned in the range {0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 2.5}. Tables 2 and 3 summarize the experimental results of the four approaches (TSVM, Pin-TSVM, IPin-TSVM, and GPin-TSVM) with the linear and RBF kernels, respectively. The optimal parameters used in Tables 2 and 3 are summarized in Tables 4 and 5, respectively. In Tables 2 and 3, accuracy is reported as the mean of the ten test results plus or minus the standard deviation.

Table 3. On UCI datasets, 10-fold cross validation using the RBF kernel yielded the mean accuracy (%) and standard deviation.

Table 2 illustrates the results of applying TSVM, Pin-TSVM, IPin-TSVM, and our proposed GPin-TSVM with a linear kernel on eight distinct UCI datasets. The results with the highest accuracy are highlighted in bold. According to the experimental results, in most datasets the classification performance of GPin-TSVM outperforms TSVM, Pin-TSVM, and IPin-TSVM in terms of accuracy. Our proposed GPin-TSVM has the highest prediction accuracy in 20 of the 32 scenarios. Furthermore, when the number of noise samples varies from r = 0 (noise free) to r = 0.2, our proposed GPin-TSVM outperforms the existing methods in terms of classification accuracy and stability. However, on the Breast dataset, the classification accuracies of IPin-TSVM are better than those of our proposed GPin-TSVM.
The nonlinear kernel with an RBF kernel was subjected to a similar analysis, with the results presented in Table 3. Our proposed GPin-TSVM has the best prediction accuracy in 20 of the 32 cases. In most of the datasets, our proposed GPin-TSVM offers the best prediction accuracy, as shown in Tables 2 and 3. As a result, the accuracy of our proposed GPin-TSVM outperforms that of existing models.
The sparsity of the proposed GPin-TSVM is compared to that of the standard TSVM for the linear and nonlinear cases in Tables 6 and 7, respectively. When we look at the results, we can see that as ε_1 = ε_2 grows, our solution becomes more sparse. It is clear from both tables that our proposed GPin-TSVM is more sparse than the standard TSVM while still keeping its noise-insensitivity properties. Because of the sparsity of the solution, the prediction process is faster than that of the standard TSVM, which is extremely useful on datasets with large numbers of samples.

Table 5. The optimal parameters of Table 3.

Hybrid CNN-GPin-TSVM Classifier for Handwritten Digit Recognition
In this part, we discuss the proposed GPin-TSVM algorithm and its application to handwritten digit recognition problems. Handwritten digit recognition is a difficult task that has been intensively researched in the field of handwriting recognition for many years, and it remains a popular topic as a result of its many practical uses and financial implications. Here, we use the MNIST handwritten dataset to carry out the experiments. In the field of machine learning, the MNIST dataset is commonly used for training and testing. There are 60,000 samples in the training set and 10,000 in the test set of this dataset, and each sample has a size of 28 × 28 pixels. As seen in Figure 4, the MNIST dataset comprises grayscale images of handwritten digits from '0' to '9'. Several approaches for handwriting recognition have been proposed in the literature, such as k-nearest neighbor (KNN) [49], SVM [49][50][51][52], artificial neural network (ANN) [53,54], convolutional neural network (CNN) [51,55,56], and so on. For the electroencephalogram (EEG) data classification problem, Xin [58] constructed a convolutional support vector machine for classifying epilepsy EEG signals and produced the highest accuracy. Recently, a hybrid CNN-SVM classifier for recognizing handwritten digits was proposed in [50][51][52]. These works combine a powerful CNN with an SVM for handwritten digit recognition on the MNIST dataset, where the SVM acts as a binary classifier and the CNN acts as an automatic feature extractor; together they display strong efficiency for handwritten digit recognition. Inspired by this particular line of work, the goal of this section is to use a CNN to extract features from the input handwritten digit images of the MNIST dataset, with TSVM, Pin-TSVM, IPin-TSVM, and GPin-TSVM working as binary classifiers that replace the softmax layer of the CNN. Moreover, we compare the performance of TSVM, Pin-TSVM, IPin-TSVM, and GPin-TSVM. We choose four pairs of handwritten digits on raw pixel features for our comparisons.
We build the network as follows. The first convolutional (Conv2D) layer takes a single input channel, since the images are grayscale. The kernel size is set to 5 × 5 with a stride of 1. The output of this convolution is set to nine channels, implying that it will extract nine feature maps using nine kernels. We use a padding size of 2 so that the input and output spatial dimensions are the same; the output dimensions of this layer are 9 × 28 × 28. The second convolutional (Conv2D) layer has an input channel size of 9. We set the output channel size to 16, which implies that 16 feature maps will be extracted; this layer's kernel size is 5 with a stride of 1. Each convolutional layer is followed by a ReLU activation and a pooling (MaxPool2D) layer with a kernel of size 2 and a stride of 2; the pooling layers are mainly integrated to reduce the data dimension. Finally, two fully connected layers are used. The first fully connected layer receives a flattened version of the feature maps, so it has a dimension of 16 × 7 × 7, or 784 nodes. This layer is connected to a fully connected 80-node layer, and finally a hidden layer of the neural network containing 84 nodes is implemented. After the architecture of the model is defined, the model needs to be compiled. Here, we use TSVM, Pin-TSVM, IPin-TSVM, and GPin-TSVM as the classifier, as it is a binary classification problem. The architecture of the proposed model is described in Figure 5, and a sketch of it is given below. We compare the performance of the proposed model on the MNIST handwritten dataset with other supervised recognition systems. The results of the suggested approach on the MNIST handwritten dataset are shown in Figure 6, and the optimal parameters for the results in Figure 6 are shown in Table 8. From Figure 6, we can learn that GPin-TSVM yields the best prediction accuracy on two of the four digit pairs. On the 1 vs. 7 digit pair, our proposed GPin-TSVM and IPin-TSVM both reach an accuracy of 99.80%, which is greater than the recognition accuracy of the Pin-TSVM and TSVM classifiers. However, the accuracy of GPin-TSVM on some digit pairs, such as 5 vs. 8, is not the best. Overall, our GPin-TSVM performs well in terms of accuracy.

Table 8. The optimal parameters of the result in Figure 6.
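The sketch below (ours) reconstructs the feature extractor from the description above in PyTorch; it assumes a ReLU and pooling stage after each convolution, which is what yields the stated 16 × 7 × 7 feature maps. The 84-dimensional output would be fed to the TSVM variants in place of the softmax layer.

```python
# Sketch (ours) of the CNN feature extractor described above, in PyTorch.
# The 84-dimensional output replaces the softmax layer and serves as input
# to TSVM / Pin-TSVM / IPin-TSVM / GPin-TSVM as binary classifiers.
import torch
import torch.nn as nn

class DigitFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 9, kernel_size=5, stride=1, padding=2),   # 9 x 28 x 28
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                 # 9 x 14 x 14
            nn.Conv2d(9, 16, kernel_size=5, stride=1, padding=2),  # 16 x 14 x 14
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                 # 16 x 7 x 7
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                 # 16 * 7 * 7 = 784 features
            nn.Linear(784, 80),           # fully connected 80-node layer
            nn.ReLU(),
            nn.Linear(80, 84),            # 84-node feature layer
        )

    def forward(self, x):
        return self.fc(self.conv(x))

features = DigitFeatureExtractor()(torch.zeros(1, 1, 28, 28))
print(features.shape)  # torch.Size([1, 84])
```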

Statistical Analysis
On the four pairs of handwritten digits, the Friedman test is primarily used to evaluate the classification performance of the proposed GPin-TSVM algorithm. The Friedman test, along with post hoc testing, is a statistical test method that ranks the algorithms on each dataset, with the best method receiving the lowest rank [59]. The tests allow for a more accurate assessment of the algorithms' relevance. We compare four different classifiers on four different pairs of handwritten digits. The accuracy of the compared classifiers on each dataset is ranked, and the classifier with the highest accuracy receives the smallest rank r_i. Based on the accuracy on the four pairs of handwritten digits, the average rank of all methods is shown in Table 9. Under the null hypothesis, the Friedman statistics

$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right] \quad \text{and} \quad F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2}$$

follow the chi-square distribution with k − 1 degrees of freedom and the F-distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom, respectively, where R_j = (1/N)∑_{i=1}^{N} r_i^j is the average rank of the j-th method, k is the number of methods, and N is the number of datasets. According to Table 9, we obtain χ_F^2 = 10.58 and F_F = 22.35. For a significance level of 0.05, the critical value of F(3, 9) is 3.86, and 22.35 > 3.86. As a result, the null hypothesis is rejected; that is, there is a significant difference between the four classifiers. Furthermore, as shown in Table 9, the proposed GPin-TSVM has the lowest average rank. On the four pairs of handwritten digits, the classification performance of the proposed GPin-TSVM therefore outperforms the other classifiers.
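The Friedman test is available off the shelf; below is a quick sketch (ours), where the accuracy values are placeholders purely for illustration and not the paper's results.

```python
# Sketch (ours): Friedman test over k = 4 classifiers on N = 4 digit pairs.
# The accuracy values below are made-up placeholders, not the paper's data.
from scipy.stats import friedmanchisquare

tsvm      = [99.1, 98.7, 97.9, 98.3]
pin_tsvm  = [99.2, 98.9, 98.1, 98.4]
ipin_tsvm = [99.4, 99.0, 98.2, 98.6]
gpin_tsvm = [99.5, 99.2, 98.1, 98.8]

stat, p_value = friedmanchisquare(tsvm, pin_tsvm, ipin_tsvm, gpin_tsvm)
print(f"chi2_F = {stat:.2f}, p = {p_value:.4f}")
```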

Conclusions
In this paper, a new version of the TSVM for pattern classification, a twin support vector machine with the generalized pinball loss function (GPin-TSVM), is created to improve the generalization performance of TSVM, providing lower sensitivity to noise and the ability to preserve the sparsity of the solution. We conduct wide experiments on synthetic datasets, the UCI machine learning repository, and handwritten digit recognition applications, comparing against the standard TSVM, Pin-TSVM, and IPin-TSVM. According to the experimental data, in most cases the accuracy of our proposed GPin-TSVM is superior to that of the existing classifiers. Additionally, the GPin-TSVM is less sensitive to noise and achieves sparsity, which is a major benefit of our proposed method. In the proposed GPin-TSVM, we also investigate the effect of the values ε_i (i = 1, 2); with respect to the sparsity of the solution, GPin-TSVM is more sparse than the standard TSVM. Finally, the proposed GPin-TSVM algorithm was used to solve the handwritten digit recognition problem in the application, and we used the Friedman test to evaluate its classification performance. From the results, it can be seen that our proposed GPin-TSVM is an effective approach for handwritten digit recognition, thereby demonstrating the effectiveness of the proposed algorithm.
Our future study will focus on the applicability of GPin-TSVM to multi-class supervised classification problems and to large-scale classification problems.
Author Contributions: Conceptualization, W.P. and R.W.; methodology, W.P. and R.W.; software, W.P. and W.R.; validation, W.P.; writing-original draft, W.P.; writing-review and editing, W.P. and R.W.; supervision, R.W. All authors have read and agreed to the published version of the manuscript.