PFW: Polygonal Fuzzy Weighted—An SVM Kernel for the Classification of Overlapping Data Groups

Abstract: Support vector machines are supervised learning models capable of classifying data and performing regression by means of a learning algorithm. If data are linearly separable, a conventional linear kernel is used to classify them. Otherwise, the data are normally first transformed from input space to feature space and then classified. However, carrying out this transformation is not always practical, and the process itself increases the cost of training and prediction. To address these problems, this paper puts forward an SVM kernel, called polygonal fuzzy weighted or PFW, which effectively classifies data without space transformation, even if the groups in question are not linearly separable and have overlapping areas. This kernel is based on Gaussian data distribution, standard deviation, the three-sigma rule and a polygonal fuzzy membership function. A comparison of our PFW, radial basis function (RBF) and conventional linear kernels under identical experimental conditions shows that PFW produces a minimum of 26% higher classification accuracy than the linear kernel, and it outperforms the RBF kernel in two-thirds of class labels, by a minimum of 3%. Moreover, since PFW runs within the original input space, it involves no additional computational cost.


Introduction
Machine learning algorithms learn from observations whose class labels are already defined, and they make predictions about new instances based on the models constructed during the training process [1]. There are three main approaches to machine learning: supervised, unsupervised and semi-supervised [2].
Supervised learning infers classification functions from training labeled data. Major supervised learning approaches are multi-layer perceptron neural networks, decision tree learning, support vector machines and symbolic machine learning algorithms [3]. Unsupervised learning infers functions from the hidden structure of unlabeled data. The major unsupervised learning approaches are clustering (k-means, mixture models and hierarchical clustering), anomaly detection and neural networking (Hebbian learning and generative adversarial networks) [3]. Semi-supervised learning infers classification functions from a large amount of unlabeled data, together with a small amount of labeled data [4]. It thus falls between supervised learning and unsupervised learning. The main semi-supervised learning methods are generative models, low-density separation, graph-based methods and heuristic approaches [3].
Among supervised learning models, the support vector machine (SVM) proposed by Boser, Guyon and Vapnik in 1992 [5] is one of the best-known classification methods. SVMs are capable of handling complications such as local minima, overfitting and high dimensionality while delivering outstanding performance [6]. Important elements for the efficient use of SVMs are data preprocessing, the selection of the correct kernel, and optimized SVM and kernel settings [5]. The main limitation of SVMs is the difficulty of solving the quadratic programming problem when the number of data points is large [6].
SVM classification algorithms fall into two classes: linear and nonlinear [7]. A linear classifier is the dot product of a data vector and a weight vector, with the addition of a bias value. Assuming x is a vector, w is a weight vector and b is the bias value, Equation (1) is the discrimination function of a two-class linear classifier [5].
If b = 0, the points x such that w·x = 0 form the set of vectors orthogonal to w passing through the origin. In two dimensions, this is a line; in three dimensions, a plane; and, in general, a hyperplane. The bias b translates the hyperplane away from the origin [8].
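As a concrete illustration of the discriminant function in Equation (1), the following sketch evaluates f(x) = w·x + b for toy vectors (the values of w, b and x here are illustrative only, not taken from the paper):

```python
import numpy as np

def linear_discriminant(x, w, b):
    """Two-class linear discriminant f(x) = w.x + b (Equation (1)).
    The predicted class is sign(f(x)); with b = 0 the separating
    hyperplane passes through the origin."""
    return np.dot(w, x) + b

# Toy example: a separating line in 2D with normal vector w.
w = np.array([1.0, -1.0])
b = 0.5
print(linear_discriminant(np.array([2.0, 0.0]), w, b))  # positive side
print(linear_discriminant(np.array([0.0, 2.0]), w, b))  # negative side
```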
If the data are not separable by a straight line (a hyperplane), as shown in Figure 1B, above, a non-straight boundary separates the instances. This type of data is in nonlinear format, and the classifier for it is called nonlinear [5]. The advantage of linear classifiers is that the training algorithm is simpler and scales with the number of training instances. Linear classifiers, as illustrated in Figure 2, can be converted into nonlinear ones by mapping the input data from the input space, X, to the feature space, F, by means of a nonlinear function, ɸ. Equation (2) is the nonlinear discrimination function in the feature space, F [9]. When a quadratic monomial mapping is applied to input vectors in d-dimensional space, the dimensionality of the feature space, F, is quadratic in d. This entails a quadratic increase in the memory storage required for each feature and in the time needed to compute the classifier's discrimination function. For low data dimensions, this quadratic increase in complexity may be acceptable, but at higher data dimensions it becomes problematic and, if higher-degree monomials are used, completely unworkable [7]. To overcome this issue, kernel methods were devised to avoid explicit mapping of the data into high-dimensional feature space [5].
In 1963, Vapnik initially proposed a linear classifier, using a maximum-margin hyperplane algorithm [3]. In 1964, Aizerman et al., for the first time, proposed the kernel method. Subsequently, in 1992, Boser et al. put forward the method of creating nonlinear classifiers by applying the kernel method to maximum-margin hyperplanes.
The strength of the kernel method is that it can convert any algorithm that is expressible in terms of dot products between two vectors into nonlinear format [10]. A kernel function indicates an inner product within the feature space, and it is generally denoted as in Equation (3), below.
K(x, y) = <ɸ(x), ɸ(y)> (3)

An applicable kernel function is symmetric and continuous, and it ideally has a positive semi-definite Gram matrix (all of its eigenvalues are non-negative). Algorithms which can operate with kernel functions include support vector machines (SVM), kernel perceptrons, Gaussian processes, canonical correlation analysis, principal component analysis (PCA), ridge regression, linear adaptive filters and spectral clustering [10]. The main SVM kernel types are linear, polynomial, radial basis function and sigmoid kernels [12,13].
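The equivalence in Equation (3) can be verified numerically for a simple case. The sketch below (with illustrative vectors) shows that the homogeneous degree-2 polynomial kernel K(x, y) = (x·y)² equals the inner product of an explicit feature map ɸ, without the kernel ever computing ɸ:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: K(x, y) = (x.y)^2."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
# The kernel evaluates the feature-space inner product directly in input space:
assert np.isclose(poly2_kernel(x, y), np.dot(phi(x), phi(y)))
```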

Literature Review
Kernels are designed to project input data into feature space, aiming to achieve a hyperplane that efficiently separates data of different classes [26]. Since this paper proposes an SVM kernel, we first review, below, relevant previous research on the SVM kernels currently in use, focusing on their description, the equations they employ, their applications and, finally, their advantages and disadvantages.
Linear kernels are the simplest kernel function, classifying data points into two classes by use of a straight separator line [26]. Since the classification is carried out in a linear format, no mapping of data points to feature space is required. The function is the inner product <x, y>, with the addition of an optional constant, c, called the bias [26] (Equation (4)).
The main advantages of linear kernels are the quickness of their training and classification processes (since they do not use kernel operations); their low cost; the lower risk of overfitting than with nonlinear kernels; and the lower number of optimization parameters required than with nonlinear kernels [3,10]. Linear kernels can outperform nonlinear ones when the number of features is large relative to the number of training samples, and also when there is a small number of features but the training set is large. Against that, their main disadvantage is that, if the features are not linearly separable by a hyperplane, nonlinear kernels, such as the Gaussian one, will usually produce better classifications [3,10].
Polynomial kernels are a commonly used kernel function with SVMs that enable the learning of nonlinear models. In the feature space, these characterize similarities between training input vectors and polynomials of the original variables [3,10]. In addition to features of input samples, polynomial kernels normally use a combination of features, which, for the purposes of regression analysis, are called interaction features [3,10].
Equation (5), below, denotes the polynomial kernel function of degree N. The variables x and y are the input space vectors, and c ≥ 0 is a parameter trading off the influence of lower-order against higher-order polynomial terms. If c = 0, the kernel is homogeneous. Numerical instability is the main problem of polynomial kernels: if x·y + c < 1, increasing the value of N drives the kernel function toward zero, while if x·y + c > 1, it tends toward infinity.
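The instability described above can be demonstrated with a short, illustrative computation (the parameter values are chosen only for the demonstration):

```python
def poly_kernel(x_dot_y, c, N):
    """Polynomial kernel value K = (x.y + c)^N, in the spirit of Equation (5),
    taking the precomputed dot product x.y as input."""
    return (x_dot_y + c) ** N

# If x.y + c < 1, the value decays toward zero as N grows;
# if x.y + c > 1, it blows up toward infinity.
print(poly_kernel(0.4, 0.5, 50))  # (0.9)^50, vanishingly small
print(poly_kernel(0.9, 0.5, 50))  # (1.4)^50, astronomically large
```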
A radial basis function (RBF), or Gaussian kernel, is a function which classifies data on a radial basis [3,10]. The separator function with two input space feature vectors of x and y is defined as in Equation (6), below.
‖x − y‖² is the squared Euclidean distance between the two input space feature vectors, while σ is a free parameter. Equation (7) is a simplified version of the RBF kernel function that substitutes γ = 1/(2σ²). The function value decreases with distance, ranging between zero (in the limit) and 1 (when x = y).
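A minimal sketch of the RBF kernel in Equation (7), assuming the standard form K(x, y) = exp(−γ‖x − y‖²):

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel K(x, y) = exp(-gamma * ||x - y||^2),
    where gamma = 1 / (2 * sigma^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
assert rbf_kernel(x, x, gamma=0.5) == 1.0          # maximum when x == y
assert rbf_kernel(x, x + 100.0, gamma=0.5) < 1e-6  # decays toward zero with distance
```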
How RBF kernels behave depends to a large extent on the selection of the gamma parameter. An overly large gamma can lead to overfitting, while a small gamma can constrain the model and render it unable to capture the shape or complexity of the data [3,10]. In addition, this type of kernel is robust against adversarial noise and in its predictions. However, it has more limitations than neural networks [27,28].
Sigmoid kernels, also known as hyperbolic tangent and multilayer perceptron (MLP) kernels, are SVM kernels inspired by neural networks; the bipolar sigmoid function is also used as an activation function for artificial neurons [3,10]. In SVMs, sigmoid kernels are the equivalent of a two-layer perceptron neural network. The sigmoid kernel function is shown in Equation (8), below.
The value of the parameter γ is generally 1/N, with N representing the data dimension and c the intercept constant. In certain ranges of γ and c, a sigmoid kernel behaves like an RBF kernel. The operation x·y is the dot product of the two vectors. The application of sigmoid kernels is similar to that of RBFs and depends on the chosen level of cross-validation [27,28]. It is an appropriate kernel particularly for nonlinear classification in two dimensions or when the number of dimensions is high.
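A minimal sketch of the sigmoid kernel in Equation (8), assuming the standard form K(x, y) = tanh(γ x·y + c):

```python
import numpy as np

def sigmoid_kernel(x, y, gamma, c):
    """Sigmoid (hyperbolic tangent) kernel K(x, y) = tanh(gamma * x.y + c);
    gamma is commonly chosen as 1 / N, with N the data dimension."""
    return np.tanh(gamma * np.dot(x, y) + c)

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
value = sigmoid_kernel(x, y, gamma=1.0 / len(x), c=0.0)
# tanh bounds the kernel value to the open interval (-1, 1).
```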
The main advantages of sigmoid kernels are (1) differentiability at all points of the domain; (2) a fast training process; and (3) a choice of different nonlinearity levels through the setting of their parameters [29]. The main drawback of sigmoid kernels is their limited applicability and the fact that they only outperform RBF kernels in a limited number of cases [29].

Problem Statement
Typically, linear SVM kernels are used to separate class labels by means of lines. However, the data groups involved are not always linearly separable. Kernel functions seek to resolve this problem by projecting the data points from the original space to feature space, in order to enhance their separability. However, kernel functions suffer from limitations and cannot always provide an effective or cost-effective solution. A first problem is that they are not applicable to all datasets. In addition, the transformation process itself is expensive and increases both training and prediction costs.
To overcome the above difficulties, this study introduces an SVM kernel which can effectively classify linearly inseparable class labels, without using kernel functions.

Fuzzy Weighted SVM Kernel
A number of statistical and probability functions and the related abbreviations are used in this research. Table 1, below, lists and briefly describes these for clarity and consistency.

Principal Probability and Statistical Functions
In probability theory, the normal or Gaussian distribution is a continuous probability distribution. It rests on the observation that, especially in the natural sciences, the distribution of data converges to the normal when a sufficiently large number of observations is made. As this distribution takes the form of a bell when plotted on a graph, it is also often informally called the bell curve.
Standard deviation (generally denoted as σ) is a statistical function used to quantify the amount of variation or dispersion in a given dataset. Moreover, it is generally used to define the level of confidence in the accuracy of a given set of data. A high value for standard deviation means that the data are spread over a wide range, whereas a low value indicates that most of the data are close to the mean.
In statistics, the three-sigma, 68-95-99.7, or empirical rule, describes the density of data within standard deviation bands at both sides of the mean point. Specifically, in Gaussian distribution 68.27% (Equation (9)), 95.45% (Equation (10)) and 99.73% (Equation (11)) of the data values should be located within a distance of one, two and three standard deviation bands from both sides of the mean point, respectively.
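The three-sigma percentages in Equations (9)-(11) can be checked empirically on synthetic Gaussian data (the sample size and seed below are arbitrary):

```python
import numpy as np

# Draw a large synthetic Gaussian sample (size and seed are arbitrary).
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
mean, sigma = data.mean(), data.std()

# Fraction of values within one, two and three standard deviations of the mean.
for k, expected in [(1, 0.6827), (2, 0.9545), (3, 0.9973)]:
    inside = np.mean(np.abs(data - mean) <= k * sigma)
    print(f"within {k} sigma: {inside:.4f} (expected ~{expected})")
```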

Data Classification
In the proposed kernel of this research, the role of standard deviation is to quantify the Gaussian distributed data, while the three-sigma rule defines the importance of the data placed in each band.
According to the three-sigma rule, the data located in the first standard deviation bands are the most reliable. Reliability then declines from the second standard deviation bands toward the third, and any data lying beyond the third bands are considered noise and, accordingly, ignored. Figure 3, below, presents an illustration of quantified Gaussian data distribution. The first, second and third standard deviation bands are highlighted in different colors, with the three right and left bands labeled Rσ1, Rσ2 and Rσ3 and Lσ1, Lσ2 and Lσ3, respectively. When data are normally distributed according to the three-sigma rule, the central (first) standard deviation bands hold the highest density, which declines toward the outer bands. If, however, the data are abnormally distributed, they may not comply with the three-sigma rule: the data density in some of the bands might be higher or lower than the expected normal values, and some bands might not even exist. Figure 4A, below, illustrates a normal data distribution, with a higher density within the Lσ1 and Rσ1 bands and a lower density in the more distant bands. Figure 4B, on the other hand, displays an instance of abnormally distributed data: a high density within Lσ1, but the rest of the data dispersed only across the right bands. In Figure 4B, not only do the Lσ2 and Lσ3 bands not exist, but the LLσ1 border has had to be relocated to the point of 0. It should be noted that border relocation, as is the case with LLσ3 in Figure 4A, might sometimes be necessary even if the data are overall normally distributed. The criteria for calculating the left and right borders of the left and right standard deviation bands, respectively, are shown in Equations (12) and (13), below. The terms min and max denote the need to relocate borders, while the word "Null" indicates that, due to abnormal data distribution, the band does not exist.
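The border-relocation logic described above can be sketched as follows. This is an illustrative reading of Equations (12) and (13), not the paper's exact formulas, which are not reproduced here:

```python
import numpy as np

def band_borders(data):
    """Hedged sketch of the border criteria in Equations (12) and (13).
    The k-th left band nominally spans [mean - k*sigma, mean - (k-1)*sigma],
    and the k-th right band mirrors it. A border falling outside the observed
    data range is relocated to the data's min/max, and a band lying entirely
    outside that range is marked None (the paper's "Null")."""
    mean, sigma = np.mean(data), np.std(data)
    lo, hi = np.min(data), np.max(data)
    left, right = {}, {}
    for k in (1, 2, 3):
        # A left band exists only if its inner edge is above the data minimum.
        left[k] = max(mean - k * sigma, lo) if mean - (k - 1) * sigma > lo else None
        # A right band exists only if its inner edge is below the data maximum.
        right[k] = min(mean + k * sigma, hi) if mean + (k - 1) * sigma < hi else None
    return left, right

# For data confined to [0, 1], the third bands fall entirely outside the
# observed range and are marked None, while the second borders are clipped.
left, right = band_borders(np.linspace(0.0, 1.0, 101))
```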

Membership Function
On the universe of discourse X, the fuzzy membership function of set A is defined as μ_A: X → [0, 1], where the mapping value of each element of X varies between 0 and 1. This membership degree quantifies the extent of membership of element x within fuzzy set A. In fuzzy classification, when the membership criteria resemble a polygon, the function is normally denoted as P(x) rather than μ_A(x). In Figure 5A, above, the polygon's limits are respectively labeled a, b, c, d, e and f, where a < b < c < d < e < f. In Figure 5B,C, the polygonal shape of section A is mapped onto the Gaussian distribution, which is segmented according to standard deviation and the three-sigma rule. This mapping is an essential part of the designed classifier. The central standard deviation bands, as can be seen in the above figure, hold the most reliable data and are the densest bands. The inflection points c and d are mapped onto the boundaries of the central bands. The whole area between these points receives the full membership degree of 1, as it holds the most reliable data. The lower limit a and upper limit f are respectively mapped onto LLσ3 and RRσ3, as the leftmost and rightmost borders of reliable data. The areas between LLσ3 and LLσ1 (a and c), as well as between RRσ1 and RRσ3 (d and f), are divided into two equal parts. The leftmost and rightmost bands (a to b and e to f) receive a membership degree greater than or equal to 0 and less than 0.5, as these hold the least reliable data, while the bands at the sides of the central area (b to c and d to e) receive a membership degree greater than or equal to 0.5 and less than 1, as a medium-to-high reliability portion of the data, depending on their precise location along the axis. Figure 5B,C represent the short-tail and long-tail Gaussian data distributions, respectively.
In both distributions, because of the three-sigma rule, the two left and the two right polygon sides follow the slope of the distribution, resulting in a more accurate fuzzy membership degree. This is designed to maximize classification accuracy. The criteria for calculating the polygonal fuzzy membership function are shown in Equation (14), below.
Based on the mapped inflection points of Figure 5B,C, the equivalent values of the limits a, b, c, d, e and f are substituted into Equation (14) and then simplified to construct Equation (15).
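A hedged sketch of the polygonal membership function with limits a < b < c < d < e < f, following the piecewise description above (the exact slopes in Equations (14) and (15) may differ):

```python
def polygonal_membership(x, a, b, c, d, e, f):
    """Illustrative polygonal membership function P(x): 0 outside [a, f],
    rising piecewise-linearly through 0.5 at b to 1 at c, flat at 1 on
    [c, d], then falling through 0.5 at e back to 0 at f."""
    if x < a or x > f:
        return 0.0
    if c <= x <= d:
        return 1.0
    if x < c:  # rising edge: a..b maps to [0, 0.5), b..c maps to [0.5, 1)
        if x < b:
            return 0.5 * (x - a) / (b - a)
        return 0.5 + 0.5 * (x - b) / (c - b)
    # falling edge, mirror image: d..e maps to (0.5, 1), e..f to (0, 0.5]
    if x <= e:
        return 0.5 + 0.5 * (e - x) / (e - d)
    return 0.5 * (f - x) / (f - e)

lims = (0.0, 1.0, 2.0, 3.0, 4.0, 5.0)  # hypothetical limit values
print(polygonal_membership(2.5, *lims))  # central band -> 1.0
```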
However, data do not always comprise a single feature. In such circumstances, the membership degree is the average of the membership degrees of all relevant features. To support multidimensional features, Equation (15) is modified into Equation (16), below, which calculates the membership degree of a given sample, x, in the i-th of k features.
The probability rule of sum is used to calculate the probability of a union of mutually exclusive events by adding their individual probabilities. The summation can also be interpreted as a weighted average, a value that is helpful for problem-solving. Using this rule, Equation (17), below, combines the per-feature membership degrees into the final, normalized membership degree of a sample.

Noise Filtering
The designed classifier eliminates and mitigates natural data noise in two ways. First, it separates reliable data from noisy data, based on Gaussian data distribution and the three-sigma rule. In other words, it ignores any data beyond the reliable boundary. Second, it determines the influence of any given data point in decisions by computing its membership degree (μ). In simple terms, data located in the central bands are more influential than data located in the side bands.

Reference Profiles
The PFW kernel records the statistical properties of all classes of the training data and stores them in the structure presented in Table 2, below. The number of profiles corresponds to the number of data classes, and these are then used to classify future instances. The statistics extracted from each feature of each data group are divided into six standard deviation bands. Lσ1 and Rσ1 (LLσ1(i) ≤ x ≤ RRσ1(i)) hold the most trustworthy data for the i-th feature, and hence the membership degree μ(x) for these bands equals 1. The membership degree for Lσ2 (LLσ2(i) ≤ x < LLσ1(i)) and Rσ2 (RRσ1(i) < x ≤ RRσ2(i)) is 0.5 ≤ μ(x) < 1; and for the third bands, Lσ3 (LLσ3(i) ≤ x < LLσ2(i)) and Rσ3 (RRσ2(i) < x ≤ RRσ3(i)), the value is 0 ≤ μ(x) < 0.5. In cases where a band does not exist because of abnormal data distribution, the associated value range in the profile is replaced with "Null".
At classification time, the PFW kernel calculates the membership degree of the given instance against each of the profiles, according to Equation (17). Its class label is then determined by the profile that achieves the highest membership degree.
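The profile-matching step can be sketched as follows. The triangular membership used here is a simplified stand-in for the polygonal function, and the profile values are illustrative, not taken from the paper:

```python
import numpy as np

def tri(x, lim):
    """Illustrative stand-in membership: triangular over (lo, peak, hi)."""
    lo, peak, hi = lim
    if x <= lo or x >= hi:
        return 0.0
    return (x - lo) / (peak - lo) if x <= peak else (hi - x) / (hi - peak)

def classify(sample, profiles, membership):
    """Score `sample` against every stored profile by its average
    per-feature membership degree and return the best label."""
    scores = {
        label: float(np.mean([membership(x, lim) for x, lim in zip(sample, limits)]))
        for label, limits in profiles.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical one-feature profiles for two classes.
profiles = {
    "clean": [(0.0, 0.5, 1.0)],
    "embedded": [(0.5, 1.0, 1.5)],
}
label, scores = classify([0.45], profiles, tri)
```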

Evaluation and Comparison
To examine the capabilities and classification accuracy of the PFW kernel, we tested it against the linear and RBF SVM kernels in two ways: first, by measuring the run-time of the kernels, and second, by using the kernels as classification engines for PSW steganalysis [31] and comparing their accuracy. The sections below describe the results of the time complexity evaluation, the classification experiment area, the chosen classification feature, the properties of the training and testing image sets and, finally, a comparison of the achieved classification accuracy, together with a discussion of the proposed kernel.

Run-Time Comparison
Using polygonal fuzzy membership in PFW reduces the complexity of the training and classification processes to the set of very simple calculations in Equation (15). To evaluate this improvement, the run-times of the PFW, RBF and linear kernels were measured under identical conditions: a single-dimension database of 1000 numerical feature values in two classes was produced and fed to the kernels, 80% for training and 20% for classification. The scikit-learn library in Python was used to implement the linear and RBF kernels, and the required code was developed for the PFW kernel. The experiment was performed on a computer equipped with a Core i7 processor and 8 gigabytes of RAM. As shown in Table 3, the training and classification processes consumed 46,865 microseconds for PFW, while this value rose to 421,775 and 437,390 microseconds for the linear and RBF kernels, respectively; PFW is thus approximately nine times faster.

Classification Experiment Area
The classification problem tackled was distinguishing least significant bit replacement (LSBR) stego images from clean ones. Steganography is the practice of hiding secret data within the body of digital media, such as images and audio, to conceal its presence. A range of techniques has been devised for this purpose, including least significant bit (LSB) replacement, discrete wavelet transform (DWT) and discrete cosine transform (DCT). Amongst these, LSB is the most widely used method, as it is simple to implement and delivers a high capacity for data hiding. It replaces the least significant bits of the host file with the bits of the secret message. This replacement imposes some noise on the body of the host file.
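A minimal sketch of LSB replacement on a few byte-valued pixels (the pixel and message values are illustrative):

```python
def embed_lsb(pixel_values, message_bits):
    """Minimal LSB-replacement sketch: overwrite the least significant
    bit of each byte-valued pixel with one message bit."""
    return [(p & ~1) | bit for p, bit in zip(pixel_values, message_bits)]

pixels = [100, 101, 102, 103]
stego = embed_lsb(pixels, [1, 0, 1, 0])
# Each pixel changes by at most 1, which is the noise LSBR introduces.
print(stego)
```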
As a response to steganography, steganalysis is the science of uncovering covert communications and thereby defeating those who create these, through a deep structural and technical understanding of steganographic methods [32,33]. Steganalysis, normally by means of an SVM classifier, measures image data anomalies for the likely signs of data embedding.
To evaluate the accuracy of the PFW classifier, we used it to classify the extracted features in an image steganalysis technique called pixel similarity weight (PSW) [31]. Hiding external data within an image reduces the color correlativity that exists between its pixels. To discover whether an image is carrying a secret message, PSW steganalysis analyzes the color correlativity of all image pixels with their neighboring pixels and from these derives the final decision as to whether the image is clean. Sections 5.2.2 and 5.2.3, below, give more insight into how the features are extracted from the images. Full technical details are available in Reference [31].

Classification Features
Images comprise three pixel classes: flat, smooth and edgy. Each of these has a certain percentage of pixels with an identical color in their neighboring zones. However, this level of correlativity falls when external data are embedded in the body of the image [31]. Thus, the selected classification feature in PSW steganalysis is the percentage of pixels with identical color that are immediately connected to the analyzed pixel, in the first, second and third neighboring zones.
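A hedged sketch of this feature for a single pixel: the percentage of same-valued pixels in the square ring at a given neighboring-zone distance (the exact zone definition in PSW may differ):

```python
import numpy as np

def identical_neighbor_percentage(img, x, y, zone):
    """Illustrative PSW-style feature: the percentage of pixels in the
    `zone`-th square ring around (x, y) sharing the analyzed pixel's exact
    value. Border pixels are not handled, for simplicity."""
    value = img[y, x]
    ring = []
    for dy in range(-zone, zone + 1):
        for dx in range(-zone, zone + 1):
            if max(abs(dx), abs(dy)) == zone:  # only the ring at distance `zone`
                ring.append(img[y + dy, x + dx])
    return 100.0 * np.mean(np.array(ring) == value)

# A perfectly flat patch: every first-zone neighbor matches the center pixel.
flat = np.full((7, 7), 128, dtype=np.uint8)
assert identical_neighbor_percentage(flat, 3, 3, 1) == 100.0
```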
Capacity in steganalysis is defined as the number of replaced bits per byte. If it exceeds a certain level, the generated artifacts reveal the presence of the hidden secret. Therefore, PSW steganalysis analyzes the images for 1-bit and 2-bit per byte embedding, up to a threshold that generates no artifacts. Table 4, below, presents the average statistical properties of the pixel classes in the first, second and third neighboring zones, before and after embedding. The numbers clearly show how increasing the embedding ratio progressively reduces the level of color correlativity of the pixels.

Table 4. Average color correlativity of flat, smooth and edgy pixel classes before and after LSB replacement [31].

While the average color correlativity of pixels noticeably drops as the percentage of embedded data increases, the drops only change the density of each data group. In other words, instead of forming linearly classifiable groups, the groups overlap. This means that the results cannot reliably be classified linearly.

Training and Testing Image Sets Formation
In order to train the classifier, three distinct image sets, each consisting entirely of one of these pixel classes, are required. Therefore, a database meeting the three pixel-class definitions [34,35] was manually constructed from 150 images, which were then embedded at 1-bit and 2-bit per byte, for a total of 450 images. The pixel similarity weight feature was then extracted from the clean and embedded images to construct three reference profiles: 0-bit (clean), 1-bit and 2-bit. Table 5, below, presents the fundamental structure of the PSW steganalysis profiles.

Table 5. Fundamental structure of a reference profile in PSW steganalysis [31].

To perform steganalysis, PSW extracts the nine feature values from all pixels and then calculates the average of each one. In the next step, it computes the membership degree of the nine average feature values against each of the three profiles, according to Equation (16). To make the final decision, it then produces the average membership degree of each profile, using Equation (17), yielding three normalized membership degrees corresponding to the three profiles. The profile which achieves the highest membership degree determines which class the image belongs to: 0-bit, 1-bit or 2-bit embedded. More comprehensive technical details about these processes are available in our separate article, "PSW Statistical LSB Image Steganalysis" [31].
To evaluate the accuracy of our PFW kernel, the statistical features of the training image set were used both to construct the PFW reference profiles and to train the linear and RBF SVM kernels. The PSW steganalysis technique [31] was then applied with the three kernels to discriminate clean images from embedded ones. The accuracy of the kernels was evaluated against the UW image database, a set of 1333 images, each in three versions: 0-bit (clean), 1-bit and 2-bit embedded. The dimensions of the UW images vary between 512*768 and 589*883 pixels. The images are mainly of tourist sites, including buildings, objects, people and scenery, across different ranges of lighting, tone and contrast.

Comparison and Discussion
The accuracy results from our tests of the three kernels in classifying clean and embedded copies of the UW images are presented in Table 6, below. These show that, when the data groups have overlapping areas, the accuracy of the classical linear SVM kernel drops, whereas the PFW kernel can still effectively discriminate between different images. In fact, PFW outperforms the linear kernel by a large margin, producing a minimum of 26.021% and a maximum of 40.425% greater accuracy. Comparing the PFW and RBF kernels reveals that RBF outperforms PFW by 1.251% at the 0-bit embedding ratio, while at the two other ratios (1-bit and 2-bit), PFW is more accurate, by 2.894% and 8.693%, respectively. To further illustrate the classification accuracy of each kernel, a sample of pixel classification is presented in Figure 6, below. Image A is the original image, which was broken down into the three pixel classes. The three images labeled B, C and D are the smooth-pixel-class sub-images produced by the PFW, RBF and linear kernels, respectively. These images clearly show that the classification of pixels by the PFW kernel is more accurate and smoother. Moreover, the PFW kernel detected and classified a substantially larger portion of the pixels.

Conclusions
Supervised learning is a class of machine learning in which a function is inferred from labeled training data. The inferred function then determines the class label of unseen instances. One type of supervised learning, SVM (support vector machine), predicts the categories of new examples, based on a given set of training examples. SVM maps the data into a space in which the categories are dividable by a clear gap. However, such mapping is not always sufficient to produce clearly separable categories. In such cases, a further step is required to transform the data from their original space to feature space, increasing the computational cost of training and classification.
Our PFW (polygonal fuzzy weighted) kernel is designed to map data accurately, without transforming them to feature space, even when groups are not linearly separable. The kernel accurately predicts the class of new instances in the original space, even if the groups have overlapping areas. It is structured on Gaussian distribution, standard deviation, the three-sigma rule and a polygonal fuzzy membership function. PFW generates one profile for each data class, based on the statistics extracted during the training process. For the purposes of classification, new instances are then compared with the reference profiles, based on a polygonal fuzzy membership function: the highest degree of membership achieved defines the prediction result.
To test the accuracy of our PFW kernel, we used it together with RBF and conventional linear kernels as classification engines for PSW steganalysis of the same set of 1333 images, whose classification feature data had overlapping data groups. The results showed that PFW clearly outperformed the linear kernel in predicting the class labels of the images, yielding a minimum of 26% and as much as 40% higher classification accuracy. In two out of three image classes, the PFW kernel outperformed the RBF kernel, by 3% and 9%, while RBF delivered better results, by 1%, in the remaining class label.