Entropy “2”-Soft Classification of Objects

We propose a new method for the classification of objects of various nature, named “2”-soft classification, which assigns objects to one of two types with an entropy-optimal probability for the available collection of learning data, taking into account additive errors therein. A randomized decision rule is formed, whose parameters have a probability density function (PDF) determined by the solution of a functional entropy-linear programming problem. A procedure for “2”-soft classification is developed, consisting of the computer simulation of the randomized decision rule with entropy-optimal PDFs of its parameters. Examples are provided.


Introduction
The problem of object classification is highly relevant in contemporary theoretical and applied science. Objects subject to classification can be text documents, audio, video and graphic objects, events, etc. "m"-soft classification means assigning an object to one of m classes with a certain probability, unlike "m"-hard classification, in which each object is assigned to a single class with no alternatives. Here, we consider "2"-soft classification, which is the basis for classification into m classes.
"2"-soft classification is useful in many applied problems, with the exception, perhaps, of "ideal" ones, where all classes, object specifications, and data are absolutely accurate. The fact is that real classification problems are immersed in significantly uncertain environments. Data are received with errors, omissions, questionable reliability, and at different times. The formation of decision rule models and their parameterization is a non-formalized and subjective process that depends on the knowledge and experience of the researcher. By minimizing the empirical risk of a decision rule model in the learning process, we obtain parameter estimates for the available amount of data (precedents) and for the accepted parameterized model, i.e., the estimates are conditional. How they, and consequently the classification results, would behave with other precedents and another parameterized model remains unclear. Methods of "2"-soft classification are intended to indicate a possible approach to overcoming these uncertainty factors. The idea is to make the decision rule randomized, and not arbitrarily so, but such that its entropy, as a measure of uncertainty, is maximal. This allows for generating an ensemble of the best solutions under the highest uncertainty. Although fundamental differences exist between soft and hard classification, the structures of their procedures are similar and are based on the more general concept of machine learning by precedents. A huge amount of work is devoted to this issue. Relevant references can be found in monographs [1-8], lectures [9,10] and reviews [11-13]. The recent fundamental works [6,14,15] clarify the vast diversity of classification algorithms and their learning procedures.
Within the general concept of machine learning, a modification has been proposed: Randomized Machine Learning (RML) [16]. The idea of randomization is extended to the data and to the parameters of decision rules. This means that the model parameters of decision rules and the data errors are assumed to be random in an appropriate way. The difference from existing machine learning procedures is that RML procedures yield not optimal estimates of the model parameters, but estimates of their probability density functions (PDFs) and of the PDFs of the "worst" data errors. In RML, generalized information entropy is used as the optimality criterion; it is maximized over a set described by a system of empirical balances with the collections of learning data.
The principle of maximum entropy has already been used in machine learning, for example, for speech recognition and text classification problems [17] and even for estimating the parameters of deep neural nets [18]. The main advantage of this technique is its robustness to over-fitting in the presence of data errors within small data sets [19,20]. This is demonstrated by the classification experiments with additive random noise presented in this paper.
We assume that the vectors in both collections have the same dimension, i.e., $e^{(i)}, t^{(j)} \in R^n$. Collection E is used for learning, and collection T for classification and testing.
The objects in both collections are labeled by class: an object $e^{(l)}$ or $t^{(k)}$ is assigned 1 if it belongs to the first class and 0 if it belongs to the second. Consequently, the learning collection is characterized by a vector of answers $y = \{y_1, \dots, y_h\}$ with components equal to 0 or 1, and the testing collection by a vector of responses $z = \{z_1, \dots, z_r\}$ (at the end of the last century this was known as learning with a teacher [2]). The indices of the components of these vectors correspond to the object numbers in the learning and testing collections.

Learning
The availability of a learning collection allows for hypothesizing the existence of a function (decision rule) F : E × S 2 → y. The learning problem is to determine a parameterized function $\hat{F}(a)$ that approximately describes the function F. The function $\hat{F}(a)$ characterizes the model of the decision rule. In "soft" classification, the model of the decision rule is randomized, i.e., it has random parameters a. Its inputs are the vectors $\{e^{(1)}, \dots, e^{(h)}\}$, and its output $\hat{y}(a)$ depends on the random parameters a. As such a model, we choose a single-layer neural net [21], given by (2) and (3). In the randomized model (2), (3), the parameters $a = \{a_1, \dots, a_{n+2}\}$ are of interval type. Their probabilistic properties are characterized by a PDF P(a) defined over the set A.
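As an illustration of such a randomized decision rule, here is a minimal Python sketch. It assumes a sigmoid with "slope" α and "threshold" ∆ applied to a weighted sum of the features, and it assumes that the last two of the n + 2 parameters play the roles of α and ∆. The function names, the interval bounds, and the uniform sampler (a stand-in for a PDF P(a) over the box A) are hypothetical and not taken from the paper.

```python
import math
import random

def sigm(x, alpha, delta):
    """Sigmoid with slope alpha and threshold delta; values lie in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-alpha * (x - delta)))

def decision_rule(e, a):
    """Output of the assumed single-layer model for one object e.

    a holds n weights followed by the slope and the threshold (n + 2 values).
    """
    n = len(e)
    weights, alpha, delta = a[:n], a[n], a[n + 1]
    return sigm(sum(w * x for w, x in zip(weights, e)), alpha, delta)

def sample_parameters(intervals, rng):
    """Draw one parameter vector uniformly from the box A (placeholder for P(a))."""
    return [rng.uniform(lo, hi) for lo, hi in intervals]

rng = random.Random(0)
# Hypothetical interval bounds: four weights, then slope and threshold.
intervals = [(-1.0, 1.0)] * 4 + [(1.0, 2.0), (0.0, 0.5)]
a = sample_parameters(intervals, rng)
y_hat = decision_rule([0.2, 0.7, 0.1, 0.9], a)  # one random output in (0, 1)
```

Each new draw of a gives a different output for the same object, which is what produces the ensemble of outputs used in soft classification.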
Since the parameters a are random, for each object $e^{(i)}$ there arises an ensemble $\hat{Y}^{(i)}$ of random numbers $\hat{y}^{(i)}(a)$ from the interval [0, 1]. We characterize this ensemble by its average over the PDF P(a). Therefore, according to the general RML procedure [16], the problem of "2"-soft classification is stated as the maximization of an entropy functional of P(a) over the set defined by the balances between these averages and the learning answers y, where P(a) belongs to the class $C^1$ of continuously differentiable functions. This is a problem of functional entropy-linear programming, which has an analytical solution: the entropy-optimal PDF $P^*(a \mid \theta)$, parameterized by Lagrange multipliers θ, where $\hat{y}^{(i)}(a)$ is defined by Equality (2) and the Lagrange multipliers are determined from Equation (8).
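For orientation, the standard maximum-entropy formulation consistent with this description can be written as below. This is a sketch under assumptions: the sign convention for θ and the exact form of the paper's numbered equations may differ, and any noise terms of the general RML procedure are omitted.

```latex
% Entropy maximization over PDFs on A, with normalization and empirical balances
H[P(a)] = -\int_A P(a)\,\ln P(a)\,da \;\to\; \max,
\qquad
\int_A P(a)\,da = 1,
\qquad
\int_A P(a)\,\hat{y}^{(i)}(a)\,da = y_i,\quad i = 1,\dots,h.

% Exponential-family form of the maximizer, parameterized by Lagrange multipliers
P^*(a \mid \theta) =
\frac{\exp\!\Big(-\sum_{i=1}^{h}\theta_i\,\hat{y}^{(i)}(a)\Big)}
     {\int_A \exp\!\Big(-\sum_{i=1}^{h}\theta_i\,\hat{y}^{(i)}(a)\Big)\,da}.
```

Substituting $P^*(a \mid \theta)$ into the h balance constraints gives h nonlinear equations for the h multipliers $\theta_1, \dots, \theta_h$.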

Testing
At this stage, a collection of objects T is used, characterized by the vector of responses $z = \{z_1, \dots, z_r\}$, with each object known to belong to class 1 or 2. The vector z is used to evaluate the quality of testing. The testing procedure itself uses only the collection of objects T.
The subject of testing is the randomized decision rule (2), (3) with the entropy-optimal PDF of its parameters. A sequence of N Monte Carlo trials is carried out, in each of which a random vector a is generated with the entropy-optimal PDF $P^*(a)$ (6)-(8). Assume that, as a result of these trials, the first object from the testing collection was assigned to the first class $N_1$ times and to the second class $N - N_1$ times, ..., the k-th object was assigned to the first class $N_k$ times and to the second class $N - N_k$ times, etc. For a sufficiently large number of trials, the empirical probabilities can be determined as $p_1^{(k)} = N_k / N$ and $p_2^{(k)} = (N - N_k) / N$ (11). Therefore, the testing algorithm can be represented as follows (where i is the number of the object):
Step 1-i. In accordance with the optimal PDF, a set of output values of the entropy-optimal model (2), (3) is generated, comprising N random numbers from the interval [0, 1].
Step 2-i. If a random number from this set is larger than 1/2, then the object $t^{(i)}$ is assigned to class 1. If it is less than 1/2, it is assigned to class 2.
Step 3-i. The empirical probabilities (11) are determined.
As a result of this procedure, any object can be assigned to one of two classes with a certain probability, which reflects the uncertainty in the data and in the models of the decision rule.
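The testing steps above can be sketched in Python as follows. This is only an illustration under assumptions: `demo_rule` and `demo_sampler` are hypothetical stand-ins for the entropy-optimal model and for a sampler of its PDF, and each parameter draw is reused for all objects, which leaves the per-object empirical probabilities unchanged in distribution.

```python
import math
import random

def soft_classify(objects, sample_a, decision_rule, n_trials, rng):
    """Return one (p1, p2) pair per object from n_trials Monte Carlo draws."""
    counts = [0] * len(objects)
    for _ in range(n_trials):
        a = sample_a(rng)                      # Step 1: draw random parameters
        for i, t in enumerate(objects):
            # Step 2: outputs of at least 1/2 vote for class 1
            # (a tie at exactly 1/2 has probability zero for continuous outputs).
            if decision_rule(t, a) >= 0.5:
                counts[i] += 1
    # Step 3: empirical probabilities of class 1 and class 2
    return [(c / n_trials, 1.0 - c / n_trials) for c in counts]

def demo_rule(t, a):
    """Hypothetical rule: sigmoid of a weighted sum (slope 1, threshold 0)."""
    s = sum(w * x for w, x in zip(a, t))
    return 1.0 / (1.0 + math.exp(-s))

def demo_sampler(rng):
    """Hypothetical sampler: uniform on [-1, 1]^4 instead of the optimal PDF."""
    return [rng.uniform(-1.0, 1.0) for _ in range(4)]

rng = random.Random(1)
T = [[rng.random() for _ in range(4)] for _ in range(5)]
probs = soft_classify(T, demo_sampler, demo_rule, n_trials=1000, rng=rng)
```

Replacing `demo_sampler` with a sampler of the entropy-optimal PDF turns this sketch into the procedure described in the text.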
The transition to a hard classification is accomplished by fixing a probability threshold: an object whose empirical probability of belonging to a class exceeds the threshold is assigned to that class. The number of objects that can be "hard"-classified depends on the threshold value. It is easy to see that for thresholds greater than 0.5 not every object is classified, but each classified object belongs to its class with probability greater than 0.5. For thresholds less than 0.5, every object is classified, but the probability of belonging may be less than 0.5.
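A small Python sketch makes the effect of the threshold concrete. The helper `classes_above` and the probability values are hypothetical; the function simply lists, for each object, the classes whose empirical probability exceeds the threshold.

```python
def classes_above(probs, threshold):
    """For each (p1, p2) pair, list the classes whose probability exceeds threshold."""
    return [[c for c, p in ((1, p1), (2, p2)) if p > threshold]
            for p1, p2 in probs]

# Hypothetical empirical probabilities for three objects.
probs = [(0.9, 0.1), (0.6, 0.4), (0.35, 0.65)]

strict = classes_above(probs, 0.7)  # some objects are left unclassified
loose = classes_above(probs, 0.3)   # every object gets at least one class
```

With the threshold 0.7 only the first object is classified, while with 0.3 every object is classified but two of them qualify for both classes, so an assignment may carry a probability below 0.5.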

Model Examples of "2"-Soft Classification
In this section, we present model experiments conducted in accordance with the proposed learning algorithm. It should be noted that all data sets are synthetic and were generated with a standard random number generator. The first series of experiments aims to introduce the proposed computational procedure and should be considered as illustrative examples. The last example, in the next section, is more important and demonstrates the advantages of the probabilistic approach to classification in the presence of data errors.

Soft "2"-Classification of Four-Dimensional Objects
Consider objects characterized by four features, which are the coordinates of the vectors e and t.

Testing
At this stage, a set of objects $T = \{t^{(1)}, \dots, t^{(r)}\}$ is used, where each element of the set is characterized by a vector $t^{(j)} \in R^4$. An array of 500 four-dimensional random vectors $t^{(i)}$, $i = 1, \dots, 500$, was generated with independent components uniformly distributed on the interval [0, 1]. Then, the algorithm of "2"-soft classification was applied. Figure 3 shows the empirical probabilities of the objects $t^{(i)}$ belonging to Classes 1 and 2.

Two-Dimensional Objects "2"-Soft Classification
Consider the objects characterized by two features that are coordinates of the vectors e and t.

Learning
The learning collection consists of three objects, each described by two attributes, whose values are shown in Table 1. The values of the parameters α and ∆ and the intervals for the random parameters a are the same as in Example 1. The Lagrange multipliers for the entropy-optimal PDF (9) have the following values: $\theta^* = \{9.6316;\ -18.5996;\ 16.7502\}$. The entropy-optimal PDF $P^*(a \mid \theta^*)$ for this learning collection follows from (9) with these multipliers.

Testing
All parameters of this example correspond to Example 2. Figure 5 shows the empirical probabilities of the objects $t^{(i)}$ belonging to Classes 1 and 2 ($i = 1, \dots, 500$).

Experimental Studies of "2"-Hard/Soft Classifications in Presence of Data Errors
"2"-soft classification, based on entropy-randomized decision rules, creates a family of "2"-hard classifications, parameterized by thresholds on the probabilities of belonging. This family complements "2"-hard classifications based on machine learning methods, in particular those using the method of least squares [9].

Data
The experimental study of the soft and hard classifications was performed on simulated data with model errors. All objects were labeled according to their belonging to one of two classes. This information was used for learning the model of the decision rule and, during testing, for evaluating the classification accuracy.
The objects of classification are characterized by four-dimensional vectors. The vectors of both collections were chosen at random, uniformly distributed in the four-dimensional unit cube. Data errors are modeled by random four-dimensional vectors ξ uniformly distributed in $[\xi^-, \xi^+]$, where $\xi^\pm = \pm 0.1$.
To label the vectors u as belonging to one of the two classes, a procedure generating numbers from the interval [0, 1] is applied, defined by (15) and (16). Recall that values of the sigmoid function in the interval [0.5, 1] correspond to the first class, and values in the interval [0, 0.5) to the second class. Thus, the components $u^{(i)}_s$, $s = 1, \dots, 4$; $i = 1, \dots, 510$, of the vectors serve as the features of the objects, and the values $c_i$, set by this rule, assign them to a certain class.
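A data set of this kind could be simulated along the following lines. This is a sketch under assumptions: the weights, slope, and threshold of the labeling sigmoid are hypothetical, and the noise is assumed to be added to the observed features after labeling, which the text does not specify.

```python
import math
import random

def make_dataset(n_objects, weights, alpha, delta, noise, rng):
    """Return a list of (observed_features, label) pairs.

    Clean features are uniform on the unit cube; the label is 1 when the
    sigmoid of the weighted sum is at least 0.5 and 0 otherwise.
    """
    data = []
    for _ in range(n_objects):
        u = [rng.random() for _ in weights]
        s = sum(w * x for w, x in zip(weights, u))
        score = 1.0 / (1.0 + math.exp(-alpha * (s - delta)))
        label = 1 if score >= 0.5 else 0
        # Additive error, uniform on [-noise, noise] per component (assumption:
        # the error perturbs the observed features, not the label).
        observed = [x + rng.uniform(-noise, noise) for x in u]
        data.append((observed, label))
    return data

rng = random.Random(42)
dataset = make_dataset(510, [0.8, -0.5, 0.3, 0.6], alpha=1.0, delta=0.5,
                       noise=0.1, rng=rng)
```

Since the slope is positive, the label equals 1 exactly when the weighted sum is at least ∆, so the choice of weights and threshold controls the class balance.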

Randomized Model (Decision Rule)
In the experimental study, we used a model that coincides in structure with (15) and (16) but has randomized parameters and noises, where $a = \{a_1, \dots, a_4\}$ are randomized parameters of interval type, for example, $a \in A = [-1, 1]$, with a PDF that is a continuously differentiable function.
Figure 6 shows a two-dimensional section of the entropy-optimal PDF $P^*(a, \theta^*)$.

Testing of Learning Model: Implementation of "2"-Soft Classification
Testing the randomized classification algorithm involves generating a random set of model parameters according to the obtained entropy-optimal density functions. To generate an ensemble of random values, the Metropolis-Hastings algorithm is used [22].
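A random-walk variant of the Metropolis-Hastings algorithm on a box can be sketched as follows. The sketch assumes that the target PDF is proportional to exp(−Σ θ_i ŷ^(i)(a)), the usual maximum-entropy form; the objects, multipliers, step size, and burn-in length are hypothetical illustration values.

```python
import math
import random

def make_log_density(objects, theta):
    """Log of the assumed unnormalized PDF exp(-sum_i theta_i * y_hat_i(a))."""
    def log_density(a):
        total = 0.0
        for e, th in zip(objects, theta):
            s = sum(w * x for w, x in zip(a, e))
            total += th / (1.0 + math.exp(-s))  # theta_i times sigmoid output
        return -total
    return log_density

def metropolis_hastings(log_density, start, bounds, n_samples, step, rng,
                        burn_in=500):
    """Random-walk Metropolis-Hastings restricted to the box given by bounds."""
    current, current_lp = list(start), log_density(start)
    samples = []
    for it in range(burn_in + n_samples):
        proposal = [x + rng.gauss(0.0, step) for x in current]
        # Proposals outside the box have zero density and are rejected.
        if all(lo <= x <= hi for x, (lo, hi) in zip(proposal, bounds)):
            lp = log_density(proposal)
            # Symmetric proposal: accept with probability min(1, density ratio).
            if rng.random() < math.exp(min(0.0, lp - current_lp)):
                current, current_lp = proposal, lp
        if it >= burn_in:
            samples.append(list(current))
    return samples

rng = random.Random(7)
objects = [[0.2, 0.9, 0.4, 0.1], [0.7, 0.3, 0.8, 0.5], [0.5, 0.5, 0.1, 0.9]]
theta = [2.0, -1.5, 0.5]          # hypothetical Lagrange multipliers
bounds = [(-1.0, 1.0)] * 4
samples = metropolis_hastings(make_log_density(objects, theta), [0.0] * 4,
                              bounds, n_samples=100, step=0.3, rng=rng)
```

Consecutive samples of such a chain are correlated, so in practice one would typically keep only every k-th sample or run a longer chain before drawing conclusions.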
In this example, a set of vectors $a^{(k)} = \{[a_1^{(k)}]^*, \dots, [a_4^{(k)}]^*\}$, $k = 1, \dots, 100$, was generated with the entropy-optimal PDF $P^*(a, \theta^*)$ (22), defining 100 realizations of the decision rule model. For each object i from the testing collection, we obtain an ensemble of numbers from the interval [0, 1]. Over this ensemble, the object is assigned to classes by comparing each number with 1/2. The probabilities of belonging to Classes 1 and 2 are calculated as the relative frequencies $p_1 = N_1 / N$ and $p_2 = N_2 / N$, where N = 100 is the number of trials and $N_1$ and $N_2$ are the numbers of trials in which the object was classified as Class 1 and Class 2, respectively. Figure 7 shows the distribution of the empirical probabilities of belonging to Classes 1 and 2 for the objects of the testing sample, which characterizes "2"-soft classification with the entropy-optimal linear decision rule.
As noted above, "2"-soft classification generates families of "2"-hard classifications by assigning various thresholds δ to the empirical probability; belonging to a class is then defined by comparing the empirical probability with δ. We define the accuracy of classification as $\eta = (L_{tr} / L) \cdot 100\%$, where L is the length of the test collection and $L_{tr}$ is the number of correct classifications. The study of the above example shows a dependence between the accuracy η and the threshold δ. Figure 8 shows this dependence: η attains its maximum value of 78.5% at δ = 0.19. By contrast, "2"-hard classification, which uses the decision rule (26) with non-randomized parameters whose values are determined by the method of least squares, yields an accuracy of 66% (regardless of the threshold).
This result means that the traditional exponential model widely used for classification problems [6,9] can be improved by a statistical interpretation of its output. In this case, the proposed randomized machine learning technique, which involves the generation of entropy-optimal probability density functions and the variation of the soft decision threshold, boosts the classification accuracy to nearly 80 percent.
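The dependence of accuracy on the threshold can be explored with a short Python sketch. It assumes the rule "Class 1 if the empirical probability of Class 1 is at least δ, otherwise Class 2", with labels 1 and 0 for the first and second class; the probabilities and labels below are hypothetical toy values, not the data of the experiment.

```python
def accuracy_at(p1_values, true_labels, delta):
    """Fraction of correct hard decisions for threshold delta (L_tr / L)."""
    predicted = [1 if p1 >= delta else 0 for p1 in p1_values]
    correct = sum(p == c for p, c in zip(predicted, true_labels))
    return correct / len(true_labels)

def best_threshold(p1_values, true_labels, grid):
    """First threshold on the grid that maximizes the accuracy."""
    return max(grid, key=lambda d: accuracy_at(p1_values, true_labels, d))

# Hypothetical empirical probabilities of Class 1 and true labels.
p1_values = [0.1, 0.25, 0.3, 0.6, 0.9]
true_labels = [0, 1, 1, 1, 1]

grid = [k / 20 for k in range(1, 20)]            # 0.05, 0.10, ..., 0.95
acc_half = accuracy_at(p1_values, true_labels, 0.5)
delta_best = best_threshold(p1_values, true_labels, grid)
```

In this toy case a threshold well below 0.5 recovers the objects of Class 1 that receive a low empirical probability, which mirrors the kind of effect reported for Figure 8.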

Conclusions
A method for "2"-soft classification was proposed, which assigns objects to one of two classes with a calculated empirical probability. The latter is determined by the Monte Carlo method using the entropy-optimal randomized model of the decision rule. The corresponding problem is formulated as the maximization of an entropy functional on a set whose configuration is determined by balances between the real input data and the average output of the randomized model of the decision rule.
The problem of "2"-soft classification was considered for the case of data errors simulated by additive noise uniformly distributed in a parallelepiped. Entropy-optimal estimates of the probability density functions of the model parameters and of the noises, which are the best under maximum uncertainty in terms of entropy, were obtained. We performed an experimental comparison of "2"-hard and "2"-soft classifications. An interval of classification thresholds was found in which the accuracy of "2"-soft classification (the number of correct answers) increased by nearly 20% (from 66% to 78.5% in the experiment above). Examples illustrating the proposed method are provided.

Figure 1. Graph of the sigmoid function sigm(x) with parameters "slope" α and "threshold" ∆. Values of sigm(x) in the interval [1/2, 1] correspond to the first class, and values in the interval [0, 1/2) to the second class.

Figure 7. The empirical probabilities of belonging to Classes 1 and 2.

Figure 8. Dependence of RML model accuracy on the threshold δ.

Table 1. Learning data example.