Small Stochastic Data Compactification Concept Justified in the Entropy Basis

Kovtun, Viacheslav; Zaitseva, Elena; Levashenko, Vitaly; Grochla, Krzysztof; Kovtun, Oksana

doi:10.3390/e25121567

Open AccessArticle

Small Stochastic Data Compactification Concept Justified in the Entropy Basis

by

Viacheslav Kovtun

^1,*

,

Elena Zaitseva

²

,

Vitaly Levashenko

²,

Krzysztof Grochla

¹

and

Oksana Kovtun

³

¹

Internet of Things Group, Institute of Theoretical and Applied Informatics Polish Academy of Sciences, Bałtycka 5, 44-100 Gliwice, Poland

²

Department of Informatics, University of Žilina, 010 26 Žilina, Slovakia

³

Department of the Theory and Practice of Translation, Faculty of Foreign Languages, Vasyl’ Stus Donetsk National University, 600-Richchya Str., 21, 21000 Vinnytsia, Ukraine

^*

Author to whom correspondence should be addressed.

Entropy 2023, 25(12), 1567; https://doi.org/10.3390/e25121567

Submission received: 13 October 2023 / Revised: 15 November 2023 / Accepted: 18 November 2023 / Published: 21 November 2023

(This article belongs to the Section Multidisciplinary Applications)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Measurement is a typical way of gathering information about an investigated object, generalized by a finite set of characteristic parameters. The result of each iteration of the measurement is an instance of the class of the investigated object in the form of a set of values of characteristic parameters. An ordered set of instances forms a collection whose dimensionality for a real object is a factor that cannot be ignored. Managing the dimensionality of data collections, as well as classification, regression, and clustering, are fundamental problems for machine learning. Compactification is the approximation of the original data collection by an equivalent collection (with a reduced dimension of characteristic parameters) with the control of accompanying information capacity losses. Related to compactification is the data completeness verifying procedure, which is characteristic of the data reliability assessment. If there are stochastic parameters among the initial data collection characteristic parameters, the compactification procedure becomes more complicated. To take this into account, this study proposes a model of a structured collection of stochastic data defined in terms of relative entropy. The compactification of such a data model is formalized by an iterative procedure aimed at maximizing the relative entropy of sequential implementation of direct and reverse projections of data collections, taking into account the estimates of the probability distribution densities of their attributes. The procedure for approximating the relative entropy function of compactification to reduce the computational complexity of the latter is proposed. To qualitatively assess compactification this study undertakes a formal analysis that uses data collection information capacity and the absolute and relative share of information losses due to compaction as its metrics. Taking into account the semantic connection of compactification and completeness, the proposed metric is also relevant for the task of assessing data reliability. Testing the proposed compactification procedure proved both its stability and efficiency in comparison with previously used analogues, such as the principal component analysis method and the random projection method.

Keywords:

machine learning; data analysis; entropy; data reliability; small data; stochastic data; compactification; completeness; parametric optimization

1. Introduction

The most valuable resource in the information society is data. It seems that “there is no such thing as too much data”, but let us try to look at this catchphrase as data scientists. The “curse of dimensionality” is a problem that consists of the exponential growth of the amount of data that has occurred simultaneously with the growth of the dimensionality of the space for data representation. This term was introduced by Richard Bellman in 1961. Scientists dealing with mathematical modelling and computational methods were the first to face this problem. Now, this problem is faced again as machine learning and artificial intelligence methods are implemented. In this study, we will illustrate the relevance of this problem using the k-nearest neighbour method, which is popular for solving classification problems [1,2,3,4]. The essence of the method is as follows: the instance belongs to the same class as that which the majority of its nearest neighbour instances in the parametric space belong. To ensure high-quality work with this method, the saturation density of the parametric space with instances must be sufficiently high. How are the parametric space dimensions, the density of instances, and their number related to each other? To uniformly cover a unit interval [0, 1] with a density 0,01, we need 100 points, where the coverage density is defined as the ratio of the number of points evenly distributed in the target interval to the length of the latter. Now, imagine a 10-dimensional cube. To achieve the same coverage density, we already need 10²⁰ points, that is, 10¹⁸ times more points compared to the original 1-dimensional space. This example demonstrates the reason for the inefficiency of the brute force method in typical machine learning problems (classification, clustering, and regression) [5,6,7,8,9]. The paradox is that it is impossible to solve the mentioned applied problems using a small number of parameters and achieve adequate results. One can simply turn a blind eye to the problem of dimensionality, which is the paradigm of deep learning, where using non-parameterized models achieves a significant increase in their quality despite the colossal increase in the number of calculations and accepting as an axiom the potential instability of the training process. But this recipe is unacceptable in the context of the machine learning ideology. The following Table 1 contains a more detailed comparison of these two methods.

Therefore, managing the dimensionality of data while preserving their quality and the representativeness of the parametric space is an urgent scientific problem for machine learning.

The most widely used method for reducing data dimensionality is singular value decomposition (SVD, [10,11,12]). The matrices obtained as a result of SVD have a very specific interpretation in the machine learning methodology. They can be used according to the proven method both for principal component analysis (PCA, [13,14,15]) and (with certain reservations) for non-negative matrix factorization (NMF, [16,17,18]). SVD can also be used to improve the results of independent component analysis (ICA, [19,20,21]). It is convenient to apply SVD because there are no restrictions on the structure of the original data matrix (square when using the LU [22] or Schur distribution [23]; square, symmetric, or positive definite when using the Cholesky distribution [24]; matrix with positive elements when applying NMF). The essence of SVD is the representation of the original matrix X as a product of matrices of the form

X = U Σ V^{*}

, where U is a unitary matrix of order m and ∑ is a rectangular diagonal matrix of dimension (m × n), where m is a number of instances and n is a number of measured observables, with singular elements on the main diagonal and

V^{*}

is a matrix of order n, obtained as a result of conjugate transpose of the matrix V. The matrix ∑ is important for the dimensionality management problem. The squared singular elements of this matrix are interpreted as the variance σ² of the corresponding component. Based on the value of these variances, the researcher can select the required number of components. What is the best value

\sum_{m} σ^{2}

? Some recommend maintaining the inequality

\sum_{m} σ^{2} \geq 0, 90

, while others believe that

\sum_{m} σ^{2} \geq 0, 50

is sufficient. The original answer to this question is provided by Horn’s parallel analysis based on Monte Carlo simulation [25]. The disadvantage of both SVD and PCA is the high computational complexity of obtaining a singular distribution (well-known randomized algorithms [26] slightly mitigate this limitation). A more serious limitation is the sensitivity of SVD/PCA to outliers and the type of distribution of the original data. Most researchers believe that SVD/PCA works consistently with normally distributed data, but it has been empirically found that, as the data dimensionality increases, there are exceptions even to this rule. Therefore, SVD/PCA methods cannot guarantee the stability of the data dimensionality reduction procedure.

NMF is used to obtain the decomposition of a non-negative matrix

X_{(m \times n)}

into non-negative matrices

W_{(m \times k)}

and

H_{(k \times n)}

: X = WH. By choosing

k < < m, n

, we can solve the problem of reducing the dimensionality of the original matrix quite effectively. The problem is that, unlike SVD, finding the X = WH decomposition does not have an exact solution. There are specialized formulations of quadratic programming problems, such as the support vector machine (SVM, [27,28,29]) [30]. However, we understand that this means that NMF has the same limitations that have been pointed out for SVD/PCA.

The ICA method crossed into machine learning from the signal processing theory and, in its original formulation, was intended for the decomposition of a signal with additive components. At the same time, it was believed that these components have an abnormal distribution, and the sources of their origin are independent. To determine independent components, either minimization of mutual information based on Kullback–Leibler divergence [19] or minimization of “non-Gaussianity” [20,21] (using measures such as kurtosis coefficient and negentropy) are used. In the context of the dimensionality reduction problem, the application of ICA is trivial: to represent the input data as a mixture of components, divide them and select a certain number. There is no analytically consistent criterion for component selection.

We have often mentioned machine learning methods in the context of the data dimensionality management problem. However, there are competitors originating from the artificial intelligence field, i.e., the autoencoders [31,32,33]. This is an original class of neural networks, created so that the signal given to the input layer is reproduced as accurately as possible at the output of the neural network. The number of hidden layers should be at least one, and the activation functions of neurons on these layers should be non-linear (most often sigmoid, tanh, ReLu). If the number of neurons in the hidden layer is less than the number of neurons in the input layer, and we reproduce the input signal at the same time with sufficient accuracy as the output of the trained autoencoder, then the parameters of the neurons of the hidden layer are a compact approximate representation of the input signal. The advantage of this approach is that the neural network works for us. It is also very easy to orient the autoencoder to solve the data dimensionality increasing problem: it is sufficient that there are more neurons on the hidden layer than on the input layer. Disadvantages are also known: empirical search for the optimal configuration of the neural network (number of hidden layers, number of neurons on those layers, and selection of their activation functions), empirical selection of both the training algorithm and its parameters), and the neural network regularization methods (L1, L2, dropout). And we have not yet focused on the specific drawback of autoencoders, i.e., the tendency to degenerate hidden layers in the training process.

In recent years, there has been a growing interest in the research of data analysis, particularly within the context of regression analysis applied to inhomogeneous datasets. The existing research [34] explores the challenges presented by data that can be gathered from various sources or recorded at different time intervals, resulting in inherent inhomogeneities that complicate the process of regression modelling. The conventional framework of independent and identically distributed errors, typically associated with a single underlying model, is inadequate for handling such data. As the authors claim, traditional alternatives, like time-varying coefficients models or mixture models, can be computationally burdensome and impractical. So, the paper [34] proposes an aggregation technique based on normalized entropy (neagging) in contrast with such well-known aggregation procedures as bagging and magging. This approach has shown great promise, and the paper provides practical examples to illustrate its effectiveness using real-world datasets across various scenarios. However, the authors position their solution for working with large amounts of data or Big data. The issue of applicability of the mentioned procedures for compactification of small variable data has not been considered.

Taking into account the strengths and weaknesses of the mentioned methods, we will formulate the necessary attributes of scientific research.

The research object is the process of stochastic empirical data collection compactification.

The research subjects are probability theory and mathematical statistics, information theory, computational methods, mathematical programming methods, and experiment planning theory.

The research purpose is to formalize the process of finding the optimal probability distribution density of stochastic characteristic parameters of the empirical data compactification model with the maximum relative entropy between the original and compactified entities.

The research objectives are:

-: formalize the concept of calculating the variable entropy estimation of the probability distribution density of the characteristic parameters of the stochastic empirical data collection;
-: formalize the process of the stochastic empirical data collection compactification with the maximization of the relative entropy between the original and compactified entities;
-: justify the adequacy of the proposed mathematical apparatus and demonstrate its functionality with an example.

The Motivation. One derives quantitative information on a class of objects by measuring a set of observables (“characteristic parameters”) on a sample of objects taken from the class of interest. A set of values taken by the chosen observables on one of the objects is an instance. One of the basic problems in general data analysis is finding the optimal number of instances and the optimal (minimal) number of observables, that allow, in the presence of noise, to build regression models, estimate correlations between observables, and classify and cluster the objects in a machine learning approach. In this perspective, which is a very relevant one, the authors propose a model of noisy data based on a conditional, relative entropy [Equation (6)]. The article introduces a consistent and tunable method of “compactification” that performs quite well concerning other established methods, such as PCA and random projection methods.

2. Models and Methods

2.1. Statement of the Research

Let us characterize the researched process using a model in terms of linear programming, that is, by a function

z = f (v, w)

that summarizes n weighted characteristic parameters

v \in ℝ^{n}

, where the weights w are interval stochastic values:

w \in W = \{w^{-} \leq w \leq w^{+}\}

, the properties of which are characterized by the probability distribution density P(w).

Suppose that, as a result of m observations of the investigated process, empirical data with the structure

〈V, y〉

were obtained, where V is the training collection and each empirical parametric vector

v^{(i)} = (v_{i 1}, \dots, v_{i n}) \in V

,

v^{(i)} \in ℝ^{n}

, corresponds to an empirical initial value

y_{i} \in y

,

\forall i = \bar{1, m}

. When substituting data V into the model z, the equality of

z = \{z_{i}\} = V w, i = \bar{1, m},

(1)

must be fulfilled and which is provided by the training of the model z.

We consider that the values y_i of the original empirical vector y contain interference, which are represented by stochastic vector values

ε_{i} \in ε

,

i = \bar{1, m}

,

ε \in E = \{ε^{-} \leq ε \leq ε^{+}\}

, with the probability density function

L (ε)

of a stochastic vector

ε

. Taking into account interferences, we present expression (1) as

u = z + ε = V_{(m \times n)} w + ε,

(2)

where

u \in U = [u^{-}, u^{+}]

,

u^{-} = V w^{-} + ε^{-}

,

u^{+} = V w^{+} + ε^{+}

.

In the context of the formulated equation, the machine learning methodology is focused on determining the estimates

\overset{⌢}{P} (w)

and

\overset{⌢}{L} (ε)

of the corresponding probability distribution densities. The basis for this is model (2) and a set of empirical data V. Based on the known estimates of

\overset{⌢}{P} (w)

and

\overset{⌢}{L} (ε)

, it is possible to outline the domain of stochastic vectors

u \in U

. Such a problem will be referred to as a d-problem. The authors devoted the article [35] directly to the solution of the d-problem.

On the other hand, the problem of compactification of the parametric space V of model (2) is solved by reducing the dimension of the characteristic parameters from n to r units, r < n, is also of practical value. Such a problem will be referred to as a c-problem.

Suppose that, as a result of the compactification of the original empirical data with the structure

〈V, y〉

, a shortened parametric space

ℝ^{r}

is obtained where each parametric vector

y^{(i)} = (v_{i 1}, \dots, v_{i r}) \in Y

,

y^{(i)} \in ℝ^{r}

, or

i = \bar{1, m}

, corresponds to the original interval stochastic value

a \in A = \{a^{-} \leq a \leq a^{+}\}

,

j = \bar{1, r}

, with the probability distribution density A(a).

To describe compactified data

〈Y, a〉

, we define the model

b = Y_{(m \times r)} a, a \in R^{r}, b \in R^{m},

(3)

and the vector of observations is expressed as

s = b + ξ,

(4)

where the stochastic vector

ξ

is formed by interval values

Ξ = \{ξ^{-} \leq ξ \leq ξ^{+}\}

with the probability distribution density

Z (ξ)

. The vectors s defined by expression (4) are interpreted as

S = [s^{-}, s^{+}]

,

s^{-} = Y a^{-} + ξ^{-}

,

s^{+} = Y a^{+} + ξ^{+}

.

Our further actions will be aimed at formulating:

-: optimality criterion of the compactified data matrix Y_(m×r);
-: a method for calculating the elements of the optimal compactified data matrix Y_(m×r);
-: a method for comparing the probability distribution densities of outputs of models (2) and (4) as an indicator of the effectiveness of the proposed compactification concept.

2.2. The Concept of Entropy-Optimal Compactification of Stochastic Empirical Data

Let us focus on the analytical formalization of the entropic properties of empirical data, summarized by the matrix V. Let there be m independent instances in the collection of class X, each of which is characterized by the values of n attributes (characteristic parameters). The selection of instances in the collection X is random. In this context, the matrix X summarizes

x_{i j}

,

i = \bar{1, m}

,

j = \bar{1, n}

, stochastic attributes whose values are real numbers:

x_{i j} \geq 0

,

i = \bar{1, m}

,

j = \bar{1, n}

, satisfying the condition

\sum_{i = 1}^{m} \sum_{j = 1}^{n} x_{i j} \leq W

, where W is determined by the region of origin of instances of the class X.

We normalize the values of the elements of the matrix

X

relative to the selected scale with a resolution of Δ:

h_{i j} = ⌈x_{i j} / Δ⌉

,

i = \bar{1, m}

,

j = \bar{1, n}

,

\sum_{i = 1}^{m} \sum_{j = 1}^{n} h_{i j} \leq A \geq ⌈W / Δ⌉

. The step Δ is chosen to ensure sufficient variability of the resulting integer values of the stochastic elements of the matrix

H = (h_{i j})

,

i = \bar{1, m}

,

j = \bar{1, n}

.

Let us formalize the process of forming the values of the elements of the matrix H. Let us have A atomic units of the resource, which are distributed among m × n elements of the matrix H, and the probability of a resource unit falling into the element

h_{i j}

is characterized by the probability

p_{i j}

,

i = \bar{1, m}

,

j = \bar{1, n}

. The probability distribution of such a process is defined as

P (H) = A! \prod_{i = 1}^{m} \prod_{j = 1}^{n} \frac{p_{i j}^{h_{i j}}}{h_{i j}!} .

(5)

If the Moivre–Stirling approximation of factorials of large numbers is applied to the logarithmic representation of expression (5), we obtain an expression that characterizes the process described above based on the relative entropy:

E (H |P) = - \sum_{i = 1}^{m} \sum_{j = 1}^{n} h_{i j} \ln \frac{h_{i j}}{p_{i j}},

(6)

where

P = (p_{i j})

,

i = \bar{1, m}

,

j = \bar{1, n}

.

Taking into account the proposed physical interpretation of the process of the matrix H values formation, it is appropriate to introduce such a characteristic parameter as the resource units a priori distribution, i.e.,

V = (v_{i j} = p_{i j} A)

,

i = \bar{1, m}

,

j = \bar{1, n}

. Taking this parameter into account, expression (6) can be redefined as

E (H |V) ≜ - \sum_{i = 1}^{m} \sum_{j = 1}^{n} h_{i j} \ln \frac{h_{i j}}{v_{i j}} .

(7)

Equality (7) is defined with accuracy up to the constant AlnA. The essential connection between the sources of origin of the elements of the matrices X and H allows us to define the cross-entropy function as

E (X |V) ≜ - \sum_{i = 1}^{m} \sum_{j = 1}^{n} x_{i j} \ln \frac{x_{i j}}{v_{i j}} .

(8)

Based on expression (8), we write:

E (G |P) = - W \ln \frac{W}{A} - \sum_{i = 1}^{m} \sum_{j = 1}^{n} g_{i j} \frac{g_{i j}}{p_{i j}},

(9)

where

g_{i j} = x_{i j} / W \in [0, 1]

,

i = \bar{1, m}

, and

j = \bar{1, n}

, and the second term is the relative uncertainty characteristic of the stochastic matrix X.

Function (8) is concave for the entire range of values of the argument X and reaches a single extremum at the point

x_{i j}^{*} = v_{i j} / e

,

e = 2, 718

,

i = \bar{1, m}

,

j = \bar{1, n}

. The extreme value of function (8) is equal to

E_{\max} (x^{*} |V) = \frac{1}{e} \sum_{i = 1}^{m} \sum_{j = 1}^{n} v_{i j} .

(10)

The value (10) characterizes the maximum uncertainty of the matrix X for a defined matrix V. Let us emphasize other useful properties of function (8).

Let us define a matrix L with elements

l_{i j} (x_{i j}, v_{i j}) = \ln (x_{i j} / e v_{i j})

,

i = \bar{1, m}

, and

j = \bar{1, n}

. Considering

V = L (X, V)

, expression (8) can be rewritten as

E (X |V) = E (X, L (X, V)) = - \sum_{i = 1}^{m} \sum_{j = 1}^{n} x_{i j} l_{i j} (x_{i j}, v_{i j}) = Sp (X L^{T} (X, V)) = Sp (L (X, V) X^{T}),

(11)

where the symbols Sp and T represent the operations of trace finding and matrix transposition, respectively.

Based on the definition

l_{i j} (x_{i j}, v_{i j})

, we obtain the following inequality for the logarithmic function:

l_{i j} (x_{i j}, v_{i j}) \leq (x_{i j} - v_{i j}) / v_{\min}, i = \bar{1, m}, j = \bar{1, n},

(12)

where

v_{\min} = \min_{i, j} v_{i j}

.

Having transformed expression (11) and taking into account inequality (12), we determine the upper limit of cross entropy (8):

\hat{E} (X |V) = Sp (X X^{T}) - Sp (X V^{T}) .

(13)

Function (13) is concave and follows all the properties of function (8).

Consider a non-degenerate

(\det (V_{(n \times m)}^{T} V_{(m \times n)}) \neq 0)

matrix of empirical data

V_{(m, n)}

with positive elements. Let us set the desired dimension of the parametric space: r, r < n, and enter into the matrix

Q = (q_{i j} \geq 0)

,

i = \bar{1, n}

, and

j = \bar{1, r}

. We obtain a direct projection of the matrix

Q_{(n \times r)}

onto the parametric space R^mr:

Y_{(m \times r)} = V_{(m \times n)} Q_{(n \times r)}

. We obtain the inverse projection on the space R^mn using the matrix

S_{(r \times n)}

, and the values of all elements which are positive:

X_{(m \times n)} = V_{(m \times n)} Q_{(n \times r)} S_{(r \times n)}

. The dimensionality of both the obtained matrix X and the original matrix V is the same: (m × n).

Let us express the cross-entropy functional

E (X |V) = E (X_{(m \times n)} |V_{(m \times n)})

, taking into account the existence of the matrices

Q_{(n \times r)}

and

S_{(r \times n)}

:

E (X |V) = E (Q, S |V) = E (Q_{(n \times r)}, S_{(r \times n)} |V_{(m \times n)}) = - \sum_{i = 1}^{m} \sum_{j = 1}^{n} e_{i j} (Q_{(n \times r)}, S_{(r \times n)} |V_{(m \times n)}),

(14)

where

e_{i j} (Q_{(n \times r)}, S_{(r \times n)} |V_{(m \times n)}) = x_{i j} (Q_{(n \times r)}, S_{(r \times n)} |V_{(m \times n)}) \ln (x_{i j} (Q_{(n \times r)}, S_{(r \times n)} |V_{(m \times n)}) / v_{i j}),

x_{i j} (Q_{(n \times r)}, S_{(r \times n)} |V_{(m \times n)}) = \sum_{k = 1}^{r} \sum_{l = 1}^{n} s_{k j} q_{l k} v_{i l}, i = \bar{1, m}, j = \bar{1, n} .

The optimal configuration of the values of the positive matrices Q and S in the entropy basis is described by the expression

(Q^{*}, S^{*}) = \arg \max_{(Q, S) \geq 0} E (Q, S |V) .

(15)

We will search for the extremum of the objective function (15) by the iterative gradient projection method [36,37], taking into account the need to cut off elements with negative values (observing condition

(Q, S) \geq 0

).

Let us analytically express the partial derivatives of the function

E (Q, S |V)

in terms of the arguments, i.e., the elements of matrices Q and S:

\frac{\partial E (Q, S |V)}{\partial q_{k l}} = - \sum_{i = 1}^{m} \sum_{j = 1}^{n} \frac{\partial e_{i j} (Q, S |V)}{\partial x_{i j}} \frac{\partial x_{i j} (Q, S |V)}{\partial q_{k l}}

(16)

\frac{\partial E (Q, S |V)}{\partial s_{l h}} = - \sum_{i = 1}^{n} \sum_{j = 1}^{m} \frac{\partial e_{i j} (Q, S |V)}{\partial x_{i j}} \frac{\partial x_{i j} (Q, S |V)}{\partial s_{l h}}

(17)

where

\partial e_{i j} (Q, S |X) / \partial x_{i j} = \ln (x_{i j} / v_{i j}) + 1

,

\partial x_{i j} (Q, S |V) / \partial q_{k l} = s_{l j} v_{i k}

,

\partial x_{i j} (Q, S |V) / \partial s_{l h} = \sum_{k = 1}^{n} q_{k l} v_{i h}

,

i = \bar{1, m}

,

j = \bar{1, n}

,

k = \bar{1, n}

,

l = \bar{1, r}

, and

h = \bar{1, n}

.

Let us derive vectors

\vec{q}

and

\vec{s}

as a result vectorization of matrices Q and S, respectively. We identify the gradient vector of the relative entropy functional (14) with components (16)

\nabla_{Q} (\vec{q}, \vec{s})

. We identify the gradient vector

\nabla_{S} (\vec{q}, \vec{s})

of the relative entropy functional (14) with components (17). We initialize the iterative procedure for finding the extremum of the objective function (15) based on the gradient projection method and in terms of the introduced entities.

For the 0th iteration, we take

X^{(0)}

,

V^{(0)}

,

{\vec{q}}^{(0)} > 0

,

{\vec{s}}^{(0)} > 0

.

For the nth iteration, we write:

\begin{matrix} {\vec{q}}^{(n + 1)} = \{\begin{cases} {\vec{q}}^{(n)} + γ_{\vec{q}} \nabla_{Q} ({\vec{q}}^{(n)}, {\vec{s}}^{(n)}) \forall {\vec{q}}^{(n + 1)} \geq 0, \\ {\vec{q}}^{(n)} \forall {\vec{q}}^{(n + 1)} < 0, \end{cases} \\ {\vec{s}}^{(n + 1)} = \{\begin{cases} {\vec{s}}^{(n)} + γ_{\vec{s}} \nabla_{S} ({\vec{q}}^{(n)}, {\vec{s}}^{(n)}) \forall {\vec{s}}^{(n + 1)} \geq 0, \\ {\vec{s}}^{(n)} \forall {\vec{s}}^{(n + 1)} < 0, \end{cases} \\ {\vec{q}}^{(n + 1)} \Rightarrow Q^{(n + 1)}, {\vec{s}}^{(n + 1)} \Rightarrow S^{(n + 1)}, X^{(n + 1)} = Q^{(n + 1)} S^{(n + 1)} V, \\ E^{(n + 1)} = E (Q^{(n + 1)}, S^{(n + 1)} |V) = \sum_{i = 1}^{m} \sum_{j = 1}^{r} x_{i j}^{(n + 1)} \ln \frac{x_{i j}^{(n + 1)}}{v_{i j}}, \end{matrix}

(18)

where parameters

γ_{\vec{q}}

,

γ_{\vec{s}}

regulate increments in the corresponding dimension.

Iterative process (18) ends when the dynamics of the change in the value of the relative entropy functional becomes less than the threshold δ:

δ_{E} = E^{(n + 1)} - E^{(n)} = \frac{I (V) - I (Y (Q |V))}{I (V)} \leq δ,

(19)

where

I (V) = \sum_{i = 1}^{m} \sum_{j = 1}^{n} v_{i j} \ln v_{i j}

is the information capacity of the positive matrix

V_{(m \times n)}

. By analogy, we write:

I (Y (Q |V)) = \sum_{i = 1}^{m} \sum_{j = 1}^{r} y_{i j} (Q |V) \ln y_{i j} (Q |V)

, where

y_{i j} = \sum_{l = 1}^{n} v_{i l} q_{l j}

.

The computational complexity of the implementation of the iterative procedure just described increases nonlinearly with the increase in the dimension of the analyzed empirical matrices. Considering this circumstance, it is acceptable to define the elements of the matrix of the reduced dimension Q based on the approximately defined relative entropy functional

\tilde{E}

. For example, let us use the approximation of the logarithmic function at the point

x_{0} = w

:

\ln x < \ln w + (x - w) / w_{\min}

. For points

w = x_{i j}

we find:

E (Q, S |V) \approx \tilde{E} (Q, S |V) = \sum_{i = 1}^{n} \sum_{j = 1}^{m} (x_{i j}^{2} (Q, S |V) - x_{i j} (Q, S |V) v_{i j}) .

(20)

Let us present the expression (20) in the matrix form:

\tilde{E} (Q, S |V) = Sp (X X^{T}) - Sp (X V^{T}), = † (X (Q, S), X (Q, S)) - † (X (Q, S), V)

(21)

where the symbol

†

represents the Frobenius scalar product:

Sp (A B^{T}) = Sp (B A^{T}) = † (A, B) = † (B, A)

.

With a fixed matrix of empirical data V, we will minimize the functional

\tilde{E} (Q, S |V)

on the set of positive matrices Q and S:

({\tilde{Q}}^{*}, {\tilde{S}}^{*}) = \arg \min_{(Q, S) \geq 0} \tilde{E} (Q, S |V) .

(22)

The procedure for finding

({\tilde{Q}}^{*}, {\tilde{S}}^{*})

also uses components (16), and (17), which should be adapted to the scalar form of representation of the entities involved. Applying the rules of matrix differentiation to the functional (21), we obtain the following scalar interpretations of components (16), and (17):

Δ_{Q} (Q, S) = \frac{\partial \tilde{E} (Q, S |V)}{\partial X} \frac{\partial X}{\partial Q} = 2 S X Q X - S X,

(23)

Δ_{S} (Q, S) = \frac{\partial \tilde{E} (Q, S |V)}{\partial X} \frac{\partial X}{\partial S} = 2 Q^{T} X Q X - Q^{T} X,

(24)

where

X = X X^{T}

;

Δ_{Q} (Q, S)

and

Δ_{S} (Q, S)

are the gradients of matrices Q and S, respectively. The results of expressions (23), and (24) will be matrices of dimension (n × r).

We initialize the iterative procedure for finding the extremum of objective function (22) based on the gradient descent method and in terms of entities (23), and (24).

For the 0th iteration: we take X⁽⁰⁾, V⁽⁰⁾.

For the nth iteration, we write:

\begin{matrix} Q^{(n + 1)} = \{\begin{cases} Q^{(n)} + γ_{Q} Δ_{Q} \tilde{E} (Q^{(n)}, S^{(n)} |V) \geq 0 \forall Q^{(n + 1)} \geq 0, \\ Q^{(n)} \forall Q^{(n + 1)} < 0, \end{cases} \\ S^{(n + 1)} = \{\begin{cases} S^{(n)} + γ_{S} Δ_{S} \tilde{E} (Q^{(n)}, S^{(n)} |V) \geq 0 \forall S^{(n + 1)} \geq 0, \\ S^{(n)} \forall S^{(n + 1)} < 0, \end{cases} \\ X^{(n + 1)} = V Q^{(n + 1)} S^{(n + 1)}, E^{(n + 1)} = \sum_{i = 1}^{m} \sum_{j = 1}^{r} x_{i j}^{(n + 1)} \ln \frac{x_{i j}^{(n + 1)}}{v_{i j}} . \end{matrix}

(25)

The iterative process (25) ends when the dynamics of the change in the value of the functional

\tilde{E} (Q, S |V)

becomes less than the set threshold δ:

E^{(n + 1)} - E^{(n)} \leq δ

.

In [35], the authors described the basic concept of solving the d- and c-problems mentioned in Section 2.1 for empirical data of the type V and Y, respectively. The result is the optimal probability distribution densities of characteristic parameters and interference (for the d-problem:

P^{*} (w)

,

L^{*} (ε)

, and for the c-problem:

A^{*} (a)

,

Z^{*} (ξ)

, respectively). The mathematical apparatus presented in Section 2.2 allows, based on linear models (2), and (4), to calculate normalized

U \cap S

probability distributions

F_{d} (\vec{u})

and

F_{c} (\vec{s})

to determine the absolute difference between these functions in terms of relative entropy [38,39,40,41].

To preserve the integrity of the presentation of the material, we will demonstrate how the basic concept of solving the d-problem is implemented in the context of model (2). Let’s define the functional

E (P (w), L (ε))

on the probability distribution densities

P^{*} (w)

and

L^{*} (ε)

. We need to solve the optimization problem with the following objective function and constraints:

\begin{array}{l} E (P (w), L (ε)) = - \int_{W} P (w) \ln P (w) d w - \int_{E} L (ε) \ln L (ε) d ε \to \max \\ \int_{W} P (w) d w = 1, \int_{E} L (ε) d ε = 1, M \{z\} = \int_{W} V w P (w) d w + \int_{E} ε L (ε) d ε = y . \end{array}

(26)

The solution to the optimization problem (26) in analytical form looks like

P^{*} (w) = \exp (- θ, V w) / \int_{W} \exp (- θ, V w) d w, L^{*} (ε) = \exp (- θ, ε) / \int_{Β} \exp (- θ, ε) d ε,

where the Lagrange multipliers θ are determined as a result of solving the system of balance equations

M \{z\}

in the interpretation

\int_{W} V w P^{*} (w) d w + \int_{E} ε L^{*} (ε) d ε = y

.

In the context of the model (2), the probability distribution density F(u) of the observation vector u is defined as

F (u) = \int_{E} Π (u - ε) L^{*} (ε) d ε = F_{d} (u),

(27)

where F_d(u) is the desired probability distribution density of the d-problem model, and

Π (u - ε)

is the density of the stochastic vector

u - ε

. From expression (27) we find

w = V^{T} z / V^{T} V

.

Considering the interval nature of the vector z:

z \in Z = [z^{-} = V w^{-}, z^{+} = V w^{+}]

, we write

η (z) = P^{*} (V^{T} z / V^{T} V)

. Having normalized the function

η (z)

, we express the probability distribution density of the vector z as

Π (z) = η (z) / \int_{Z} η (z) d z

.

To determine the probability distribution density

F_{c} (s)

in the context of the model (4) (c-problem), it is necessary to repeat the sequence of actions embodied in expression (27) based on the empirical data matrix Y.

To compare the functions

F_{d} (u)

and

F_{c} (s)

, it is necessary to normalize them on the common carrier

Λ = U \cap S

:

{\tilde{F}}_{d} (λ) = F_{d} (λ) / \int_{Λ} F_{d} (λ) d λ, {\tilde{F}}_{c} (λ) = F_{c} (λ) / \int_{Λ} F_{c} (λ) d λ .

(28)

To find the absolute share of information losses between functions

{\tilde{F}}_{d} (λ)

and

{\tilde{F}}_{c} (λ)

due to compaction Δ_E we define in terms of the relative entropy of RE as

R E ({\tilde{F}}_{d}, {\tilde{F}}_{c}) = \int_{Λ} {\tilde{F}}_{c} (λ) \ln ({\tilde{F}}_{c} (λ) / {\tilde{F}}_{d} (λ)),

Δ_{E} = \frac{1}{2} R E ({\tilde{F}}_{d}, {\tilde{F}}_{c}) + R E ({\tilde{F}}_{d}, {\tilde{F}}_{c}) .

(29)

Note that the minimum Δ_E = 0 is reached at

{\tilde{F}}_{d} (λ) = {\tilde{F}}_{c} (λ)

.

3. Results

Let us begin the experimental Section by demonstrating the functionality of the mathematical apparatus proposed in Section 2.2 on a simple abstract example.

Suppose we have initial empirical data of the form

V_{(m = 2 \times n = 2)} = (\begin{matrix} 0, 100 & 0, 800 \\ 0, 800 & 1, 000 \end{matrix})

. In the context of model (2), we write

u = V w + ε

. Suppose that

w \in W = [0, 000; 5, 000]

,

ε \in E = [- 0, 500; 0, 500]

. The output component is defined by the vector

y = (0, 600; 1, 400)

.

Let

r = 1

, then

Y_{(2 \times 1)} = V_{(2 \times 2)} Q_{(2 \times 1)}

, where

Q_{(2 \times 1)} = (\begin{matrix} q_{11} \\ q_{21} \end{matrix})

is the matrix for the direct projection. The compactification model (4) for the above values and conditions looks like this

s = Y a + ξ

, where

a \in a = [0, 000; 5, 000]

,

ξ \in Ξ = [- 0, 500; 0, 500]

. The inverse projection operation is analytically characterized as

X_{(2 \times 2)} = V_{(2 \times 2)} Q_{(2 \times 1)} S_{(1 \times 2)}

, where

S_{(1 \times 2)} = (\begin{matrix} s_{11} & s_{12} \end{matrix})

is the matrix for the inverse projection.

Our example is characterized by a small dimension, so we will use procedure (18) to determine the cross entropy. In this context, the cross entropy E between the original empirical matrix

V_{(2 \times 2)}

and the matrix

X_{(2 \times 2)}

obtained as a result of direct-inverse projection will be analytically determined by the expression

E = - \sum_{i = 1}^{2} \sum_{j = 1}^{2} x_{i j} \ln \frac{x_{i j}}{v_{i j}}

. The function

E (Q, S)

reaches an extremum at

Q_{\max}^{*} = (\begin{matrix} 0, 356 \\ 0, 768 \end{matrix})

,

S_{\max}^{*} = (\begin{matrix} 0, 257 & 0, 559 \end{matrix})

. Accordingly, the optimal compactified matrix

Y

has the form

Y^{*} = (\begin{matrix} 0, 356 \\ 0, 768 \end{matrix})

.

The optimal probability distribution densities of the characteristic parameters

w

and interference

ε

for the matrix V defined at the beginning of the Section are characterized by the functions

P^{*} (w) = 1, 221 \exp (- 0.888 w_{1} - 1, 419 w_{2})

and

L^{*} (ε) = 0, 982 \exp (- 0, 642 ε_{1} - 0, 136 ε_{2})

. To compare the functions

F_{d} (u)

and

F_{c} (s)

, it is necessary to normalize them on the common carrier, so, using (28), we find

0 \leq λ_{1} \leq 1, 778

,

0 \leq λ_{2} \leq 3, 842

. Then, with the defined functions (2), (4), and

P^{*} (w)

, the absolute share of information losses (29) of reducing the dimensionality of the space of characteristic parameters from n = 2 to r = 1

(V_{(2 \times 2)} \to Y_{(2 \times 1)}^{*})

is equal to

Δ_{E} = 0, 245

, which allows us to consider the result of the proposed compactification procedure of the original empirical matrix V as adequate.

To prove the effectiveness of the proposed compactification method (18) (Met3), the method should be compared with popular analogues, namely, with the principal component analysis method (Met1) and the random projection method (Met2). Considering the linear nature of functions (2) and (4), we will experiment in the context of solving the verification problem (dichotomous classification) with a linear classifier. Let’s formulate such a problem based on the terminology used.

We define the linear classifier model as

z (s_{k}) = sign (\sum_{i = 1}^{n} w_{i} v_{i} (s_{k})) = \{\begin{cases} + 1 \forall \sum_{i = 1}^{n} w_{i} v_{i} (s_{k}) \geq 0, \\ - 1 \forall \sum_{i = 1}^{n} w_{i} v_{i} (s_{k}) < 0, \end{cases}

(30)

where

k \in \{1, m\}

and the values of the weights

w \in R^{n}

are unknown a priori.

Empirical data with the structure

〈V_{(m \times n)}, y_{(m \times 1)}〉

are available, and

y_{k} = \{\begin{cases} + 1 \forall z (t_{k}) = + 1, \\ - 1 \forall z (t_{k}) = - 1, \end{cases}

where

t_{k}

is an instance of a class

〈V, y〉

with a number

k \in \{1, m\}

. The training of the classifier (30) is reduced to the minimization of the empirical risk function of the form

R (w) = \sum_{i = 1}^{m} {‖y - z (w |V)‖}^{2}

. To test the trained classifier (30), test empirical data with the structure

〈U_{(l \times n)}, x_{(l \times 1)}〉

were used.

The results of the classification

b (t_{k}) = s i g n (\sum_{i = 1}^{n} {\hat{w}}_{i} u_{i} (t_{k})) = \{- 1, 1\} \forall k = \bar{1, l}

are synchronously compared with the corresponding elements of the vector

x

and taken into account in the form of the value of the function

I = \sum_{k - 1}^{l} Δ (t_{k})

, where

Δ (t_{k}) = \{\begin{cases} 1 \forall b (t_{k}) = x (t_{k}), \\ 0 \forall b (t_{k}) \neq x (t_{k}) . \end{cases}

. Accordingly, classification accuracy is defined as

α = I / l

.

The conducted experiment consisted of solving the verification problem using classifier (30) for:

e0—basic empirical dataset

〈V_{(m \times n)}, y_{(m \times 1)}〉 + 〈U_{(l \times n)}, x_{(l \times 1)}〉

;

{e1, e2, e3}—the dataset

〈V, y〉 + 〈U, x〉

, the dimension of the attributes of the matrices V and U which underwent compactification from the initial n to the specified r elements by the method {Met1, Met2, Met3}.

The value

r

was iteratively reduced:

r = n - 1, n - 2, \dots, 1

, forming a set of datasets at each of the stages {e1, e2} with the corresponding compactification degree. The number of compactification procedures e3 was determined by the set of threshold values (19).

For experiments, as necessary, tables of synthetic data of the required size were generated. For this, the sklearn.datasets.make_classification(n_class = 2, n_clusters = 2, n_redundant = 0, class_sep = 1.0, n_informative = {10, 15}) function of the Python programming language was used. Before use, all generated data were normalized to fall within the unit interval [0, 1]. The experiments were carried out using scipy.stats.bootstrap cross-validation.

The algorithmic designs of the {Met1, Met2, Met3} methods were implemented by the functions of the scipy and sklearn libraries. The classifier (30) was implemented as a support vector machine with a linear kernel using the function sklearn.svm.SVC. The Met1 Met2 methods were implemented using the sklearn.decomposition.PCA and sklearn.random_projection.GaussianRandomProjection functions, respectively. The basis for the implementation of the author’s method (18) was the scipy.optimize.minimize function (after inverting the objective function (15)). At the same time, the attribute ftol was considered to be related to the threshold (19).

As already mentioned in Section 2.2, the author’s empirical data compactification method proposed in the form of procedure (18) is comparatively computationally complex (this is what prompted the authors to formalize the “simplified” iterative procedure (25)). However, Met1, Met2 analogues have their disadvantages, which appear when compacting large-dimensional data. For example, with a sufficiently large number of instances of data m and their heterogeneity, Met1 becomes unstable. We will conduct the first experiment of the form α = f(m, Met, r) for

m = \bar{5, 10}

,

n = 10

,

r = \bar{10, 5}

, Met = {Met1, Met2, Met3}. The obtained results are visualized in Figure 1.

The previous experiment characterized the ultra-compact empirical data compactification procedure:

m \approx n

,

n / 2 \leq r \leq n

. Now, let us investigate how the verification accuracy α depends on the compactification of the initial data, for which

m > > r

,

m > n

. The experiments were carried out for two generated datasets DS1 and DS2. The first was characterized by dimension (m = 100, n = 10) and the second by dimension (10⁴, 10²). When processing the first dataset, we set r = {100, 90,…, 50} When working with the second dataset, we set r = {100, 90,…, 50} The obtained results are presented in Figure 2.

The following experiment is specific to Met3 because it concerns the detection of the dependence between the verification accuracy α and the dynamics of such parameters as the compactification degree r and the value of the threshold

δ = \{0, 5; 0, 4; \dots; 0, 1\}

(see expression (19)). To preserve the common information background, the remaining parameters were borrowed from the previous experiment without changes, namely:

D S = \{D S 1, D S 2\}

, r(DS1) = {10, 9,…, 5}, and r(DS2) = {100, 90,…, 50}. The resulting dependencies are visualized in Figure 3.

The empirical data compactification process is accompanied by an information loss. The absolute error as an indicator of information loss during compactification can be calculated by expression (29). The relative share of information losses during compactification can be calculated directly by expression (19) when implementing the compactification procedure (18). Figure 4 presents the calculated dependences of the relative share of information loss

δ_{E}

on the compactification method

M e t = \{M e t 1, M e t 2, M e t 3\}

for datasets

\{D S 1, D S 2\}

with the corresponding ranges of changes in the compaction degree

r

.

Finally, we will conclude the Experimental Section with a study of Met3, the detection of the dependence between the relative share of information loss

δ_{E}

, and the dynamics of such parameters as the compactification degree

r

and the threshold value

δ = \{0, 5; 0, 4; \dots; 0, 1\}

(see expression (19)). To ensure a holistic perception of the material of the Section, the remaining parameters were borrowed from the previous experiment. The resulting dependencies are shown in Figure 5.

4. Discussion

The research subject was chosen to reveal the characteristic features of the research object. This axiom works in all areas of science. Data analysis is no exception. There can be a huge, large, or small amount of data. The case with a small amount of data may be complicated by the fact that the source of the data, the process of its collection, or both, may not be under the researchers’ control. In this case, data scientists will have to work with small stochastic data. The mathematical apparatus presented in Section 2.2 is focused on the problem of analyzing such data. Objective functions (15), and (22) implement the principle of maximum entropy formulated by Willard Gibbs in the context of compactification of (small) stochastic empirical data. Gibbs’ work says that the most characteristic probability distributions of the states of an uncertain object are distributions that maximize the chosen measure of uncertainty, taking into account the available reliable information about the investigated object. The effectiveness of this approach is demonstrated by the results presented in Figure 1. Recall that, in this experiment, the compactification of extremely small data was carried out (the number of instances

m

in the data collection approached the number of attributes n). From Figure 1a,b, it can be seen that both the principal component analysis method (Met1) and the random projection method (Met2) demonstrated cases of non-functionality in situations when m < r, where r was the desired number of attributes in the compactified collection. The author’s method (Met3) remained functional under any requirements determined by the experiment.

As shown in Figure 1, the results characterized the small empirical data compactification process: m ≈ n,

n / 2 \leq r \leq n

, then the results presented in Figure 2 show how the verification accuracy α depends on the compactification of the initial data, for which

m > n

(a sufficient amount of empirical data, Figure 2a) or

m > > r

(“big” empirical data, Figure 2b). From Figure 2a, it can be seen that

r \leq 7

of function

α (M e t 3)

shows a monotonic linear character, in contrast to functions

α (M e t 1)

and

α (M e t 2)

. This circumstance indicates that it was the author’s method that made it possible to find the optimal configuration of the characteristic parameters space. Instead, the change of

r

in all functions

α (M e t)

from Figure 2b is characterized by a non-linear character. It can also be seen that, with

r \leq 60

, it is the author’s compactification method

M e t 3

that generates the least informative parametric space in comparison with analogues. This fact can be explained by the fact that optimization method (18) does not have time to come close to the optimal distribution ensemble for the maximum number of iterations set of the algorithm (attribute

m a x i t e r = 1000

for the function

s c i p y . o p t i m i z e . m i n i m i z e

). The way out in such a situation can be the application of the approximate version of algorithm (18), represented by expressions (25).

Figure 3 demonstrates the dependence of the verification accuracy α on the dynamics of such parameters as the compactification degree r and the threshold value

δ = \{0, 5; 0, 4; \dots; 0, 1\}

(see expression (19)) of the completion of the iterative procedure (18). Let us notice that threshold

δ

is also a parameter that determines the maximum allowable reduction of the information capacity for the compactification data matrix. The usefulness of parameter

δ

lies in the fact that, based on its value, we can choose the permissible compaction degree

r

, not empirically (as, for example, in

M e t 1

) but analytically; if, after reducing the dimensionality of the dimension of the characteristic parameters to the value

r^{(n)}

, the estimate

δ_{E}

has decreased too much, then the compactification process should be stopped and the algorithm should be rolled back to the previous value of

r^{(n - 1)}

. This is exactly the behaviour we observe in Figure 3a. Instead, as shown in Figure 3b, the situation is not stable. The probable explanation for this is similar to the one we mentioned regarding Figure 2b.

Figure 4 presents the calculated dependences of the relative share of information loss

δ_{E}

on the compactification method

M e t = \{M e t 1, M e t 2, M e t 3\}

for datasets

\{D S 1, D S 2\}

with the corresponding ranges of changes in the compactification degree r. It can be seen that it is the function

δ_{E} = f (r, M e t 3)

with the growth r that grows significantly more slowly, surpassing competitors by almost two times. Note that this advantage was observed both for the “large” dataset DS1 and for the “Big” dataset DS2.

Figure 5 shows the relationship formalized by expression (19) between the relative share of information loss

δ_{E}

and the dynamics of such parameters as the compactification degree r and the threshold value

δ

. It is interesting that, for the dataset DS1 (Figure 5a), all values of r the condition

δ_{E} \leq δ

are fulfilled, that is, algorithm (18) managed to find optimal distributions without exceeding the set limit on the permissible number of iterations. On the other hand, the circumstances were different for the “Big” dataset DS2. This can explain the unstable nature of the values presented in Figure 5b.

In general, the results presented in Section 3 prove both the functionality and the effectiveness of the mathematical apparatus presented in Section 2 in comparison with classical analogues, namely, the principal component analysis method and the random projection method. The obvious advantage of the author’s method is the demonstrated stability of the small stochastic data compactification process and the possibility of analytical control of the loss of information capacity of the compactification data matrix. On the other hand, the disadvantage of the author’s method is the computational complexity, which is especially evident when processing large data matrices. However, to mitigate this limitation, the authors propose an approximating simplified version (25) of the basic compactification procedure (18).

To implement the cross-entropy version of the author’s compactification method, the method of conditional optimization on a non-negative orthant (CONNO) is adapted, and implemented in the scipy library. We note that, for some combinations of input data, the basic version of the CONNO method does not find a solution for the given optimization parameters. To test this concept, a series of experiments were adopted. The first series of experiments focused on identifying the dependence of classification accuracy on the number of objects (i.e., sample size). The study of this dependence for three compactification methods (PCA, RP, and author’s) is important to identify areas of their application. It is known that entropy maximization methods and their derivatives, in particular the author’s method, are usually used when the amount of data is limited compared to the dimension of the feature space. With “Big Data,” there are no fundamental restrictions on their use, but computational difficulties increase significantly. The next series of experiments was focused on identifying the dependence of classification accuracy in conditions where the number of measurements significantly exceeds the number of characteristic parameters. The next series of experiments was focused on identifying the dependence of classification accuracy for the author’s method on the acceptable reduction in the information capacity of the dataset. The next series of experiments was focused on assessing information losses from compactification implemented using and for the author’s method. The experiments described above have already been carried out and results that positively characterize the author’s method have been obtained. The problem is that, in its final form, the description, results obtained, and discussion are already more than 10 pages long. Increasing the size of this (already massive) article does not seem practical; therefore, if the mentioned experimental results interest you, dear reader, then I ask you to contact the corresponding author and he will be happy to share with you the results mentioned above.

5. Conclusions

Measurement is a typical way of gathering information about the investigated object, generalized by a finite set of characteristic parameters. The result of each iteration of the measurement is an instance of the class of the investigated object in the form of a set of values of characteristic parameters. An ordered set of instances forms a collection whose dimensionality for a real object is a factor that cannot be ignored. Managing the dimensionality of data collection, as well as classification, regression, and clustering, are fundamental problems of machine learning.

Compactification is the approximation of the original data collection by an equivalent collection (with a reduced dimension of characteristic parameters) with the control of accompanying information capacity losses. Related to compactification is the data completeness verifying procedure, which is characteristic of the data reliability assessment. If there are stochastic parameters among the initial data collection characteristic parameters, the compactification procedure becomes more complicated. To take this into account, the research proposes a model of a structured collection of stochastic data defined in terms of relative entropy. The compactification of such a data model is formalized by an iterative procedure aimed at maximizing the relative entropy of sequential implementation of direct and reverse projections of data collections, taking into account the estimates of the probability distribution densities of their attributes. The procedure for approximating the relative entropy function of compactification to reduce the computational complexity of the latter is proposed. For a qualitative assessment of compactification, the metric of such indicators as the data collection information capacity, and the absolute and relative share of information losses due to compaction, are analytically formalized. Taking into account the semantic connection of compactification and completeness, the proposed metric is also relevant for the data reliability assessment task. Testing the proposed compactification procedure proved both its stability and efficiency in comparison with such used analogues as the principal component analysis method and the random projection method.

Further research is planned to attempt to simplify the procedure for finding entropy-optimal matrix projectors while observing the limit on permissible information losses from compactification.

Author Contributions

Conceptualization, V.K.; methodology, V.K.; software, V.K.; validation, E.Z., V.L., K.G. and O.K.; formal analysis, V.K.; investigation, V.K.; resources, E.Z., V.L., K.G. and O.K.; data curation, E.Z., V.L., K.G. and O.K.; writing—original draft preparation, V.K.; writing—review and editing, V.K., E.Z., V.L., K.G. and O.K.; visualization, V.K.; supervision, V.K.; project administration, V.K.; funding acquisition, V.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research is part of the project No. 2022/45/P/ST7/03450 co-funded by the National Science Centre and the European Union Framework Programme for Research and Innovation Horizon 2020 under the Marie Skłodowska-Curie grant agreement No. 945339. For the purpose of Open Access, the author has applied a CC-BY public copyright licence to any Author Accepted Manuscript (AAM) version arising from this submission.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Most data are contained within the article. All the data are available on request due to restrictions, e.g., privacy or ethics.

Acknowledgments

The authors are grateful to all colleagues and institutions that contributed to the research and made it possible to publish its results.

Conflicts of Interest

The authors declare no conflict of interest.

Correction Statement

This article has been republished with a minor correction to the Funding statement. This change does not affect the scientific content of the article.

References

Biswas, P.; Dandapat, S.K.; Sairam, A.S. Ripple: An approach to locate k nearest neighbours for location-based services. Inf. Syst. 2022, 105, 101933. [Google Scholar] [CrossRef]
Bansal, M.; Goyal, A.; Choudhary, A. A comparative analysis of K-Nearest Neighbor, Genetic, Support Vector Machine, Decision Tree, and Long Short Term Memory algorithms in machine learning. Decis. Anal. J. 2022, 3, 100071. [Google Scholar] [CrossRef]
Izonin, I.; Tkachenko, R.; Dronyuk, I.; Tkachenko, P.; Gregus, M.; Rashkevych, M. Predictive modeling based on small data in clinical medicine: RBF-based additive input-doubling method. Math. Biosci. Eng. 2021, 18, 2599–2613. [Google Scholar] [CrossRef] [PubMed]
Izonin, I.; Tkachenko, R.; Shakhovska, N.; Lotoshynska, N. The Additive Input-Doubling Method Based on the SVR with Nonlinear Kernels: Small Data Approach. Symmetry 2021, 13, 612. [Google Scholar] [CrossRef]
Kamm, S.; Veekati, S.S.; Müller, T.; Jazdi, N.; Weyrich, M. A survey on machine learning based analysis of heterogeneous data in industrial automation. Comput. Ind. 2023, 149, 103930. [Google Scholar] [CrossRef]
Tymchenko, O.; Havrysh, B.; Tymchenko, O.O.; Khamula, O.; Kovalskyi, B.; Havrysh, K. Person Voice Recognition Methods. In Proceedings of the 2020 IEEE Third International Conference on Data Stream Mining & Processing (DSMP), Lviv, Ukraine, 21–25 August 2020; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar] [CrossRef]
Bisikalo, O.; Kovtun, O.; Kovtun, V.; Vysotska, V. Research of Pareto-Optimal Schemes of Control of Availability of the Information System for Critical Use. In Proceedings of the 2020 1st International Workshop on Intelligent Information Technologies & Systems of Information Security (IntelITSIS), Khmelnytskyi, Ukraine, 10–12 June 2020; CEUR-WS. Volume 2623, pp. 174–193. [Google Scholar]
Bisikalo, O.V.; Kovtun, V.V.; Kovtun, O.V.; Danylchuk, O.M. Mathematical Modeling of the Availability of the Information System for Critical Use to Optimize Control of its Communication Capabilities. Int. J. Sens. Wirel. Commun. Control. 2021, 11, 505–517. [Google Scholar] [CrossRef]
Bisikalo, O.; Danylchuk, O.; Kovtun, V.; Kovtun, O.; Nikitenko, O.; Vysotska, V. Modeling of Operation of Information System for Critical Use in the Conditions of Influence of a Complex Certain Negative Factor. Int. J. Control. Autom. Syst. 2022, 20, 1904–1913. [Google Scholar] [CrossRef]
Bisikalo, O.; Bogach, I.; Sholota, V. The Method of Modelling the Mechanism of Random Access Memory of System for Natural Language Processing. In Proceedings of the 2020 IEEE 15th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET), Lviv-Slavske, Ukraine, NJ, USA, 25–29 February 2020; IEEE: Piscataway, NJ, USA. [Google Scholar] [CrossRef]
Mochurad, L.; Horun, P. Improvement Technologies for Data Imputation in Bioinformatics. Technologies 2023, 11, 154. [Google Scholar] [CrossRef]
Stankevich, S.; Kozlova, A.; Zaitseva, E.; Levashenko, V. Multivariate Risk Assessment of Land Degradation by Remotely Sensed Data. In Proceedings of the 2023 International Conference on Information and Digital Technologies (IDT), Zilina, Slovakia, 20–22 June 2023. [Google Scholar] [CrossRef]
Kharchenko, V.; Illiashenko, O.; Fesenko, H.; Babeshko, I. AI Cybersecurity Assurance for Autonomous Transport Systems: Scenario, Model, and IMECA-Based Analysis. In Communications in Computer and Information Science; Springer: Berlin/Heidelberg, Germany, 2022; pp. 66–79. [Google Scholar] [CrossRef]
Izonin, I.; Tkachenko, R.; Krak, I.; Berezsky, O.; Shevchuk, I.; Shandilya, S.K. A cascade ensemble-learning model for the deployment at the edge: Case on missing IoT data recovery in environmental monitoring systems. Front. Environ. Sci. 2023, 11, 1295526. [Google Scholar] [CrossRef]
Auzinger, W.; Obelovska, K.; Dronyuk, I.; Pelekh, K.; Stolyarchuk, R. A Continuous Model for States in CSMA/CA-Based Wireless Local Networks Derived from State Transition Diagrams. In Proceedings of International Conference on Data Science and Applications; Springer: Singapore, 2021; pp. 571–579. [Google Scholar] [CrossRef]
Deng, P.; Li, T.; Wang, D.; Wang, H.; Peng, H.; Horng, S.-J. Multi-view clustering guided by unconstrained non-negative matrix factorization. Knowl.-Based Syst. 2023, 266, 110425. [Google Scholar] [CrossRef]
De Handschutter, P.; Gillis, N.; Siebert, X. A survey on deep matrix factorizations. Comput. Sci. Rev. 2021, 42, 100423. [Google Scholar] [CrossRef]
De Clercq, M.; Stock, M.; De Baets, B.; Waegeman, W. Data-driven recipe completion using machine learning methods. Trends Food Sci. Technol. 2016, 49, 1–13. [Google Scholar] [CrossRef]
Shu, L.; Lu, F.; Chen, Y. Robust forecasting with scaled independent component analysis. Finance Res. Lett. 2023, 51, 103399. [Google Scholar] [CrossRef]
Moneta, A.; Pallante, G. Identification of Structural VAR Models via Independent Component Analysis: A Performance Evaluation Study. J. Econ. Dyn. Control. 2022, 144, 104530. [Google Scholar] [CrossRef]
Zhang, R.; Dai, H. Independent component analysis-based arbitrary polynomial chaos method for stochastic analysis of structures under limited observations. Mech. Syst. Signal Process. 2022, 173, 109026. [Google Scholar] [CrossRef]
HLi, H.; Yin, S. Single-pass randomized algorithms for LU decomposition. Linear Algebra its Appl. 2020, 595, 101–122. [Google Scholar] [CrossRef]
Iwao, S. Free fermions and Schur expansions of multi-Schur functions. J. Comb. Theory Ser. A 2023, 198, 105767. [Google Scholar] [CrossRef]
Terao, T.; Ozaki, K.; Ogita, T. LU-Cholesky QR algorithms for thin QR decomposition. Parallel Comput. 2020, 92, 102571. [Google Scholar] [CrossRef]
Trendafilov, N.; Hirose, K. Exploratory factor analysis. In International Encyclopedia of Education, 4th ed.; Elsevier: Amsterdam, The Netherlands, 2023; pp. 600–606. [Google Scholar] [CrossRef]
Fu, Z.; Xi, Q.; Gu, Y.; Li, J.; Qu, W.; Sun, L.; Wei, X.; Wang, F.; Lin, J.; Li, W.; et al. Singular boundary method: A review and computer implementation aspects. Eng. Anal. Bound. Elements 2023, 147, 231–266. [Google Scholar] [CrossRef]
Roy, A.; Chakraborty, S. Support vector machine in structural reliability analysis: A review. Reliab. Eng. Syst. Saf. 2023, 233, 109126. [Google Scholar] [CrossRef]
Çomak, E.; Arslan, A. A new training method for support vector machines: Clustering k-NN support vector machines. Expert Syst. Appl. 2008, 35, 564–568. [Google Scholar] [CrossRef]
Chen, H.L.; Yang, B.; Wang, S.J.; Wang, G.; Liu, D.Y.; Li, H.Z.; Liu, W.B. Towards an optimal support vector machine classifier using a parallel particle swarm optimization strategy. Appl. Math. Comput. 2014, 239, 180–197. [Google Scholar] [CrossRef]
Pineda, S.; Morales, J.M.; Wogrin, S. Mathematical programming for power systems. In Encyclopedia of Electrical and Electronic Power Engineering; Elsevier: Amsterdam, The Netherlands, 2023; pp. 722–733. [Google Scholar] [CrossRef]
Li, P.; Pei, Y.; Li, J. A comprehensive survey on design and application of autoencoder in deep learning. Appl. Soft Comput. 2023, 138, 110176. [Google Scholar] [CrossRef]
Mishra, D.; Singh, S.K.; Singh, R.K. Deep Architectures for Image Compression: A Critical Review. Signal Process. 2022, 191, 108346. [Google Scholar] [CrossRef]
Zheng, J.; Qu, H.; Li, Z.; Li, L.; Tang, X. A deep hypersphere approach to high-dimensional anomaly detection. Appl. Soft Comput. 2022, 125, 109146. [Google Scholar] [CrossRef]
Costa, M.C.; Macedo, P.; Cruz, J.P. Neagging: An Aggregation Procedure Based on Normalized Entropy. In Proceedings of the International Conference Of Numerical Analysis And Applied Mathematics ICNAAM 2020, Crete, Greece, 19–25 September 2022. [Google Scholar] [CrossRef]
Bisikalo, O.; Kharchenko, V.; Kovtun, V.; Krak, I.; Pavlov, S. Parameterization of the Stochastic Model for Evaluating Variable Small Data in the Shannon Entropy Basis. Entropy 2023, 25, 184. [Google Scholar] [CrossRef]
Zeng, Z.; Ma, F. An efficient gradient projection method for structural topology optimization. Adv. Eng. Softw. 2020, 149, 102863. [Google Scholar] [CrossRef]
El Masri, M.; Morio, J.; Simatos, F. Improvement of the cross-entropy method in high dimension for failure probability estimation through a one-dimensional projection without gradient estimation. Reliab. Eng. Syst. Saf. 2021, 216, 107991. [Google Scholar] [CrossRef]
Liu, B.; Chai, Y.; Huang, C.; Fang, X.; Tang, Q.; Wang, Y. Industrial process monitoring based on optimal active relative entropy components. Measurement 2022, 197, 111160. [Google Scholar] [CrossRef]
Fujii, M.; Seo, Y. Matrix trace inequalities related to the Tsallis relative entropies of real order. J. Math. Anal. Appl. 2021, 498, 124877. [Google Scholar] [CrossRef]
Makarichev, V.; Kharchenko, V. Application of Dynamic Programming Approach to Computation of Atomic Functions. In Radioelectronic and Computer Systems; no. 4; National Aerospace University-Kharkiv Aviation Institute: Kharkiv, Ukraine, 2021; pp. 36–45. [Google Scholar] [CrossRef]
Dotsenko, S.; Illiashenko, O.; Kharchenko, V.; Morozova, O. Integrated Information Model of an Enterprise and Cybersecurity Management System. Int. J. Cyber Warf. Terror. 2022, 12, 1–21. [Google Scholar] [CrossRef]

Figure 1. (a) Dependence α = f(m, Met1, r),

m = \bar{5, 10}

,

r = \bar{10, 5}

. (b) Dependence α = f(m, Met2, r),

m = \bar{5, 10}

,

r = \bar{10, 5}

. (c) Dependence α = f(m, Met3, r),

m = \bar{5, 10}

,

r = \bar{10, 5}

.

Figure 1. (a) Dependence α = f(m, Met1, r),

m = \bar{5, 10}

,

r = \bar{10, 5}

. (b) Dependence α = f(m, Met2, r),

m = \bar{5, 10}

,

r = \bar{10, 5}

. (c) Dependence α = f(m, Met3, r),

m = \bar{5, 10}

,

r = \bar{10, 5}

.

Figure 2. (a) Dependence α = f(Met, r, DS1),

r = \bar{10, 5}

, Met = {Met1, Met2, Met3}. (b) Dependence α = f(Met, r, DS2), r = {100, 90,…, 50}, Met = {Met1, Met2, Met3}.

Figure 2. (a) Dependence α = f(Met, r, DS1),

r = \bar{10, 5}

, Met = {Met1, Met2, Met3}. (b) Dependence α = f(Met, r, DS2), r = {100, 90,…, 50}, Met = {Met1, Met2, Met3}.

Figure 3. (a) Dependence α = f(r, δ) for DS1 dataset. (b) Dependence α = f(r, δ) for DS2 dataset.

Figure 4. (a) Dependence

δ_{E} = f (r, M e t)

for

D S 1

dataset. (b) Dependence

δ_{E} = f (r, M e t)

on

D S 2

dataset.

Figure 4. (a) Dependence

δ_{E} = f (r, M e t)

for

D S 1

dataset. (b) Dependence

δ_{E} = f (r, M e t)

on

D S 2

dataset.

Figure 5. (a) Dependence δ_E = f(r, δ) for Met3 and DS1 dataset. (b) Dependence δ_E = f(r, δ) for Met3 and DS2 dataset.

Table 1. General comparison of the concepts of machine and deep learning.

Criterion	Machine Learning	Deep Learning
The number of data points	One can use small amounts of data to create forecasts	It is necessary to use large volumes of training data to create forecasts
Dependence on equipment	It can work on low-power computers. Large computing power is not required	Depends on high-performance computers. At the same time, the computer performs a large number of operations on the matrix. The graphic processor can effectively optimize these operations
The process of constructing features	Requires an accurate determination of the signs and their creation by users	Recognizes high levels based on data and independently creates new signs
Claim to training	The training process is divided into small steps. Then, the results of each step are combined into a single output block	The problem is solved by the method of thorough analysis
Training time	Training takes relatively little time, from a few seconds to several hours	As a rule, the training process takes a long time since the deep learning algorithm includes many levels
Output	The output data is usually a numerical value, for example, assessment or classification	The weekend can have several formats, such as text, estimate or sound

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kovtun, V.; Zaitseva, E.; Levashenko, V.; Grochla, K.; Kovtun, O. Small Stochastic Data Compactification Concept Justified in the Entropy Basis. Entropy 2023, 25, 1567. https://doi.org/10.3390/e25121567

AMA Style

Kovtun V, Zaitseva E, Levashenko V, Grochla K, Kovtun O. Small Stochastic Data Compactification Concept Justified in the Entropy Basis. Entropy. 2023; 25(12):1567. https://doi.org/10.3390/e25121567

Chicago/Turabian Style

Kovtun, Viacheslav, Elena Zaitseva, Vitaly Levashenko, Krzysztof Grochla, and Oksana Kovtun. 2023. "Small Stochastic Data Compactification Concept Justified in the Entropy Basis" Entropy 25, no. 12: 1567. https://doi.org/10.3390/e25121567

APA Style

Kovtun, V., Zaitseva, E., Levashenko, V., Grochla, K., & Kovtun, O. (2023). Small Stochastic Data Compactification Concept Justified in the Entropy Basis. Entropy, 25(12), 1567. https://doi.org/10.3390/e25121567

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Small Stochastic Data Compactification Concept Justified in the Entropy Basis

Abstract

1. Introduction

2. Models and Methods

2.1. Statement of the Research

2.2. The Concept of Entropy-Optimal Compactification of Stochastic Empirical Data

3. Results

4. Discussion

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Correction Statement

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI