Value of Information in the Binary Case and Confusion Matrix †

Phys. Sci. Forum 2022, 5, 8

Abstract: The simplest Bayesian system used to illustrate ideas of probability theory is a coin and a boolean utility function. To illustrate ideas of hypothesis testing, estimation or optimal control, one needs to use at least two coins and a confusion matrix accounting for the utilities of four possible outcomes. Here we use such a system to illustrate the main ideas of Stratonovich's value of information (VoI) theory in the context of a financial time-series forecast. We demonstrate how VoI can provide a theoretical upper bound on the accuracy of the forecasts, facilitating the analysis and optimization of models.


Introduction
The concept of value of information has different definitions in the literature [1,2]. Here we follow the works of Ruslan Stratonovich and his colleagues, who were inspired by Shannon's work on rate distortion [3] and made a number of important developments in the 1960s [2]. These mainly theoretical results are gaining new interest thanks to the advancements in data science and machine learning and the need for a deeper understanding of the role of information in learning. We shall review the value of information theory in the context of optimal estimation and hypothesis testing, although the context of optimal control is also relevant.
Consider a probability space $(\Omega, P, \mathcal{A})$ and a random variable $x : \Omega \to X$ (a measurable function). The optimal estimation of $x \in X$ is the problem of finding an element $y \in Y$ maximizing the expected value of some utility function $u : X \times Y \to \mathbb{R}$ (or minimizing the expected cost $-u$). The optimal value is
$$U(0) := \sup_{y \in Y} E_{P(x)}\{u(x, y)\},$$
where zero designates the fact that no information about the specific value of $x \in X$ is given, only the prior distribution $P(x)$. At the other extreme, let $z \in Z$ be another random variable that communicates full information about each realization of $x$. This entails that there is an invertible function $z = f(x)$ such that $x = f^{-1}(z)$ is determined uniquely by the 'message' $z \in Z$. The corresponding optimal value is
$$U(\infty) := E_{P(x)}\Bigl\{\sup_{y(z)} u(x, y(z))\Bigr\},$$
where an optimal $y$ is found for each $z$ (i.e., optimization over all mappings $y : Z \to Y$). In the context of estimation, variable $x$ is the response (i.e., the variable of interest) and $z$ is the predictor. The mapping $y(z)$ represents a model with output $y \in Y$. For an intermediate amount of information $I$ with optimal value $U(I)$, the value of information is the difference $V(I) := U(I) - U(0)$.
There are, however, different ways in which the information amount $I$ and the quantity $U(I)$ can be defined, leading to different types of the value function $V(I)$. For example, consider a mapping $f : X \to Z$ with a constraint $|Z| \le e^{I} < |X|$ on the cardinality of its image. The mapping $f$ partitions its domain into a finite number of subsets $f^{-1}(z) = \{x \in X : f(x) = z\}$. Then, given a specific partition $z(x)$, one can find optimal $y(z)$ maximizing the conditional expected utility $E_{P(x \mid z)}\{u(x, y) \mid z\}$ for each subset $f^{-1}(z) \ni x$. This optimization should be repeated for different partitions $z(x)$, and the optimal value $U(I)$ is defined over all partitions $z(x)$ satisfying the cardinality constraint $\ln |Z| \le I$:
$$U(I) := \sup_{z(x)} \Bigl[ E_{P(z)}\Bigl\{ \sup_{y(z)} E_{P(x \mid z)}\{u(x, y) \mid z\} \Bigr\} : \ln |Z| \le I \Bigr]. \tag{1}$$
Here, $P(z) = P\{x \in f^{-1}(z)\}$. The quantity $I = \ln |Z|$ is called Hartley's information, and the difference $V(I) = U(I) - U(0)$ in this case is the value of Hartley's information.
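The three quantities U(0), U(I) with a cardinality constraint, and U(∞) can be illustrated on a small finite example. The following is a minimal sketch, not the authors' procedure: it enumerates every mapping z = f(x) into at most k cells by brute force, and the prior `P`, the utility matrix `u` and the function name `hartley_optimal_value` are hypothetical choices made for illustration.

```python
from itertools import product

import numpy as np


def hartley_optimal_value(P, u, k):
    """Brute-force U(I) for I = ln k.

    Searches all mappings z = f(x) from X into at most k labels and, for each
    cell f^{-1}(z), picks the y that maximises the utility summed over the cell.
    P: prior over X, shape (n,). u: utility matrix u[x, y], shape (n, m).
    """
    n = len(P)
    best = -np.inf
    for labels in product(range(k), repeat=n):      # every mapping z = f(x)
        labels = np.array(labels)
        value = 0.0
        for z in range(k):
            cell = labels == z
            if cell.any():
                # sum over x in the cell of P(x) u(x, y), maximised over y;
                # this equals P(z) * max_y E{u(x, y) | z}
                value += (P[cell] @ u[cell]).max()
        best = max(best, value)
    return best


P = np.array([0.4, 0.3, 0.2, 0.1])      # hypothetical prior P(x)
u = np.eye(4)                           # hypothetical utility: 1 if y == x, else 0

U0 = hartley_optimal_value(P, u, 1)     # no information: sup_y E{u(x, y)}
U2 = hartley_optimal_value(P, u, 2)     # Hartley information I = ln 2
Uinf = hartley_optimal_value(P, u, 4)   # full information: E{sup_y u(x, y)}
print(U0, U2, Uinf, "V(ln 2) =", U2 - U0)
```

The number of mappings grows as k^|X|, which is one way to see why the value of Hartley's information becomes expensive to compute for anything but toy problems.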
One can relax the cardinality constraint and replace it with the constraint $H(Z) \le I$ on entropy $H(Z) := -\sum_z P(z) \ln P(z)$. In this case, $V(I)$ is called the value of Boltzmann's information. One can see from Equation (1) that the computation of the value of Hartley's or Boltzmann's information is quite demanding and may involve a procedure such as the k-means clustering algorithm or training a multilayer neural network. Thus, these values of information are impractical to use because of their high computational cost. The main result of Stratonovich's theory [4] is that the upper bound on Hartley's or Boltzmann's values of information is given by the value of Shannon's information, and that asymptotically all these values are equivalent (Theorems 11.1 and 11.2 in [4]). The value of Shannon's information is much easier to compute.
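Recall the definition of Shannon's mutual information [3]:
$$I(X, Y) := E_W\Bigl\{\ln \frac{W(x, y)}{P(x)\,Q(y)}\Bigr\} = H(X) - H(X \mid Y),$$
where $W(x, y) = P(x \mid y)\,Q(y)$ is the joint probability distribution on $X \times Y$, and $H(X \mid Y)$ is the conditional entropy. Under broad assumptions on the reference measures (see Theorem 1.16 in [4]), the following inequalities are valid:
$$0 \le I(X, Y) \le H(X) \le \ln |X|.$$
The value of Shannon's information is defined using the quantity
$$U(I) := \sup_{P(y \mid x)} \bigl[ E_W\{u(x, y)\} : I(X, Y) \le I \bigr]. \tag{2}$$
The optimization above is over all conditional probabilities $P(y \mid x)$ (or joint measures $W(x, y) = P(y \mid x)\,P(x)$) satisfying the information constraint $I(X, Y) \le I$. Contrast this with $U(I)$ for Hartley's or Boltzmann's information (1), where optimization is over the mappings $y(x) = y \circ z(x)$. As was pointed out in [5], the relation between functions (1) and (2) is similar to that between optimal transport problems in the Monge and Kantorovich formulations.
Joint distributions optimal in the sense of (2) are found using the standard method of Lagrange multipliers (e.g., see [4,6]):
$$W(x, y; \beta) = P(x)\,Q(y)\,e^{\beta u(x, y) - \gamma(\beta, x)}, \tag{3}$$
where parameter $\beta^{-1}$, called temperature, is the Lagrange multiplier associated with the constraint $I(X, Y) \le I$. Distributions $P$ and $Q$ are the marginals of $W$, and function $\gamma(\beta, x)$ is defined by normalization $\sum_{x,y} W(x, y; \beta) = 1$. In fact, taking partial traces of solution (3) gives two equations:
$$\sum_x W(x, y; \beta) = Q(y) \iff \sum_x e^{\beta u(x, y) - \gamma(\beta, x)}\,P(x) = 1, \tag{4}$$
$$\sum_y W(x, y; \beta) = P(x) \iff \sum_y e^{\beta u(x, y) - \gamma(\beta, x)}\,Q(y) = 1. \tag{5}$$
Equation (5) defines function $\gamma(\beta, x) = \ln \sum_y e^{\beta u(x, y)}\,Q(y)$. If the linear transformation $T(\cdot) = \sum_x e^{\beta u(x, y)}(\cdot)$ has an inverse, then from Equation (4) one obtains $e^{-\gamma(\beta, x)}\,P(x) = T^{-1}(1)$ or
$$\gamma(\beta, x) = \gamma_0(\beta, x) - h(x),$$
where $\gamma_0(\beta, x) := -\ln \sum_y b(x, y)$, $b(x, y)$ is the kernel of the inverse transformation $T^{-1}$, and $h(x) = -\ln P(x)$ is random entropy or surprise. Integrating the above with respect to measure $P(x)$ we obtain
$$\Gamma(\beta) = \Gamma_0(\beta) - H(X),$$
where $\Gamma_0(\beta) := \sum_x P(x)\,\gamma_0(\beta, x)$ and $\Gamma(\beta) := \sum_x P(x)\,\gamma(\beta, x)$ is the cumulant generating function of optimal distribution (3). Indeed, the expected utility and Shannon's information for this distribution are
$$U(\beta) = \Gamma'(\beta) = \Gamma_0'(\beta), \qquad I(\beta) = \beta\,\Gamma'(\beta) - \Gamma(\beta) = H(X) - \bigl[\Gamma_0(\beta) - \beta\,\Gamma_0'(\beta)\bigr].$$
The first formula can be obtained directly by differentiating $\Gamma(\beta)$, and the second by substitution of (3) into the formula for Shannon's mutual information.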
The general strategy for computing the value of Shannon's information is to derive the expressions for $U(\beta)$ and $I(\beta)$ from function $\Gamma_0(\beta)$ (alternatively, one can obtain $U(\beta^{-1})$ and $I(\beta^{-1})$ from free energy $F_0(\beta^{-1}) = -\beta^{-1}\,\Gamma_0(\beta)$). Then the dependency $U(I)$ is obtained either parametrically or by excluding $\beta$. Let us now apply this to the simplest 2 × 2 case.

Value of Shannon's Information for the 2 × 2 System
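Let $X \times Y = \{x_1, x_2\} \times \{y_1, y_2\}$, and let $u : X \times Y \to \mathbb{R}$ be the utility function, which we can represent by a 2 × 2 matrix with elements $u_{ij} = u(x_i, y_j)$:
$$u = \begin{pmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \end{pmatrix} = \begin{pmatrix} c_1 + d_1 & c_1 - d_1 \\ c_2 - d_2 & c_2 + d_2 \end{pmatrix}.$$
It is called the confusion matrix in the context of hypothesis testing, where rows correspond to the true states $\{x_1, x_2\}$, and columns correspond to accepting or rejecting the hypothesis $\{y_1, y_2\}$. The set of all joint distributions $W(x, y)$ is a 3-simplex (tetrahedron), shown in Figure 1. The 2D surface in the middle is the set of all product distributions $W(x, y; 0) = P(x)\,Q(y)$, which correspond to the minimum $I(X, Y) = 0$ of mutual information (independent $x$, $y$).
With no additional information about $x$, the decision $y_1$ to accept or $y_2$ to reject the hypothesis is completely determined by the utilities and prior probabilities $P(x_1) = p$ and $P(x_2) = 1 - p$. Thus, one has to compare expected utilities $E_P\{u \mid y_1\} = p\,u_{11} + (1 - p)\,u_{21}$ and $E_P\{u \mid y_2\} = p\,u_{12} + (1 - p)\,u_{22}$. The output distribution $Q(y)$ is an elementary δ-distribution:
$$Q(y_1) = \begin{cases} 1 & \text{if } E_P\{u \mid y_1\} \ge E_P\{u \mid y_2\}, \\ 0 & \text{otherwise}, \end{cases} \qquad Q(y_2) = 1 - Q(y_1).$$
The optimal value corresponding to $I = 0$ is
$$U(0) = \max\{E_P\{u \mid y_1\},\, E_P\{u \mid y_2\}\} = p\,c_1 + (1 - p)\,c_2 + |p\,d_1 - (1 - p)\,d_2|.$$
With $c_1 = c_2 = c = 1/2$ and $d_1 = d_2 = d = 1/2$ (so that $u$ is the identity matrix and the expected utility is the accuracy), the value $U(0) = \frac{1}{2} + \frac{1}{2}|2p - 1|$ represents the best possible accuracy for prior probabilities $P(x) \in \{p, 1 - p\}$. If additional information about $x$ is communicated, say by some random variable $z \in Z$, then the maximum possible improvement $V(I) = U(I) - U(0)$ is the value of this information.
The first step in deriving function $U(I)$ for the value of Shannon's information (2) is to obtain the expression for function $\gamma_0(\beta, x)$. Writing Equation (4) in the matrix form $(e^{\beta u(x, y)})^T\,P(x)\,e^{-\gamma(\beta, x)} = 1$ and using the inverse matrix $((e^{\beta u(x, y)})^T)^{-1}$ gives the solution for function $e^{-\gamma_0(\beta, x)} = P(x)\,e^{-\gamma(\beta, x)}$:
$$\begin{pmatrix} e^{-\gamma_0(\beta, x_1)} \\ e^{-\gamma_0(\beta, x_2)} \end{pmatrix} = \frac{1}{\det (e^{\beta u})^T} \begin{pmatrix} e^{\beta u_{22}} & -e^{\beta u_{21}} \\ -e^{\beta u_{12}} & e^{\beta u_{11}} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix},$$
where $\det (e^{\beta u})^T = e^{\beta (u_{11} + u_{22})} - e^{\beta (u_{12} + u_{21})} = 2 e^{\beta (c_1 + c_2)} \sinh[\beta (d_1 + d_2)]$. This gives two equations:
$$\gamma_0(\beta, x_1) = \beta c_1 + \ln \sinh[\beta (d_1 + d_2)] - \ln \sinh(\beta d_2),$$
$$\gamma_0(\beta, x_2) = \beta c_2 + \ln \sinh[\beta (d_1 + d_2)] - \ln \sinh(\beta d_1).$$
Therefore, the expression for function $\Gamma_0(\beta) := p\,\gamma_0(\beta, x_1) + (1 - p)\,\gamma_0(\beta, x_2)$ is
$$\Gamma_0(\beta) = \beta\,[p\,c_1 + (1 - p)\,c_2] + \ln \sinh[\beta (d_1 + d_2)] - p \ln \sinh(\beta d_2) - (1 - p) \ln \sinh(\beta d_1).$$
Its first derivative $\Gamma_0'(\beta)$ gives the expression for $U(\beta)$:
$$U(\beta) = p\,c_1 + (1 - p)\,c_2 + (d_1 + d_2) \coth[\beta (d_1 + d_2)] - p\,d_2 \coth(\beta d_2) - (1 - p)\,d_1 \coth(\beta d_1).$$
The expression for information is obtained from $I(\beta) = H(X) - [\Gamma_0(\beta) - \beta\,\Gamma_0'(\beta)]$. Two functions $U(\beta)$ and $I(\beta)$ define parametric dependency $U(I)$ for the value of Shannon's information (2).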
It is important to note that in the limit β → ∞, corresponding to an increase of information to its maximum, the output probabilities Q(y) ∈ {q, 1 − q} converge to P(x) ∈ {p, 1 − p}.
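The parametric dependency U(I) for the 2 × 2 system can also be traced numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: it assumes the optimal joint distribution has the exponential form W(x, y; β) = P(x) Q(y) exp[β u(x, y) − γ(β, x)], solves the matrix form of Equation (4) for P(x) exp[−γ(β, x)], and then solves Equation (5) for Q(y). The function name `shannon_voi_point` is hypothetical, and β must be strictly positive (at β = 0 the matrix is singular).

```python
import numpy as np


def shannon_voi_point(beta, p, u):
    """Return (U, I, Q) for a 2 x 2 system at inverse temperature beta > 0.

    U is the expected utility, I the mutual information in nats, and Q the
    output marginal of the assumed optimal distribution W(x, y; beta).
    """
    u = np.asarray(u, dtype=float)
    P = np.array([p, 1.0 - p])
    A = np.exp(beta * u)                       # A[x, y] = exp(beta * u(x, y))
    v = np.linalg.solve(A.T, np.ones(2))       # Eq. (4): v[x] = P(x) exp(-gamma(beta, x))
    if np.any(v <= 0):
        raise ValueError("no valid solution for this utility matrix and beta")
    Q = np.linalg.solve(A, P / v)              # Eq. (5): sum_y A[x, y] Q(y) = P(x) / v[x]
    if np.any(Q < -1e-12):
        raise ValueError("beta is too small for this prior: Q(y) would be negative")
    Q = np.clip(Q, 0.0, None)
    W = v[:, None] * A * Q[None, :]            # joint distribution W(x, y; beta)
    U = float(np.sum(W * u))                   # expected utility
    positive = W > 0
    I = float(np.sum(W[positive] * np.log(W[positive] / np.outer(P, Q)[positive])))
    return U, I, Q


accuracy_utility = np.eye(2)                   # c = d = 1/2: utility is 1 for a correct decision
for beta in (0.5, 1.0, 2.0, 4.0):
    U, I, Q = shannon_voi_point(beta, 0.5, accuracy_utility)
    print(f"beta={beta:3.1f}  I={I:.4f} nats  U={U:.4f}  V={U - 0.5:.4f}")
```

For an asymmetric prior the solution is only valid above some positive β, where Q(y) first becomes non-negative; the sketch raises an error below that point instead of extrapolating.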

Application: Accuracy of Time-Series Forecasts
In this section, we illustrate how the value of information can facilitate the analysis of the performance of data-driven models. Here we use financial time-series data and predict the signs of future log returns. Thus, if $s(t)$ and $s(t - 1)$ are prices of an asset at two time moments, then $r(t) = \ln[s(t)/s(t - 1)]$ is the log return at $t$. The models will try to predict whether the future log return $r(t + 1)$ is positive or negative. Thus, we have a 2 × 2 system, where $x \in \{x_1, x_2\}$ is the true sign, and $y \in \{y_1, y_2\}$ is the prediction. The accuracy of different models will be evaluated against the theoretical upper bound, defined by the value of information.
The data used here are the daily closing prices $s(t)$ of several cryptocurrency pairs between 1 January 2019 and 11 January 2021. Figure 2 shows the price of Bitcoin against USD (left) and the corresponding log returns (right). Predicting price changes is very challenging. In fact, in economics, log returns are often assumed to be independent (and hence prices $s(t)$ are assumed to be Markov). Indeed, one can see no obvious relation in the left chart of Figure 3, which plots log returns $r(t)$ (abscissa) and $r(t + 1)$ (ordinate). In reality, however, some amounts of information and correlations exist, which can be seen from the plot of the autocorrelation function for BTC/USD shown in the right chart of Figure 3.
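The quantities described above (log returns, their signs, and the sample autocorrelation) can be computed in a few lines. This is a minimal sketch on synthetic prices: the random-walk series, the seed, and the helper `autocorrelation` are hypothetical stand-ins for the cryptocurrency data, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical closing prices s(t), generated as a geometric random walk.
prices = 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.02, size=500)))

r = np.diff(np.log(prices))        # r(t) = ln[s(t) / s(t - 1)]
x = np.where(r > 0, 1, -1)         # sign of each log return (zero counted as negative)


def autocorrelation(series, lag):
    """Sample autocorrelation of a 1-D series at a positive integer lag."""
    centred = series - series.mean()
    return float(centred[lag:] @ centred[:-lag] / (centred @ centred))


print([round(autocorrelation(r, k), 3) for k in range(1, 6)])
```

On an independent random walk such as this one, the autocorrelations should be close to zero; any systematic departure from zero in real data indicates information that a forecasting model could exploit.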

Additional information may be contained in cross-correlations (correlations between log returns of different symbols in the dataset). Thus, the vector of predictors used here is an $m \times n$-tuple, where $m$ is the number of symbols used, and $n$ is the number of time lags. In this paper, we report the results of models using the range $m \in \{1, 2, \dots, 5\}$ of symbols (BTC/USD, ETH/USD, DAI/BTC, XRP/BTC, IOT/BTC) and $n \in \{2, 3, \dots, 20\}$ of lags. This means that the models used predictors $(z_1, \dots, z_{m \times n})$, where $m \times n$ ranged from 2 to 100. The model output $y(z)$ is the forecast of the sign $x \in \{-1, 1\}$ (the response) of future log return $r(t + 1)$ of BTC/USD. Here we report results from the following models:
1. Logistic regression (LM). This model has no hyperparameters.
2. Partial least squares discrimination (PLSD). We used the SIMPLS algorithm [7] with 3 components.
3. Feed-forward neural network (NN). Here we used one hidden layer with three logistic units.
In order to analyse the performance of models using the value of information, one has to estimate the amount of information between the predictors $z_1, \dots, z_{m \times n}$ and the response variable $x$. Here we employ two methods. The first uses the following Gaussian formula [4]:
$$I(X, Z) = \frac{1}{2}\bigl[\ln \det K_x + \ln \det K_z - \ln \det K_{x \oplus z}\bigr],$$
where $K_x$, $K_z$ and $K_{x \oplus z}$ are the covariance matrices of $x$, of $z$ and of the joint vector $(x, z)$. Because the distributions of log returns are generally not Gaussian, this formula is an approximation (in fact, it gives a lower bound). The second method is based on the discretization of continuous variables. Because models were used to predict signs of log returns, here we used discretization into two subsets. Figure 4 shows the average amounts of information $I(X, Z)$ in the training sets, computed using the Gaussian formula (left) and using binary discretization (right). Information (ordinate) is plotted against the number $n$ of lags (abscissa) and for $m \in \{1, 2, \dots, 5\}$ symbols (different curves). One can see that the amounts of information using Gaussian approximation (left) are generally lower than those using discretization (right). We note, however, that linear models can only use linear dependencies (correlations), which means that Gaussian approximation is sufficient for assessing the performance of linear models, such as LM and PLSD. Non-linear models, on the other hand, can potentially use all information present in the data. Therefore, we used information estimated with the second method to assess the performance of NN.
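Both information estimates can be sketched briefly. The first function below implements the standard Gaussian mutual-information formula from sample covariance matrices. The second is a plug-in estimate after reducing two variables to their signs; it handles a single predictor only, which is an assumption made here for illustration, since the text does not specify how the discretized estimate treats the full predictor vector. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def gaussian_information(x, Z):
    """I(X, Z) in nats under a joint Gaussian assumption.

    x: response, shape (T,). Z: predictors, shape (T,) or (T, k).
    """
    x = np.asarray(x, dtype=float).reshape(len(x), 1)
    Z = np.asarray(Z, dtype=float).reshape(len(x), -1)
    K = np.cov(np.hstack([x, Z]), rowvar=False)     # joint covariance of (x, z)
    _, logdet_x = np.linalg.slogdet(K[:1, :1])      # ln det K_x
    _, logdet_z = np.linalg.slogdet(K[1:, 1:])      # ln det K_z
    _, logdet_xz = np.linalg.slogdet(K)             # ln det of the joint covariance
    return 0.5 * (logdet_x + logdet_z - logdet_xz)


def binary_information(x, z):
    """Plug-in I(X, Z) in nats after reducing both variables to their signs."""
    a = np.asarray(x) > 0
    b = np.asarray(z) > 0
    info = 0.0
    for i in (False, True):
        for j in (False, True):
            w = np.mean((a == i) & (b == j))        # joint frequency of the sign pair
            if w > 0:
                info += w * np.log(w / (np.mean(a == i) * np.mean(b == j)))
    return info


rng = np.random.default_rng(1)
z = rng.normal(size=400)                            # hypothetical predictor
x = 0.5 * z + rng.normal(size=400)                  # hypothetical correlated response
print(gaussian_information(x, z), binary_information(x, z))
```

With a single predictor the Gaussian estimate reduces to −½ ln(1 − ρ²), where ρ is the sample correlation, which gives a quick sanity check. Using `slogdet` rather than `det` avoids overflow or underflow when the predictor vector is long.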

For each collection of predictors $(z_1, \dots, z_{m \times n})$ and response $x$, the data were split into multiple training and testing subsets using the following rolling window procedure: we used 200- and 50-day data windows for training and testing, respectively; after training and testing the models, the windows were moved forward by 50 days and the process repeated. Thus, the data of approximately 700 days (January 2019 to January 2021) were split into (700 − 200)/50 = 10 pairs of training and testing sets. The results reported here are the averages over these 10 pairs. Figure 5 shows the accuracies of models plotted against information amounts $I$ in the training data. The top row shows results on the training sets (i.e., fitted values) and the bottom row on new data (i.e., predicted values). Different curves are plotted for different numbers of symbols $m \in \{1, \dots, 5\}$. The theoretical upper bounds are shown by the Accuracy($I$) curves computed using the inverse of function (6) with $c = d = 1/2$ and $p = 1/2$. Here we note the following observations:
1. The accuracy of fitting the training data closely follows the theoretical curve Accuracy($I$).
2. The accuracy of predicting new data (testing sets) is significantly lower.
3. Increasing information increases the accuracy on training data, but not necessarily on new data.
4. Models using $m > 1$ symbols appear to achieve better accuracy than models using $m = 1$ symbol with the same amounts of information. Thus, surprisingly, cross-correlations potentially provide more valuable information for forecasts than autocorrelations.

Discussion
We have reviewed the main ideas of Stratonovich's value of information theory [2,4] and applied it to the simplest 2 × 2 Bayesian system. We explicitly performed the main computations for the cumulant generating function $\Gamma(\beta) = \Gamma_0(\beta) - H(X)$ and derived functions $U(\beta)$ and $I(\beta)$ defining the dependency $U(I)$ and the value of Shannon's information $V(I) = U(I) - U(0)$. The main application of the considered binary example is the evaluation of the accuracy of model predictions or hypothesis testing. The analysis of the performance of data-driven models can be enriched by the use of the value of information. However, one needs to be careful about the estimation of the amount of information in the data. Gaussian approximation of mutual information can be used for linear models. However, other techniques should be used for the analysis of non-linear models, such as neural networks. Here we applied the value of information to the analysis of financial time-series forecasts. These methods can be generalized to many other machine learning and data science problems.