Principle of Information Increase: An Operational Perspective of Information Gain in the Foundations of Quantum Theory

A measurement performed on a quantum system is an act of gaining information about its state, a view that is widespread in practical and foundational work in quantum theory. However, the concept of information in quantum theory reconstructions is multiply-defined, and its conceptual foundations remain surprisingly under-explored. In this paper, we investigate the gain of information in quantum measurements from an operational viewpoint. We show that the continuous extension of the Shannon entropy naturally admits two distinct measures of information gain, differential information gain and relative information gain, and that these have radically different characteristics. In particular, while differential information gain can increase or decrease as additional data is acquired, relative information gain consistently grows, and moreover exhibits asymptotic indifference to the data and to the choice of Bayesian prior. In order to make a principled choice between these measures, we articulate a Principle of Information Increase, which incorporates Summhammer's proposal that more data from measurements leads to more knowledge about the system, and also takes into consideration black swan events. This principle favors differential information gain as the more relevant metric in two-outcome quantum systems, and guides the selection of priors for these information measures. Finally, we show that, of the beta distribution priors, the Jeffreys' binomial prior is the prior that ensures maximal robustness of information gain to the particular data sequence obtained in a run of experiments.


Introduction
A measurement performed on a quantum system is an act of acquiring information about its state. This informational perspective on quantum measurement is widely embraced in practical applications such as quantum tomography [22,19,24,15], Bayesian experimental design [21], and informational analysis of experimental data [23,18]. It is also embraced in foundational research. In particular, information assumes a central role in the quantum reconstruction program [14,12], which seeks to elucidate the fundamental physical origins of quantum theory by deriving its formalism from information-inspired postulates [3,10,13,11,4,20,8,16,1,7,5]. Nonetheless, in the foundational exploration of quantum theory, the concept of information is articulated and formalized in many different ways, which raises the question of whether there exists a more systematic basis for choosing how to formalize the concept of information within this domain. In this paper, we scrutinize the notion of information from an operational standpoint, and propose a physically intuitive postulate to determine the appropriate information gained from measurements. We argue, in particular, that the information gained in the event of a black swan, an outcome rendered highly improbable by the data accumulated so far, should indeed be negative. By combining this observation with Summhammer's criterion, we are led to the Principle of Information Increase: the information gain from additional data should be positive asymptotically, and negative in extreme cases. On the basis of the Principle of Information Increase, we show that differential information gain is the more appropriate measure.
In addition, we formulate a new criterion, the robustness of information gain, for selecting priors to use with the differential information gain. The essential idea behind this criterion is as follows. If the result of the additional data D′ is fixed, then the information gain due to D′ will vary for different D. Robustness quantifies this difference in information gain across all possible data D. We show that, amongst the beta distributions, the Jeffreys' prior exhibits the highest level of robustness.
The quantification of knowledge gained from additional data is a topic that has received limited attention in the literature. In the realm of foundational research on quantum theory, this issue has been acknowledged but not extensively explored. Summhammer initially proposed the notion that 'more data from measurements lead to more knowledge about the system', but did not employ information theory to address this problem, instead using changes in measurement uncertainty to quantify knowledge obtained in the asymptotic limit. This approach limits the applicability of the idea, as it excludes considerations pertaining to prior probability distributions and does not readily apply to finite data.
Wootters demonstrated the significance of Jeffreys' prior in the context of quantum systems from a different information-theoretical perspective [27]. In the domain of communication through quantum systems, Jeffreys' prior can maximize the information gained from measurements. Wootters approaches the issue from a more systematic perspective, utilizing mutual information to measure the information obtained from measurements. However, mutual information quantifies the average information gain over all possible data sequences, which is not suitable for addressing the specific scenario we discussed earlier, where the focus is on the information gain from a fixed data sequence.
More broadly, the question of how much information is gained with the acquisition of additional data has been a relatively under-explored topic in both practical applications and foundational research on quantum theory. Commonly, mutual information is employed as a utility function. However, as noted above, mutual information essentially represents the expected information gain averaged over all possible data sequences. Consequently, it does not address the specific question of how much information is gained when a particular additional data point is obtained. From our perspective, this averaging process obscures essential edge effects, including black swan events, which, as we will discuss, serve as valuable guides for selecting appropriate information measures.
While our investigation primarily focuses on information gain in quantum systems, the principles and conclusions we derive can be extended to general probabilistic systems characterized by fixed continuous parameters. Based on our analysis, we recommend quantification using differential information gain and the utilization of Jeffreys' prior. If one seeks to calculate the expected information gain in the next step, both the expected differential information gain and the expected relative information gain can be employed since, as we demonstrate, they yield the same result.
The paper is organized as follows. In Section 2, we detail the two information gain measures, both of which have their origins in the generalization of Shannon entropy to continuous probability distributions. We will also delve into Jaynes' approach to continuous entropy, which serves as the foundation for understanding these two information gain measures. Sections 3 and 4 focus on the numerical and asymptotic analysis of differential information gain and relative information gain. Our primary emphasis is on how these measures behave under different prior distributions. We will explore black swan events, where the additional data D′ is highly improbable given D. In this unique context, we will assess the physical meaningfulness of the two information gain measures. In Section 5, we will discuss expected information gain under the assumption that data D′ from additional measurements has not yet been received. Despite the general differences between the two measures, it is intriguing to note that the two expected information gain measures are equal. Section 6 presents a comparison of the two information gain measures and the expected information gain. It is within this section that we propose the Principle of Information Increase, which crystallises the results of our analysis of the two information gain measures. Finally, Section 7 explores the relationships between our work and other research in the field.

Entropy of Continuous Distribution
The Shannon entropy serves as a measure of uncertainty concerning a random variable before we have knowledge of its value. If we regard information as the absence of uncertainty, the Shannon entropy can also be used as a measure of the information gained about a variable after acquiring knowledge of its value. However, it is important to note that Shannon entropy is applicable only to discrete random variables. To extend the concept of entropy to continuous variables, Shannon introduced the idea of differential entropy. Unlike Shannon entropy, differential entropy was not derived on an axiomatic basis. Moreover, it has a number of limitations.
First, the differential entropy can yield negative values, as exemplified by the differential entropy of a uniform distribution over the interval [0, 1/2], which equals −log 2. Negative entropy, indicating a negative degree of uncertainty, lacks meaningful interpretation. Second, the differential entropy is coordinate-dependent [6], so that its value is not conserved under a change of variables. This implies that viewing the same data through different coordinate systems may result in the assignment of different degrees of uncertainty. Since the choice of coordinate system is usually considered arbitrary, this coordinate-dependence also lacks a meaningful interpretation.
In an attempt to address the challenges associated with continuous entropy, Jaynes introduced a solution known as the limiting density of discrete points (LDDP) approach [17]. In this approach, the probability density p(x) of a random variable X is initially defined on a set of discrete points x ∈ {x_1, x_2, ..., x_n}. Jaynes proposed an invariant measure m(x) such that, as the collection of points x_i becomes increasingly numerous, in the limit as n → ∞,

lim_{n→∞} (1/n) × (number of points in (a, b)) = ∫_a^b m(x) dx.

With the help of m(x), the entropy of X can then be represented as

H(X) = lim_{n→∞} [ log n − ∫ p(x) log( p(x)/m(x) ) dx ].     (2)

In this manner, the weaknesses associated with differential entropy appear to be resolved: this quantity remains invariant under changes of variables and is always non-negative. A similar approach is also discussed in [6]. However, two new issues arise. In Eq. (2), H(X) contains an infinite term, and the measure function m(x) is unknown.
Regarding the infinite term, two potential solutions exist. The first option is to retain the infinite term and to restrict interpretation to differences between the continuous entropies of two distributions. The second solution is more straightforward: simply omit the problematic log n term.

Entropy of continuous distribution as a difference
For example, when variable X is updated to X′ due to certain actions, the decrease in entropy can be expressed as

ΔH(X → X′) = H(X) − H(X′) = ∫ p′(x) log( p′(x)/m(x) ) dx − ∫ p(x) log( p(x)/m(x) ) dx,

where p′(x) represents the probability distribution of X′. In this context, we assume that the two infinite terms cancel. The quantity ΔH quantifies the reduction in uncertainty about variable X resulting from these actions. This reduction in uncertainty can also be interpreted as an increase in information.

Straightforward solution
Jaynes directly discards the infinite term in Eq. (2). For the sake of convenience, the minus sign is also dropped. This leads to the definition of the Shannon-Jaynes information,

H_Jaynes(X) = ∫ p(x) log( p(x)/m(x) ) dx.

This quantity measures the amount of information we possess regarding the outcome of X rather than the degree of uncertainty about X. H_Jaynes is equivalent to the KL divergence between the distributions p(x) and m(x).
In short, there are two ways to represent the entropy of a continuous distribution, and there is no obvious criterion to choose between them. In the special case where the variable X initially follows a distribution identical to the measure function, i.e., p(x) = m(x), and X undergoes evolution to X′ with distribution p′(x), we find that ΔH(X → X′) = H_Jaynes(X′).
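The following is a minimal numerical sketch (our own illustration, not taken from the paper) of this relationship: taking a uniform measure m(x) on [0, 1] and an illustrative Beta(3, 2) distribution for X′, it computes H_Jaynes(X′) as the KL divergence of p′(x) from m(x) and confirms that it coincides with ΔH(X → X′) when the initial distribution p(x) equals m(x).

```python
import numpy as np
from scipy.stats import beta

# Grid on (0, 1); endpoints excluded because p'(x) vanishes there.
x = np.linspace(1e-6, 1 - 1e-6, 200001)
dx = x[1] - x[0]

m = np.ones_like(x)            # uniform measure function m(x) (an assumption)
p = m.copy()                   # initial distribution p(x) taken equal to m(x)
p_prime = beta.pdf(x, 3, 2)    # illustrative updated distribution p'(x)

# Shannon-Jaynes information of X': KL divergence of p'(x) from m(x)
h_jaynes = np.sum(p_prime * np.log(p_prime / m)) * dx

# Delta H(X -> X') = KL(p'||m) - KL(p||m); the infinite log n terms cancel
delta_h = (np.sum(p_prime * np.log(p_prime / m)) * dx
           - np.sum(p * np.log(p / m)) * dx)

print(h_jaynes, delta_h)       # the two values coincide, as stated in the text
```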
The remaining challenge lies in the selection of the measure function m(x). When applying this concept of continuous entropy to the relationship between information theory and statistical physics, Jaynes opted for a uniform measure function [17]. However, it is far from clear that this choice is universally applicable. Currently, there is no established criterion for the choice of the measure function. It is worth noting that this measure function acts analogously to the prior distribution in the context of Bayesian probability.

Bayesian Information Gain
In a coin-tossing model, let p denote the probability of getting a head in a single toss, and N the total number of tosses. After N tosses, the outcomes can be represented by an N-tuple, denoted T_N = (t_1, t_2, ..., t_N), where each t_i represents the result of the ith toss, with t_i taking values in the set {Head, Tail}. Applying Bayes' rule, the posterior probability for the probability of getting a head is given by

Pr(p|N, T_N, I) = Pr(T_N|p, N, I) Pr(p|I) / Pr(T_N|N, I),

where Pr(p|I) represents the prior. The information gain after N tosses is then the KL divergence from the prior distribution to the posterior distribution,

H( Pr(p|N, T_N, I) || Pr(p|I) ) = ∫ Pr(p|N, T_N, I) log[ Pr(p|N, T_N, I) / Pr(p|I) ] dp.

Based on the earlier discussion of continuous entropy, this quantity can be interpreted in two ways: either as the difference between the information gain after N tosses and the information gain without any tosses, or, directly, as the KL divergence between the posterior distribution and the prior distribution.
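As a concrete, hedged sketch of this update (our own illustration; the Jeffreys-type Beta(0.5, 0.5) prior and the data of 7 heads in 10 tosses are assumptions made for the example), the posterior follows by conjugacy and the information gain is evaluated by numerical integration:

```python
import numpy as np
from scipy.stats import beta

N, h_N = 10, 7
prior = beta(0.5, 0.5)                     # illustrative Jeffreys-type prior
posterior = beta(h_N + 0.5, N - h_N + 0.5) # conjugate update for h_N heads in N tosses

p = np.linspace(1e-6, 1 - 1e-6, 200001)
dp = p[1] - p[0]

# Information gain after N tosses: KL divergence of the posterior from the prior.
gain = np.sum(posterior.pdf(p) * np.log(posterior.pdf(p) / prior.pdf(p))) * dp
print(gain)
```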
When considering the information gain of additional tosses based on the results of the previous N tosses, we may observe two different approaches to represent this quantity.
Let t_{N+1} represent the outcome of the (N+1)th toss, and let T_{N+1} = (t_1, t_2, ..., t_N, t_{N+1}) denote the combined outcomes of the first N tosses and the (N+1)th toss. The posterior distribution after these N+1 tosses is given by

Pr(p|N+1, T_{N+1}, I) = Pr(T_{N+1}|p, N+1, I) Pr(p|I) / Pr(T_{N+1}|N+1, I).

When considering information gain as a difference between two quantities, the first form of information gain for this single toss t_{N+1} can be expressed as

I_diff(t_{N+1}) = H( Pr(p|N+1, T_{N+1}, I) || Pr(p|I) ) − H( Pr(p|N, T_N, I) || Pr(p|I) ).

In this expression, the first term H(Pr(p|N+1, T_{N+1}, I) || Pr(p|I)) represents the information gain from 0 tosses to N+1 tosses, while the second term H(Pr(p|N, T_N, I) || Pr(p|I)) represents the information gain from 0 tosses to N tosses. The difference between these terms quantifies the information gain in the single (N+1)th toss (see Fig. 1). In this context, we refer to I_diff as the differential information gain in a single toss.
Alternatively, we can adopt the straightforward approach of directly calculating the information gain from the Nth toss to the (N+1)th toss. Hence, the second form of information gain is defined as

I_rel(t_{N+1}) = H( Pr(p|N+1, T_{N+1}, I) || Pr(p|N, T_N, I) ),

which is simply the KL divergence from the posterior distribution after N tosses to the posterior distribution after N+1 tosses (see Fig. 2). In this case, we refer to I_rel as the relative information gain in a single toss.
In general, these two quantities, I_diff and I_rel, are not the same, unless N = 0, which corresponds to no measurements having been performed. I_diff can take on negative values, while I_rel is always non-negative due to the properties of the KL divergence. Although the KL divergence is not a proper distance metric between probability distributions (it does not satisfy the triangle inequality), the analogy of displacement and distance in a random walk model helps elucidate the subtle difference between the two types of information gain: I_diff plays the role of a (signed) displacement, whereas I_rel plays the role of a distance.
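A short sketch (our own illustration, with illustrative values) that evaluates both measures in closed form using the standard expression for the KL divergence between two Beta distributions; for a 'black swan' sequence of ten tails followed by a head under the uniform prior, I_diff comes out negative while I_rel stays positive:

```python
from scipy.special import betaln, digamma

def kl_beta(a1, b1, a2, b2):
    """Closed-form KL divergence D( Beta(a1,b1) || Beta(a2,b2) )."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 + b2 - a1 - b1) * digamma(a1 + b1))

def gains_next_toss(N, h_N, alpha, head=True):
    """I_diff and I_rel for the (N+1)th toss under the prior p^alpha (1-p)^alpha."""
    prior = (alpha + 1.0, alpha + 1.0)
    post_N = (h_N + alpha + 1.0, N - h_N + alpha + 1.0)
    h_new = h_N + 1 if head else h_N
    post_N1 = (h_new + alpha + 1.0, N + 1 - h_new + alpha + 1.0)
    i_diff = kl_beta(*post_N1, *prior) - kl_beta(*post_N, *prior)
    i_rel = kl_beta(*post_N1, *post_N)
    return i_diff, i_rel

# "Black swan": ten tails in a row, then a head, under the uniform prior (alpha = 0).
print(gains_next_toss(N=10, h_N=0, alpha=0.0, head=True))   # I_diff < 0, I_rel > 0
```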
We aim to determine which information gain measure is a more suitable choice and introduce an informational postulate to guide our decision. This postulate comes from an intuitive idea: 'more measurements lead to more knowledge about the physical system' [25,26]. For instance, when measuring the value of a physical quantity, we often perform multiple identical measurements to reduce statistical fluctuations in the values. We contemplate whether this idea can be reformulated in terms of information theory. If we quantify 'knowledge' in terms of information gain from data, this notion suggests that the information gain from additional data should be positive if it indeed contributes to our understanding.

Figure 1: Differential Information Gain in a Single Toss. Assume we have data from the first N tosses, denoted T_N. Using a specific prior distribution, we can calculate the information gain for these first N tosses, denoted I(N). If we now consider the (N+1)th toss and obtain the result t_{N+1}, we can repeat the same procedure to calculate the information gain for a total of N+1 tosses, denoted I(N+1). The information gain specific to the (N+1)th toss is obtained as the difference between I(N+1) and I(N).

Figure 2: Relative Information Gain in a Single Toss. The posterior distribution calculated from the results of the first N tosses serves as the prior for the (N+1)th toss. The KL divergence between this posterior and the subsequent posterior represents the information gain in the (N+1)th toss.
This consideration makes relative information gain an appealing choice, as it is always non-negative. However, the derivation of differential information gain also carries significance. This leads to the question of whether this intuitive idea has physical meaning, and if not, what might be a reasonable interpretation of it. In the following sections, we will delve into the concept of differential information gain, both in finite N cases and asymptotic cases. We will explore the implications of negative values of information gain, particularly in extreme situations. Additionally, we will conduct numerical and asymptotic analyses of relative information gain. After analyzing both information gain measures, we will be better equipped to compare and establish connections between them, and to assess the physical meaningfulness of the intuitive idea we set out to explore.

Finite number of tosses
The prior distribution we employ is the beta distribution, which serves as the conjugate prior for the binomial distribution.
In general, the beta distribution is characterized by two parameters; for the sake of convenience, we work with a simplified single-parameter form,

Pr(p|I) = p^α (1 − p)^α / B(α + 1, α + 1),     (10)

where B denotes the beta function. This single-parameter beta distribution encompasses a wide spectrum of priors, including the uniform distribution (when α = 0) and Jeffreys' prior (when α = −0.5).
The differential information gain of the (N+1)th toss, for the case t_{N+1} = 'Head', is derived in Appendix A and given in Eq. (11). There is also a corresponding expression for I_diff(t_{N+1} = 'Tail'), but for brevity we focus on a single case; our calculations consider all possible values of T_N, and the expressions for the two cases (Head and Tail) are symmetric. I_diff is a function of α and of h_N, the number of heads among the first N tosses, which ranges from 0 to N. We will select a specific value of α and calculate all N+1 values of I_diff for each N.
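A sketch of this sweep (continuing the earlier illustration; it reuses gains_next_toss and kl_beta defined above, and the values of N and α are assumptions made for the example):

```python
import numpy as np

def i_diff_sweep(N, alpha):
    """I_diff with t_{N+1} = 'Head' for every h_N from 0 to N (uses gains_next_toss above)."""
    return np.array([gains_next_toss(N, h, alpha, head=True)[0] for h in range(N + 1)])

for alpha in (-0.7, -0.5, 0.0, 1.0):
    vals = i_diff_sweep(20, alpha)
    # Per the text, the spread across h_N should be smallest near alpha = -0.5.
    print(alpha, vals.min(), vals.max())
```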

Positivity of I_diff
Returning to our initial question: "Will more data lead to more knowledge?" If we use the term "knowledge" to refer to the differential information gain and use I_diff to quantify the information gained in each measurement, the question becomes rather straightforward: "Is I_diff always positive?" In Figure 3, we present the results of numerical calculations for various values of N. Upon close examination of the graph, it becomes evident that, except under specific conditions, I_diff is not always positive. In the following sections, we will delve into the conditions that lead to exceptions.
For certain priors, the differential information gain is consistently positive (Figure 3a), while for other priors both positive and negative regions exist (Figures 3b, 3c, 3d). It is worth noting that, for priors leading to negative regions, the lowest line exhibits greater dispersion compared to the other data lines. This lower line represents the scenario where the first N tosses all result in tails, but the (N+1)th toss yields a head. This situation is akin to a black swan event, and negative information gain in this extreme case holds significant meaning: if we have tossed a coin N times and obtained all tails, we anticipate another tail in the next toss; hence, receipt of a head on the next toss raises the degree of uncertainty about the outcome of the next toss, leading to a reduction in information about the coin's bias.

Figure 3: Differential Information Gain (I_diff) vs. N for Different Priors. The y-axis represents the value of I_diff, and the x-axis corresponds to the value of N. In each graph, we fix the value of α to allow for a comparison of the behavior of I_diff under different priors. Given N, there are N+1 points as h_N ranges from 0 to N. Notably, for α = −0.7 all points lie above the x-axis, while for other priors negative points are present and the fraction of negative points becomes constant as N increases. The asymptotic behavior of this fraction is shown in Figure 4. Moreover, the graph is most concentrated when α = −0.5, whereas for α < −0.5 and α > −0.5 the graph becomes more dispersed. This dispersive/concentrating feature is clearly depicted in Figure 6.

Fraction of Negatives
In order to illustrate the variations in the positivity of information gain under different priors, we introduce a new quantity known as the Fraction of Negatives (FoN), defined as the ratio of the number of h_N values that lead to negative I_diff to N+1. For instance, if, for a given α and N = 10, I_diff < 0 when h_N = 0, 1, 2, 3, the FoN under this α and N is 4/11. From Figure 4, we identify a critical point, denoted α_p, which is approximately −0.7. For any α ≤ α_p, I_diff is guaranteed to be positive for all N and h_N values.
If α > α_p, negative terms exist for some h_N; however, the patterns of these negative terms differ across various α values.
A clearer representation of the critical point α_p and the turning point α_0 can be found in Figure 5, where the critical point α_p is approximately −0.68.
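A minimal sketch of the FoN computation (again reusing i_diff_sweep from the sketch above; the values of N and α are illustrative):

```python
def fraction_of_negatives(N, alpha):
    """Fraction of h_N values (0..N) for which I_diff with t_{N+1}='Head' is negative."""
    vals = i_diff_sweep(N, alpha)
    return float((vals < 0).mean())

for alpha in (-0.7, -0.6, -0.5, 0.0, 1.0):
    print(alpha, fraction_of_negatives(100, alpha))
```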

Robustness of I_diff
In Figure 3, different priors not only exhibit varying degrees of positivity but also display varying degrees of variation in I_diff across different values of h_N, which we refer to as divergence. The divergence depends upon the choice of prior. To better understand this dependence, we quantify the dependence of I_diff on h_N by the standard deviation of I_diff across different values of h_N. Figure 6 illustrates how this standard deviation changes with respect to α while keeping N constant.
It is evident that when α is close to −0.5, the standard deviation is at its minimum. Reduced dependence of I_diff on h_N enhances its robustness against the effects of nature, as we attribute h_N to natural factors while N is determined by human measurement choices. As N increases, the minimum point approaches −0.5. In the limit of large N, this minimum point converges to α = −1/2, which means that, under this specific choice of prior, I_diff depends minimally on h_N and primarily on N.
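The dependence described here can be probed with the following sketch (reusing i_diff_sweep from above; the grid of α values and N = 50 are assumptions made for the example), which locates the α that minimizes the standard deviation of I_diff over h_N:

```python
import numpy as np

def dispersion(N, alpha):
    """Standard deviation of I_diff over all h_N (a proxy for the 'divergence' in the text)."""
    return i_diff_sweep(N, alpha).std()

alphas = np.linspace(-0.9, 1.0, 39)
N = 50
best = min(alphas, key=lambda a: dispersion(N, a))
print(best)    # per Figure 6, the minimiser should lie close to -0.5, approaching it as N grows
```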

Large N approximation
Utilizing the recurrence relation ψ(x + 1) = ψ(x) + 1/x and the large-x approximation, the digamma function can be approximated as ψ(x) ≈ ln x − 1/(2x). As a result, we obtain the large N approximation of the differential information gain of Eq. (11), given in Eq. (13). It is evident that when α = −1/2, I_diff = 1/(2(N+1)), indicating that I_diff depends solely on N. This finding aligns with Figure 3, which demonstrates that I_diff is most concentrated when α = −0.5, and is consistent with the results of [9].
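A quick numerical check of this asymptotic statement (reusing i_diff_sweep from the sketches above; N = 200 is an illustrative choice), comparing a typical value of I_diff under Jeffreys' prior with 1/(2(N+1)):

```python
import numpy as np

N = 200
vals = i_diff_sweep(N, -0.5)
# Away from the extreme sequences, I_diff is expected to approach 1/(2(N+1)).
print(np.median(vals), 1.0 / (2 * (N + 1)))   # the two numbers should be close
```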
Figure 4: Fraction of Negatives (FoN) vs. N under Different α. In Figure 3, we can observe that larger α values lead to more dispersed lines and an increased number of negative values for each N. We use the FoN to quantify this fraction of negative points. It appears that for α ≤ −0.7 the FoN is consistently zero, indicating that I_diff is always positive; for −0.7 < α ≤ −0.5 the FoN decreases and tends to zero as N becomes large; while for α > −0.5 the FoN tends to a constant as N increases, and this constant grows with increasing α.

In Figure 4, we observe that the FoN tends to become constant for very large N. These constants can be estimated using the large N approximation of I_diff in Eq. (13): imposing I_diff ≤ 0 yields an asymptotic expression for the FoN. This expression aligns with the asymptotic lines in Figure 4, providing support for the observation mentioned in Figure 3, namely that for α = −0.7 all points lie above the x-axis, while for other priors negative points are present and the fraction of negative points becomes constant.

Figure 5: Fraction of Negatives (FoN) vs. α for Different N. We identify a critical point, denoted α_p, where the FoN equals zero when α ≤ α_p. The critical point exhibits a gradual variation with respect to N, following these patterns: (i) for small N, α_p is in close proximity to −0.68; (ii) for large N, α_p tends to −0.5.

Relative Information Gain
The second form of information gain in a single toss is relative information gain, which represents the KL divergence from the posterior after N tosses to the posterior after N+1 tosses. We continue to use the one-parameter beta distribution prior in the form of Eq. (10); the resulting expression for the relative information gain is derived in Appendix B. Relative information gain exhibits entirely different behavior compared to differential information gain. Due to the properties of the KL divergence, relative information gain is always non-negative, eliminating the need to consider negative values. We aim to explore the dependence of relative information gain on priors and the interpretation of information gain in extreme cases.
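A sketch of the corresponding sweep for I_rel (reusing gains_next_toss from the earlier illustration; N and the α values are illustrative):

```python
import numpy as np

def i_rel_sweep(N, alpha):
    """I_rel with t_{N+1} = 'Head' for every h_N from 0 to N (uses gains_next_toss above)."""
    return np.array([gains_next_toss(N, h, alpha, head=True)[1] for h in range(N + 1)])

for alpha in (-0.5, 0.0, 1.0):
    vals = i_rel_sweep(20, alpha)
    # Always non-negative; the largest value corresponds to the h_N = 0 black swan.
    print(alpha, vals.min(), vals.max())
```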
In Figure 7, it becomes evident that, under different priors, the data lines exhibit similar shapes. This suggests that relative information gain is relatively insensitive to the choice of priors. On each graph, the top line represents the extreme case where the first N tosses result in tails and the (N+1)th toss results in a head. This line is notably separated from the other data lines, indicating that relative information gain behaves more like a measure of the degree of surprise associated with the additional data. In this black swan event, the posterior after N+1 tosses differs significantly from the posterior after N tosses.
For small N, both the average value and standard deviation of I_rel exhibit a clear monotonic relationship with α, meaning that larger values of α result in smaller average values and standard deviations. However, as N becomes large, all priors converge and become indistinguishable. Nonetheless, it is important to note that relative information gain remains heavily dependent on the specific data sequence (h_N). By utilizing the approximation of the digamma function, we can obtain the large N form of I_rel, which in this limit is insensitive to the prior and is affected only by h_N. Thus, it appears that the properties of relative information gain and differential information gain are complementary to each other. The differences between them are summarized in Table 2.

Table 2: Comparison of characteristics of the two measures of information gain (asymptotic behavior for t_{N+1} = 'Head' and asymptotic sensitivity to the prior). Differential information gain: heavily dependent on the prior; for a certain prior, I_diff becomes independent of h_N. Relative information gain: insensitive to the prior; for large N, affected only by h_N.
Figure 8: Robustness of Relative Information Gain (I_rel). The y-axis represents the standard deviation of I_rel across all possible h_N. This demonstrates the substantial independence of I_rel from h_N. Additionally, as N increases, the standard deviations tend to approach zero for all priors.

Expected Information Gain
In this section we discuss a new scenario: after N tosses, but before the (N+1)th toss has been taken, can we predict how much information will be gained in the next toss? The answer is affirmative, as discussed earlier.
After N tosses, we obtain a data sequence T_N with h_N heads. However, we can only estimate the probability p based on the posterior Pr(p|N, T_N, I). The expected value of p can be expressed as

⟨p⟩ = ∫ p Pr(p|N, T_N, I) dp.

Based on this expected value of p, we can calculate the average of the information gain in the (N+1)th toss. We define the expected differential information gain in the (N+1)th toss as

Ī_diff = ⟨p⟩ I_diff(t_{N+1} = 'Head') + (1 − ⟨p⟩) I_diff(t_{N+1} = 'Tail').

Ī_diff represents the expected value of the differential information gain in the (N+1)th toss. Similarly, we can define the expected relative information gain as

Ī_rel = ⟨p⟩ I_rel(t_{N+1} = 'Head') + (1 − ⟨p⟩) I_rel(t_{N+1} = 'Tail').

Surprisingly, Ī_diff = Ī_rel. This relationship holds true for any prior, not being limited to the beta distribution type prior; please refer to Appendix C for a detailed proof. This suggests that there is only one choice for the expected information gain. We first show the numerical results of expected information gain under different priors. It is evident that all data points are above the x-axis, indicating that the expected information gain is positive-definite, as anticipated. Since both I_rel and ⟨p⟩ are positive, it follows that Ī_rel must also be positive.
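The equality Ī_diff = Ī_rel can be checked numerically with the following sketch (reusing gains_next_toss from the earlier illustration; the posterior mean of p is used for ⟨p⟩, and the values of N, h_N, and α are illustrative):

```python
def expected_gains(N, h_N, alpha):
    """Expected differential and relative information gain for the (N+1)th toss."""
    p_mean = (h_N + alpha + 1.0) / (N + 2.0 * alpha + 2.0)   # posterior mean of p
    d_head, r_head = gains_next_toss(N, h_N, alpha, head=True)
    d_tail, r_tail = gains_next_toss(N, h_N, alpha, head=False)
    e_diff = p_mean * d_head + (1.0 - p_mean) * d_tail
    e_rel = p_mean * r_head + (1.0 - p_mean) * r_tail
    return e_diff, e_rel

print(expected_gains(N=15, h_N=4, alpha=-0.5))   # the two values agree, as claimed in the text
```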
As with the discussions of differential information gain and relative information gain, we are also interested in examining the dependence of the expected information gain on α or h_N. However, such dependence appears to be weak, as illustrated in Figure 9 and Figure 10. Expected information gain demonstrates strong robustness concerning variations in α and h_N.
The asymptotic expression for the expected information gain can be obtained in the same manner, using the digamma-function approximation introduced above.

Comparison of Three Information Gain Measures, and the Information Increase Principle
From an operational perspective, the information measures we have considered can be categorized into two types: differential information gain and relative information gain pertain to a measurement that has already been made, while expected information gain pertains to a measurement that has yet to be conducted. As regards positivity, which is tied to the fundamental question "Will acquiring more data from measurements lead to a deeper understanding of the system?": for relative information gain and expected information gain the answer is affirmative, but differential information gain is positive only under certain specific prior conditions.

Summhammer's criterion relates the uncertainty Δθ in a physical quantity θ to the uncertainty Δp in the corresponding probability p, through a requirement (Eq. (23)) connecting the range of θ and that of p. In the large N approximation, Δp = √(p(1 − p)/N), so that Δθ ≈ Δp/|dp/dθ|. One intuitive way to ensure that Eq. (23) holds is to force Δθ to be purely a function of N. Observing the relationship between Δθ and Δp, the simplest solution is to set Δθ = const./√N. Under this solution, the relationship between p and θ takes the form dθ/dp ∝ 1/√(p(1 − p)) (Eq. (25)), which yields Malus' law p(θ) = cos²(m(θ − θ₀)/2) with m ∈ Z. Summhammer does not employ information theory to quantify 'knowledge about a physical quantity', but instead utilizes the statistical uncertainty associated with the quantity. However, viewed from the Bayesian perspective, if we assume that the prior distribution of the physical quantity θ is uniform, the relationship between θ and p in Eq. (25) implies that the prior distribution of the probability follows Jeffreys' binomial prior, Pr(p) ∝ 1/√(p(1 − p)). Thus, in the large N approximation, Summhammer's result can be interpreted to mean that the prior associated with the probability of a uniformly-distributed physical quantity must adhere to Jeffreys' binomial prior.

Goyal [9] introduces an asymptotic Principle of Information Gain (which differs from ours), which states that "In n interrogations of a N-outcome probabilistic source with an unknown probabilistic vector P, the amount of Shannon-Jaynes information provided by the data about P remains independent of P for all P in the limit as n → ∞." Goyal establishes the equivalence between this principle and Jeffreys' rule, and under his Principle of Information Gain the Jeffreys' multinomial prior is then derived. In the case of a two-outcome probabilistic model, the Jeffreys' multinomial prior reduces to Jeffreys' binomial prior. Asymptotic analysis reveals that the Shannon-Jaynes information is not only independent of the probability vector P but also monotonically increases with the number of interrogations. It is worth noting that the Shannon-Jaynes information can be viewed as the accumulation of differential information gain. This asymptotic result aligns with our findings: under Jeffreys' binomial prior, the differential information gain depends solely on the number of measurements.
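The step from a uniform prior over θ to Jeffreys' prior over p can be checked directly. Below is a small Monte Carlo sketch (our own illustration, not Summhammer's calculation; it assumes m = 1, θ₀ = 0, and θ uniform on [0, π]): sampling θ uniformly and mapping it through the Malus-law relation reproduces the density 1/(π√(p(1 − p))).

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, np.pi, size=1_000_000)     # uniform prior over theta (assumption)
p = np.cos(theta / 2.0) ** 2                        # Malus-law relation with m = 1, theta_0 = 0

# Compare the induced density of p with Jeffreys' binomial prior on interior bins.
edges = np.linspace(0.05, 0.95, 19)                 # 18 bins of width 0.05
hist, _ = np.histogram(p, bins=edges, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
jeffreys = 1.0 / (np.pi * np.sqrt(centres * (1.0 - centres)))
print(np.max(np.abs(hist - jeffreys)))              # small: the induced prior tracks Jeffreys'
```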

Other Information-theoretical motivations of Jeffreys' binomial prior
Wootters [27] introduces a novel perspective on Jeffreys' binomial prior, in which quantum measurement is employed as a communication channel. In this framework, Alice aims to transmit a continuous variable, denoted θ, to Bob. Instead of sending θ to Bob directly, Alice transmits a set of identical coins to Bob, where the probability of getting heads in each toss, p(θ), is a function of θ. Bob's objective is to maximize the information about θ that he can extract from a finite number of tosses. The measure of information used in this context is the mutual information between θ and the total number of heads, n, in N tosses. In the large N approximation, it is found that the optimal weight w, which serves a role akin to the prior probability of p, takes a specific form; remarkably, this prior probability aligns with Jeffreys' binomial prior. A similar procedure can be extended to the Jeffreys' multinomial prior distribution. Wootters' approach shares similarities with the concept of a reference prior, where the selected prior aims to maximize mutual information, which can be viewed as the expected information gain across all data. The outcome is consistent with the reference prior for multinomial data [2], thus revealing another informational interpretation of Jeffreys' prior.
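A rough numerical sketch of this mutual-information criterion follows (our own illustration, not Wootters' derivation, which works with θ and an explicit weight function): it compares the mutual information between the bias p and the number of heads n in N tosses under the uniform prior and under Jeffreys' binomial prior; for moderate N, Jeffreys' prior is expected to give the slightly larger value.

```python
import numpy as np
from scipy.stats import binom

def mutual_information(prior_pdf, N, grid=20001):
    """I(p; n) for n heads in N tosses, with p drawn from the given prior density."""
    p = np.linspace(1e-6, 1.0 - 1e-6, grid)
    dp = p[1] - p[0]
    w = prior_pdf(p)
    w = w / (np.sum(w) * dp)                             # normalise the prior on the grid
    n = np.arange(N + 1)
    like = binom.pmf(n[:, None], N, p[None, :])          # Pr(n | p)
    marg = np.sum(like * w[None, :], axis=1) * dp        # Pr(n)
    kl_terms = np.where(like > 0, like * np.log(like / marg[:, None]), 0.0)
    return np.sum(np.sum(kl_terms, axis=0) * w) * dp     # E_p[ KL(Pr(n|p) || Pr(n)) ]

N = 20
uniform = lambda q: np.ones_like(q)
jeffreys = lambda q: 1.0 / (np.pi * np.sqrt(q * (1.0 - q)))
print(mutual_information(uniform, N), mutual_information(jeffreys, N))
```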

Conclusion
In this paper, we delve into the concept of information gain for two-outcome quantum systems from an operational perspective. We introduce an informational postulate, the Principle of Information Increase, which serves as a criterion for selecting the appropriate measure to quantify the extent of information gained from measurements, and for the choice of prior. Our investigation reveals that the differential information gain emerges as the more physically meaningful measure when compared with the other contender, the relative information gain. The Jeffreys' binomial prior exhibits notable characteristics within the realm of two-outcome quantum systems. Both Summhammer's work and ours demonstrate that, under this prior, the intuitive notion that more data from measurements leads to more knowledge about the system holds true, as confirmed by two distinct methods of quantifying knowledge. Additionally, Wootters shows that this prior enables the communication of maximal information, further highlighting its significance. We also find that Jeffreys' binomial prior displays robustness, although the origin of this robustness remains unexplained. It raises the intriguing question of whether this feature could be extended to multinomial distributions or other types of probabilistic systems. We speculate that there might be another layer of intuitive understanding related to the robustness of Jeffreys' prior.
While this paper primarily focuses on the single-parameter beta distribution prior for binomial distributions, we anticipate that similar results could manifest in multinomial distributions. However, it remains an open question how differential information gain behaves under more general types of priors and distributions.

Figure 6: Robustness of Differential Information Gain (I_diff). The y-axis represents the logarithm of the standard deviation of I_diff over all possible h_N values, while the x-axis depicts various selections of α. A smaller standard deviation indicates that different h_N values lead to similar results, implying greater independence of I_diff from h_N. This independence signifies the robustness of I_diff with respect to the natural variability in h_N, as we consider h_N to be solely determined by nature. The standard deviation, given a fixed N, is notably influenced by α, and there exists an α value at which the dependence on h_N is minimized. This particular α value approaches −0.5 as N increases.

Figure 7: Relative Information Gain (I_rel) over Different Priors. The y-axis represents the value of I_rel, while the x-axis represents N. For each N there are N+1 different values of I_rel. It is important to note that I_rel is consistently positive across these selected priors. Similar to the differential information gain, each graph displays numerous divergent lines. However, the shape of these divergent lines remains remarkably consistent across varying values of α. The majority of these lines fall within the range of I_rel between 0 and 0.2.

Figure 9: Expected Information Gain vs. N for Fixed α. The y-axis represents the value of the expected information gain, while the x-axis represents the value of N. Notably, all expected information gain values are positive. The shapes of the graphs exhibit remarkable similarity, with a limited number of divergent lines. As α increases, the number of divergent lines decreases.

Figure 10: Robustness of Expected Information Gain. The y-axis represents the standard deviation of the expected information gain over all possible h_N, while the x-axis represents the value of N. As N increases, and even for relatively small values of N, the standard deviation tends toward zero for all priors.

Table 1: Fraction of Negatives (FoN) under Selected Priors. A comparison between numerical results and asymptotic results demonstrates their agreement.