A Formal Framework for Knowledge Acquisition: Going beyond Machine Learning

Philosophers frequently define knowledge as justified, true belief. We build a mathematical framework that makes it possible to define learning (an increasing number of true beliefs) and knowledge of an agent in precise ways, by phrasing belief in terms of epistemic probabilities, defined from Bayes' rule. The degree of true belief is quantified by means of active information I^+: a comparison between the degree of belief of the agent and that of a completely ignorant person. Learning has occurred when either the agent's strength of belief in a true proposition has increased in comparison with the ignorant person (I^+ > 0), or the strength of belief in a false proposition has decreased (I^+ < 0). Knowledge additionally requires that learning occurs for the right reason, and in this context we introduce a framework of parallel worlds that correspond to parameters of a statistical model. This makes it possible to interpret learning as a hypothesis test for such a model, whereas knowledge acquisition additionally requires estimation of a true world parameter. Our framework of learning and knowledge acquisition is a hybrid between frequentism and Bayesianism. It can be generalized to a sequential setting, where information and data are updated over time. The theory is illustrated using examples of coin tossing, historical and future events, replication of studies, and causal inference. It can also be used to pinpoint shortcomings of machine learning, where typically learning rather than knowledge acquisition is in focus.


The Present Article
The process by which cognitive agents acquire knowledge is complicated, and has been studied from different perspectives within educational science, psychology, neuroscience, cognitive science, and social science [1]. Philosophers usually distinguish between three types of knowledge [2]: acquaintance knowledge (to get to know other persons), knowledge how (to learn certain skills), and knowledge that (to learn about propositions or facts). Mathematically, acquaintance knowledge has been studied via trees and networks, for instance, in small-world-type models and rumor-spreading models [3][4][5]. Knowledge how has been widely developed in education and psychology, since the middle of the twentieth century, by means of testing and psychometry, using classical statistics [6][7][8].
The purpose of this paper is to formulate knowledge that in mathematical terms. Our starting point is to define knowledge that as justified true belief (JTB), which is generally agreed to constitute at least a necessary condition for such knowledge [9,10]. The primary tools will be the concepts of truth, probability, and information theory. Probabilities, in addition to logic, are used to formulate mechanisms of reasoning in order to define beliefs [11,12]. More specifically, a Bayesian approach with subjective probabilities will be used to quantify rational agents' degrees of belief in a proposition. These subjective probabilities may vary between agents, but since each agent is assumed to be rational, its probabilities satisfy the basic axioms of probability [13]. This is also referred to as the personalistic view of probabilities in [14].
The degree of belief in a proposition is associated with some type of randomness or uncertainty regarding the truth of the proposition. It is helpful in this context to distinguish between ontological randomness (genuine randomness regarding the truth of the proposition) and epistemic randomness (incomplete knowledge about propositions that are either true or false). Here the focus will be on epistemic randomness, and following [15], subjective probabilities are referred to as epistemic probabilities. The epistemic randomness assumption that each proposition has a fixed truth value can be viewed as a frequentist component of our framework.
To use epistemic probabilities in a wider context of knowledge that (subsequently simply referred to as knowledge), we incorporate degrees of beliefs within a framework of parallel worlds in order to define more clearly what JTB means. These parallel worlds correspond to parameters of a statistical model and a second frequentist notion of one parameter being true, whereas the others are counterfactuals [16]. An agent's maximal possible discernment between worlds is described in terms of the σ-algebra G. The agent's degrees of belief are obtained through Bayes' rule from prior belief and data [17], in such a way that it is not possible to discern between worlds beyond the limits set by G.
Learning is associated with increased degrees of true belief, although these beliefs need not be justified. More specifically, the agent's degree of belief in a proposition is compared to that of an ignorant person. This corresponds to a hypothesis test within a frequentist framework, in which the null hypothesis of a proposition being true is tested against the alternative hypothesis that the proposition is false. As a test statistic, we use active information I^+ [18][19][20], which quantifies how much the agent has learned about the truth value of the proposition compared to an ignorant person. In particular, learning has occurred when the agent's degree of belief in a true proposition is larger than that of an ignorant person (I^+ > 0), or when the agent's degree of belief in a false proposition is smaller than that of an ignorant person (I^+ < 0). In either case, G sets a limit on the maximal amount of possible learning. Learning is, however, not sufficient for knowledge acquisition, since the latter concept also requires that the true belief is justified, or has been formed for the right reason. Knowledge acquisition is defined as a learning process in which the agent's degree of belief in the true world increases, corresponding to a more accurate estimate of the true world parameter. Thus, knowledge acquisition goes beyond learning in that it also deals with the "justified" part of the JTB condition. It is related to consistency of a posterior distribution, a notion that is meaningful only within our hybrid frequentist/Bayesian approach.
To the best of our knowledge, the hybrid frequentist/Bayesian approach has only been used in the context of Bayesian asymptotic theory (Section 7.2), but not as a general tool for modeling the distinction between learning and knowledge acquisition. Although the concept of a true world (or the true state of affairs) is used in the context of Bayesian decision theory and its extensions, such as robust Bayesian inference and belief functions based on the Dempster-Shafer theory [21][22][23][24], the goal is then to maximize an expected utility (or to minimize an expected cost) of the agent that makes the decision. In our context, the Bayesian approach is only used to formulate beliefs as posterior distributions, whereas the criteria for learning (probabilities of rejecting a false or true proposition) and knowledge acquisition (consistency) are frequentist. Given that a model with one unique, true world is correct, the frequentist error probability and consistency criteria are objective, since they depend on the true world. No such criteria exist within a purely Bayesian framework.

Illustration 1.
In order to illustrate our approach to modeling learning and knowledge acquisition, we present an example that will be revisited several times later on. A teacher (the agent) wants to evaluate whether a child has learned addition. The teacher gives the student a home assignment test with two-choice answers, one right and one wrong, to measure the proposition S: "The child is expected to score well on the test." In this case, we have a set X = {x_1, x_2, x_3} of three possible worlds. An ignorant student who does not ask for help is expected to have half her questions right and half wrong (x_1). A child who knows addition is expected to get a large fraction of the answers right (x_2). However, there is also a third alternative, in which an ignorant student asks for help and is expected to have a high score for that reason (x_3). Notice in particular that S is true only for the two worlds of the set A = {x_2, x_3}. If the child answers substantially more questions right than wrong, the active information will be positive and the teacher learns S. However, this learning that S is true does not represent knowledge of whether the student knows how to add, since the teacher is not able to distinguish x_2 from x_3. Now, suppose the test has only two questions. In this setting, an ignorant student is expected to have one question right and one wrong. However, it is also highly probable that even a child who does not know his sums well answers both questions correctly. In this case, the teacher has not learned substantially about S (nor attained knowledge of whether the student knows how to add): since the test has only two questions, the teacher cannot exclude any of x_1, x_2, and x_3. The more questions the test has, and if the student scores well, the more certain the teacher becomes that either x_2 or x_3 is true, that is, the more he learns about S.
If the student is also monitored during the exam, alternative x_3 is excluded and the teacher knows that x_2 is true; that is, the teacher not only learns about S, but also acquires knowledge that the student knows how to add.
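The intuition of Illustration 1 is, at bottom, a likelihood-ratio comparison. The following minimal sketch (assuming the success probabilities 0.5 for a guessing student and 0.8 otherwise, the values used in Illustration 2 below) shows why two questions are too few to exclude guessing, while ten questions are much more informative:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of k correct answers out of n, each right with probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Worlds: x_1 = guessing (p = 0.5); x_2 (knows addition) and x_3 (gets help)
# are both assumed to answer each question correctly with p = 0.8.
p_guess, p_skill = 0.5, 0.8

# Two questions, both right: the likelihood ratio against guessing is small,
# so the teacher cannot exclude x_1.
lr_two = binom_pmf(2, 2, p_skill) / binom_pmf(2, 2, p_guess)    # 0.64 / 0.25

# Ten questions, eight right: the evidence against guessing is much stronger.
lr_ten = binom_pmf(8, 10, p_skill) / binom_pmf(8, 10, p_guess)
```

Here `lr_two` is only 2.56, whereas `lr_ten` is close to 6.9, reflecting how a longer test sharpens the teacher's learning about S.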
Each of the following sections contains remarks and illustrations like the previous one. At the end of the paper, a whole section with multiple examples explores in more depth how the model works in practice.

Related Work
Other contributions have been made to the development of a mathematical framework for learning and knowledge acquisition. Hopkins [25] studied the theoretical properties of two different models of learning in games, namely, reinforcement learning and stochastic fictitious play. He developed an equivalence relation between the two under a variety of different scenarios with increasing degrees of structure. Stoica and Strack [26] introduced a stochastic model for acquired knowledge and showed that empirical data fit the estimated outcomes of the model well, using data from student performance in university-level classes. Taylor [27] proposed a model using the notion of concept lattices and the mathematical theory of closure spaces to describe knowledge acquisition and organization. However, none of these works develops the theory from basic concepts of probability and information theory in the way we do here. Our approach also permits important generalizations that cover a wide range of real-life scenarios.

Possible Worlds, Propositions, and Discernment
Consider a collection X of possible worlds, of which x_0 ∈ X is the true world, whereas all other worlds x ∈ X \ {x_0} are counterfactuals. We will regard x as a statistical parameter, and the fact that this parameter has a true but unknown value x_0 corresponds to a frequentist assumption. The set X is the parameter space of interest, and it is assumed to be either finite or a bounded and open subset of the Euclidean space R^q of dimension q. Let S be a proposition (or statement), and impose a second frequentist assumption that S is either true or false, although the truth value of S may depend on the world x ∈ X. Define a binary-valued truth function f : X → {0, 1} by f(x) = 1 or 0, depending on whether S is true or not in world x. The set A = {x ∈ X; f(x) = 1} consists of all worlds for which S is a true proposition. Although there is a one-to-one correspondence between f and A, in the sequel it will be convenient to use both notions. The simplest truth scenario of S is one for which S is true in the true world only, i.e.,

f(x) = 1(x = x_0), or equivalently A = {x_0}. (1)

The assumptions that x_0 is unique and f is binary-valued together correspond to a framework of epistemic randomness, where the actual truth value f(x_0) of S is either 0 or 1. S is referred to as falsifiable [28] if it is logically possible (in principle) to find a data set D implying that the truth value of S is 0, or equivalently, that none of the worlds in A is true. It is possible, though, to falsify S without knowing x_0.

Degrees of Beliefs and Sigma Algebras
Let (X, F) be a measurable space. When X is finite, F consists of all subsets of X (i.e., F = 2^X); otherwise, F is the class of Borel sets. The Bayesian part of our approach is to quantify an agent's belief about which world is true by means of an epistemic probability measure P on the measurable space (X, F), whereas the beliefs of an ignorant person follow another probability measure P_0. It is often assumed that P_0 is the uniform probability measure

P_0(dx) = dx / |X|, (2)

which maximizes entropy among all probability measures on (X, F), where | · | refers to the cardinality for finite X and to the Lebesgue measure for continuous X. Then, (2) corresponds to a maximal amount of ignorance about which possible world is true [29]. Sometimes (as in Example 5 below) some general background knowledge is assumed also for the ignorant person, so that P_0 differs from (2). The agent's and the ignorant person's strengths of belief in S are quantified by P(A) and P_0(A), respectively. Following [15], it is helpful to interpret P(A) and P_0(A) as the agent's and the ignorant person's predictions of the physical probability f(x_0) ∈ {0, 1} of S. Whereas P and P_0 involve epistemic uncertainty, the physical probability is an indicator for the real (physical) event that S is true or not.
When an agent's belief P is formed, it is assumed that any information accessible to him, beyond that of the ignorant person, belongs to a sub-σ-algebra G ⊂ F. This means that the agent has no more ability than the ignorant person to discern events, if this discernment requires that he considers events in F that do not belong to G. Mathematically, this corresponds to a requirement (3) that must hold for all F-measurable functions g : X → R and all σ-algebras G' such that G ⊆ G' ⊆ F. It is assumed, on the left-hand side of (3), that g is a random variable defined on the probability space (X, F, P), whereas on the right-hand side of (3) g is defined on the probability space (X, F, P_0). It follows from (3) that G sets the limit on the agent's possibility to form propositions about which world is true. Therefore, G is referred to as the agent's maximal possible discernment about which world is true. An unconditional version (4) of this identity follows from (3). The minimal amount of discernment corresponds to the trivial σ-algebra G_0 = {∅, X}. Whenever (3) holds with G = G_0, necessarily P = P_0; this corresponds to removing the outer expectation on the right-hand side of (4).

Remark 1.
Suppose there exists an oracle or omniscient agent O that is able to discern between all possible worlds and also knows x_0. Mathematically, this discernment requirement means that O has knowledge about all sets in a σ-algebra F that corresponds to a maximal amount of discernment between possible worlds. We will assume that f is measurable with respect to F, so that A is measurable (i.e., A ∈ F). Knowledge of F is, however, not sufficient for knowing A, since A may involve x_0, as in (1). By this we mean that if the agent knows F, and if A involves x_0, then there are several candidates for A, and the agent does not know a priori which of these candidates is the actual A. However, since O knows F and x_0, he also knows A. It follows that O knows that S is true for all worlds in (the actual) A, and that S is false for all worlds outside of (the actual) A. That is, the oracle knows for which possible worlds the proposition S is true.
As mentioned in Remark 1, the truth function f is measurable with respect to the maximal σ-algebra F. However, depending on how G is constructed, and on whether A involves x_0 or not, the set A may or may not be known to the agent. Therefore, when A involves x_0, the agent may not be able to compute P_0(A) and P(A) himself. Although he is able to compute P_0(B) and P(B) for all B ∈ F, since he does not know x_0, he does not know which of these sets B equals A. Therefore, he does not know P(A) and P_0(A), unless P(B) = P(A) and P_0(B) = P_0(A), respectively, for all B that are among the agent's candidates for the set A. As will be seen from the examples of Section 8, it is often helpful (but not necessary) to construct G as the σ-algebra generated by a random variable Y whose domain is X (i.e., G = σ(Y)). This means that Y determines the collection G of subsets of X for which the agent is free to form beliefs beyond those of the ignorant person. Typically, Y highlights the way in which information is lost by going from F to G; for instance, G = σ(Y) may be the σ-algebra obtained from a quantization procedure with accuracy δ.

Bayes' Rule and Posterior Probabilities
A Bayesian approach will be used to define the agent's degree of belief P. To this end, we regard x ∈ X as a parameter of a statistical model and assume that the agent has access to data d ∈ D. The agent assumes that (x, d) is an observation of a random variable (X, D) : Ω → X × D defined on some sample space Ω. The joint distribution of the parameter X and data D, according to the agent's beliefs, is dQ(x, D) = dP_0(x)L(D|x)dD. This is a probability measure on subsets of X × D, with prior distribution P_0 of the parameter X, and with a conditional distribution L(D|x) = dQ(D|x)/dD that corresponds to the likelihood of data D. A posterior probability of A is formed by updating the prior distribution P_0 based on data D. It is assumed that the likelihood x → L(D|x) is measurable with respect to G, so that data conform with the agent's maximal possible discernment between possible worlds. The likelihood function x → L(D|x) includes the agent's interpretation of D. Although this interpretation may involve a subjective part, it is still assumed that the agent is not willing to speculate about possible worlds beyond the limits set by G. That is, whenever the agent discerns events in F beyond the limits set by G, this discernment is the same as for an ignorant person.
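For a finite set of worlds, the posterior update just described reduces to multiplying the prior by the likelihood and normalizing. A minimal sketch (the function name is ours, and the two-world example is purely illustrative):

```python
def bayes_update(prior, likelihood):
    """Bayes' rule on a finite set of worlds:
    P(x | D) = P_0(x) L(D | x) / sum_x' P_0(x') L(D | x')."""
    joint = {x: prior[x] * likelihood[x] for x in prior}
    norm = sum(joint.values())
    return {x: v / norm for x, v in joint.items()}

# A toy update: two worlds, the data being three times as likely under 'a'.
post = bayes_update({'a': 0.5, 'b': 0.5}, {'a': 0.3, 'b': 0.1})
```

With a uniform prior, the posterior is simply the normalized likelihood, so here `post['a']` is 0.75.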

Remark 2.
To account for the possibility that the agent still speculates beyond the limits set by external data, G = σ(G_ext, G_int) could be defined as the smallest σ-algebra containing the σ-algebras G_ext and G_int that originate from external data D_ext and internal data D_int (the agent's internal experiences, such as dreams and revelations), respectively. Note, however, that x → L(D|x) is subjective even when internal data are absent, since agents might interpret external data in different ways, due to the way in which they perceive such data and incorporate previous life experience.
From Bayes' rule we find that the posterior distribution satisfies

dP(x) = dP_0(x) L(D|x) / ∫_X L(D|x') dP_0(x'). (7)

A couple of additional reasons reinforce the subjectivity of P: the prior P_0 might be subjective, and acquisition of data D might vary between agents [30]. Additionally, acquisition of data D will not necessarily make P more concentrated around the true world x_0, since it is possible that the data themselves are biased or that the agent interprets the data in a sub-optimal way.
Since the likelihood function is measurable with respect to G, it follows from (4) that the agent's belief P, after having observed D, does not lead to a different discernment between possible worlds beyond G than for an ignorant person. Given G, together with an unlimited amount of unbiased data that the agent interprets correctly, the G-optimal choice of P is

P(B) = P_0(B | G)(x_0), B ∈ F, (8)

the conditional distribution of P_0 given G, evaluated at the true world x_0. Equations (4) and (8) uniquely define the G-optimal choice of P. Whenever G ⊂ F is a proper subset of the maximal σ-algebra F, the measure P in (8) is not the same thing as a point mass δ_{x_0} at x_0. On the other hand, for an oracle with a maximal amount of knowledge about which world is true, G = F, and (8) reduces to a point mass at the true world, i.e.,

P(B) = 1(x_0 ∈ B), B ∈ F. (9)

Remark 3.
An extreme example of biased beliefs is a true-world-excluding probability measure, with support that does not include x_0:

x_0 ∉ supp(P). (10)

Another example is a correct-proposition-excluding probability measure, with support that excludes all worlds x with a correct value f(x) = f(x_0) of S:

supp(P) ∩ {x ∈ X; f(x) = f(x_0)} = ∅. (11)

Illustration 2 (Continuation of Illustration 1).
Suppose data D ∈ D = {0, 1, . . . , 10} are available to the teacher (the agent) in terms of the number of correct answers on a home assignment test with 10 questions. The prior P_0(x_i) = 1/3 is uniform on X = {x_1, x_2, x_3}, whereas data D|x_i ∼ Bin(10, π_i) have a binomial distribution with probabilities π_1 = 0.5 and π_2 = π_3 = 0.8 of answering each question correctly, for a student who either guesses or has math skills/asks for help. Let d be the observed value of D. Since the data have the same likelihood (L(d|x_2) = L(d|x_3)) for a student who scores well, regardless of whether he knows how to add or gets help, it is clear that the posterior distribution satisfies P(x_2) = P(x_3). Since the teacher cannot distinguish x_2 from x_3, his σ-algebra

G = {∅, {x_1}, {x_2, x_3}, X} (12)

has only four elements, whereas the full σ-algebra F = 2^X consists of all eight subsets of X.
Note that Equation (3) stipulates that the teacher cannot discern between the elements of X, beyond the limits set by G, better than the ignorant person. In order to verify (3), since there is no σ-algebra strictly between G and F, we only need to check this equation for G' = G. To this end, let g : X → R be a real-valued function. Then, since P(x_2) = P(x_3), the required identity follows, in agreement with (3).
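The numbers in Illustration 2 can be reproduced directly. A short computation (assuming, for the sake of the example, an observed score of d = 8):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes out of n with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

worlds = ['x1', 'x2', 'x3']
pi = {'x1': 0.5, 'x2': 0.8, 'x3': 0.8}   # per-question success probabilities
prior = {x: 1 / 3 for x in worlds}       # uniform prior P_0
n, d = 10, 8                             # observed: 8 of 10 answers correct

# Posterior via Bayes' rule (7); L(d | x_2) = L(d | x_3) forces P(x_2) = P(x_3).
lik = {x: binom_pmf(d, n, pi[x]) for x in worlds}
norm = sum(prior[x] * lik[x] for x in worlds)
post = {x: prior[x] * lik[x] / norm for x in worlds}
```

The output illustrates the limit set by G: the posterior assigns equal mass (about 0.466 each) to x_2 and x_3, and only about 0.068 to x_1.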

Illustration 3.
During the Russo-Japanese War, Czar Nicholas II was convinced that Russia would easily defeat Japan [31]. His own biases (he considered the Japanese weak and the Russians superior) and the partial information he received from his advisors blinded him to reality. In the end, the Russian forces were heavily beaten by the Japanese. In this scenario, the proposition S is "Russia will beat Japan", X consists of all possible future scenarios, and f(x) = 1 for those scenarios x ∈ X in which Russia would win the war. As history reveals, f(x_0) = 0. The information the czar received from his advisors was D, and we know it was heavily biased. Nicholas II adopted (very subjectively!) a correct-proposition-excluding probability measure, as in (11), because he did not even consider the possibility of Russia being defeated. The main reason was a dramatically poor assessment of the likelihood L(D|x), for x ∈ X, on top of a prior P_0 that put low probability on scenarios x ∈ A^c. Nicholas II's verdict was P(A) ≈ 1.

Expected Posterior Beliefs
Since D is random, so is P. For this reason, the expected posterior distribution

P̄(B) = E_{x_0}[P(B)] = ∫_D P(B | D = d) L(d|x_0) dd, B ∈ F, (13)

will be used occasionally, with an expectation corresponding to averaging over all possible data sets D according to their distribution L(·|x_0) in the true world. Consequently, P̄(A) represents the agent's expected belief in S in the true world x_0. Note in particular that, in contrast to the posterior P, the expected posterior P̄ is not a purely Bayesian notion, since it depends on x_0.
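For the setting of Illustration 2 the average in (13) is a finite sum, so P̄(A) can be computed exactly. A sketch, assuming the true world is x_0 = x_2:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

worlds = ['x1', 'x2', 'x3']
pi = {'x1': 0.5, 'x2': 0.8, 'x3': 0.8}
n = 10
A = {'x2', 'x3'}      # worlds in which S is true
x0 = 'x2'             # assumed true world

def posterior_A(d):
    """P(A | D = d) under the uniform prior P_0 (the prior cancels)."""
    lik = {x: binom_pmf(d, n, pi[x]) for x in worlds}
    return sum(lik[x] for x in A) / sum(lik.values())

# P̄(A): average of P(A | D = d) over d ~ L(. | x_0) = Bin(10, 0.8).
expected_post_A = sum(binom_pmf(d, n, pi[x0]) * posterior_A(d)
                      for d in range(n + 1))
```

Since data generated in the true world tend to favor A, the expected posterior belief exceeds the prior P_0(A) = 2/3 without reaching 1.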

Active Information for Quantifying the Amount of Learning
The active information (AIN) of an event B is

I^+(B) = log[P(B) / P_0(B)]. (14)

In particular, I^+(A) quantifies how much an agent has learned about whether S is true or not compared to an ignorant person. By inserting (7) into (14), we find that the AIN

I^+(A) = log { [∫_A L(D|x) dP_0(x) / P_0(A)] / ∫_X L(D|x) dP_0(x) } (15)

is the logarithm of the ratio between how likely it is to observe the data when S holds, and how likely the data are when no assumption regarding S is made (see also [32]). The corresponding AIN for expected degrees of belief is

Ī^+(A) = log[P̄(A) / P_0(A)]. (16)
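Definition (14) is a one-liner; the sketch below (the helper name is ours) uses the convention I^+ = −∞ when P(B) = 0, matching the full-misbelief case discussed in Definition 1:

```python
from math import log

def active_information(p_B, p0_B):
    """I^+(B) = log[P(B) / P_0(B)] (natural log); -inf when P(B) = 0."""
    if p_B == 0:
        return float('-inf')
    return log(p_B / p0_B)

# Illustration 2 with d = 8: P(A) ≈ 0.932 versus P_0(A) = 2/3 gives I^+ > 0,
# i.e., the teacher has learned about S.
ain_A = active_information(0.932, 2 / 3)
```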

Definition 1 (Learning).
Learning about S has occurred (conditionally on observed D) if the probability measure P either satisfies I^+(A) > 0 when x_0 ∈ A or I^+(A) < 0 when x_0 ∉ A. In particular, full learning corresponds to I^+(A) = −log P_0(A) when x_0 ∈ A and to I^+(A) = −∞ when x_0 ∉ A. Learning is expected to occur if the probability measure P̄ is such that Ī^+(A) > 0 when x_0 ∈ A or Ī^+(A) < 0 when x_0 ∉ A. In particular, full learning is expected to occur if Ī^+(A) = −log P_0(A) when x_0 ∈ A or Ī^+(A) = −∞ when x_0 ∉ A.

Remark 4.
Two extreme scenarios for the active information, when x_0 ∈ A, are

I^+(A) = −log P_0(A),
I^+(A) = −∞. (17)

According to Definition 1, the upper part of (17) represents full learning, that is, P(A) = 1, whereas the lower part corresponds to a maximal amount of false belief about S when x_0 ∈ A, that is, P(A) = 0.

Remark 5.
Suppose S is a proposition that a certain entity or machine functions; then −log P_0(A) is the functional information associated with the event A of observing such a functioning entity [33][34][35]. In our context, functional information corresponds to the maximal amount of learning about S when the machine works (f(x_0) = 1).

Learning as Hypothesis Testing
It is possible to view the AIN in (15) as a test statistic for choosing between the two statistical hypotheses

H_0: x_0 ∈ A versus H_1: x_0 ∉ A, (18)

with the null hypothesis H_0 being rejected (conditionally on observed D) when

I^+(A) ≤ I (19)

for some threshold I [36][37][38]. Typically, this threshold represents a lower bound of what is considered to be a significant amount of learning when S is true. Note in particular that the framework of the hypothesis test, (18) and (19), is frequentist, although we use Bayesian tools (the prior and posterior distributions) to define the test statistic. In order to introduce performance measures of how much the agent has learnt, let Pr_{x_0} refer to probabilities when data D ∼ L(·|x_0) are generated according to what one expects in the true world. The type I and II errors of the test (18) and (19) are then defined as

α(x_0) = Pr_{x_0}(I^+(A) ≤ I), x_0 ∈ A, (20)
β(x_0) = Pr_{x_0}(I^+(A) > I), x_0 ∉ A, (21)

respectively. Both error probabilities are functions of x_0, and they quantify how much the agent has learnt about the truth (cf. Figure 1, in which the error probabilities of type I and II correspond to the areas under the curves in panels (b) and (a) to the left and right of the threshold, respectively).
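For the setting of Illustration 2, the error probabilities (20) and (21) can be evaluated exactly by summing over all possible scores d. A sketch, with the illustrative threshold I = 0 (reject H_0 whenever no learning about S has occurred):

```python
from math import comb, log

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

pi = {'x1': 0.5, 'x2': 0.8, 'x3': 0.8}
n, P0_A = 10, 2 / 3
I_threshold = 0.0                 # reject H_0 when I^+(A) <= 0

def ain_A(d):
    """I^+(A) after observing d correct answers, under the uniform prior."""
    lik = {x: binom_pmf(d, n, p) for x, p in pi.items()}
    post_A = (lik['x2'] + lik['x3']) / sum(lik.values())
    return log(post_A / P0_A)

# Type I error (20): H_0 rejected although true, with x_0 = x_2 in A.
type_I = sum(binom_pmf(d, n, pi['x2'])
             for d in range(n + 1) if ain_A(d) <= I_threshold)

# Type II error (21): H_0 accepted although false, with x_0 = x_1 not in A.
type_II = sum(binom_pmf(d, n, pi['x1'])
              for d in range(n + 1) if ain_A(d) > I_threshold)
```

With this threshold the test rejects H_0 exactly when d ≤ 6, giving a type I error of about 0.121 and a type II error of 176/1024 ≈ 0.172.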

The Bayesian Approach to Learning
Within a Bayesian framework, we think of H_0 and H_1 as two different models, A and A^c, that represent a subdivision of the parameter space into two disjoint subsets. The posterior odds factor into a product of the prior odds and the Bayes factor. Hypothesis H_1 is chosen whenever

P(A^c) / P(A) ≥ r (22)

for some threshold r. If the cost of drawing a parameter X ∼ P from A (A^c) is C_0 (C_1) when H_1 (H_0) is chosen, the optimal Bayesian decision rule corresponds to r = C_0/C_1. A little algebra reveals that the AIN is a monotone decreasing function of the posterior odds. From this, it follows that the frequentist test (19), with AIN as test statistic, is equivalent to the Bayesian test (22), whenever I = −log[P_0(A)(1 + r)]. However, the interpretations of the two tests differ. Whereas the aim of the Bayesian decision rule is to minimize an expected cost (or maximize an expected reward/utility), the aim of the frequentist test is to keep the error probabilities of type I and II low.
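The stated equivalence between the Bayesian rule (22) and the AIN rule (19) with I = −log[P_0(A)(1 + r)] can be checked numerically; the values of r and P_0(A) below are purely illustrative:

```python
from math import log

p0_A = 2 / 3                 # P_0(A), the ignorant person's belief in A
r = 1.5                      # Bayesian threshold on the posterior odds
I = -log(p0_A * (1 + r))     # the equivalent AIN threshold

decisions = []
for p_A in (0.1, 0.3, 0.5, 0.7, 0.9):
    odds = (1 - p_A) / p_A                 # posterior odds P(A^c)/P(A)
    ain = log(p_A / p0_A)                  # I^+(A)
    # Bayesian decision (odds >= r) versus frequentist rejection (ain <= I).
    decisions.append((odds >= r, ain <= I))
```

Both columns of `decisions` agree for every posterior belief p_A, since I^+(A) = −log[(1 + odds) P_0(A)] is monotone decreasing in the posterior odds.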

Test Statistic When x 0 Is Unknown
Recall that the agent may or may not know the set A. In the latter case, the agent cannot determine the value of the test statistic I^+(A), and hence he cannot test between H_0 and H_1 himself. This happens, for instance, for the truth function (1), with A = {x_0}, since the AIN I^+(A) = log[p(x_0)/p_0(x_0)] then involves the unknown x_0, with p(x)dx = dP(x) and p_0(x)dx = dP_0(x). Although I^+(A) is not known for this particular choice of A, the agent may still use the posterior distribution (7) in order to compute the expected value (conditionally on observed D) of the test statistic according to his posterior beliefs,

E_P{log[p(X)/p_0(X)]} = ∫_X log[p(x)/p_0(x)] dP(x). (23)

Note that (23) equals the Kullback-Leibler divergence D_KL(P ∥ P_0) between P and P_0, or the difference between the cross entropy H(P, P_0) of P and P_0 and the entropy H(P) of P. If we also take the randomness of the data set D into account, and make use of (7), it follows that the expected AIN, for the same choice of A, equals

E{E_P[log p(X)/p_0(X)]} = ∫∫ q(x, d) log { q(x, d) / [p_0(x) q_D(d)] } dx dd = I(X; D), (24)

the mutual information between X and D, when (X, D) ∼ Q vary jointly according to the agent's Bayesian expectations, with q(x, d) = dQ(x, d)/d(x, d) and q_D the marginal density of D.
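For a finite world set, the posterior-expected AIN (23) is just the Kullback-Leibler divergence, computable in a few lines (the posterior values below are the approximate ones from Illustration 2 with d = 8):

```python
from math import log

def kl_divergence(p, p0):
    """D_KL(P || P_0) = sum_x p(x) log[p(x)/p0(x)]: the posterior-expected
    AIN for the truth function A = {x_0} when x_0 is unknown."""
    return sum(px * log(px / p0[x]) for x, px in p.items() if px > 0)

p0 = {'x1': 1 / 3, 'x2': 1 / 3, 'x3': 1 / 3}    # uniform prior P_0
p = {'x1': 0.068, 'x2': 0.466, 'x3': 0.466}     # approx. posterior, d = 8
expected_ain = kl_divergence(p, p0)
```

As expected, the divergence is strictly positive once the data have moved the posterior away from the prior, and zero when P = P_0.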

Knowledge Acquisition Goes beyond Learning
As mentioned in the introduction, knowledge acquisition goes beyond learning, since it also requires that a true belief in S is justified (see Figure 2 for an illustration). Figure 2. Illustration of the difference between learning and knowledge acquisition for a scenario with a set of worlds X = [0, 1] and a statement S whose truth function x → f(x) is depicted to the left (a) and right (b). It is assumed that S is true (x_0 ∈ A), and that the degrees of belief P_0 of an ignorant person correspond to a uniform distribution on X. The filled histograms correspond to the density functions p(x)dx = dP(x) of two agents' beliefs. The agent to the left (a) has learnt about S but not acquired knowledge, since x_0 does not belong to the support of P. The agent to the right (b) has not only learnt about S, but also acquired knowledge, since his belief is justified, corresponding to a distribution P that is more concentrated around the true world x_0 than that of the ignorant person. Hence, the JTB condition is satisfied for the agent to the right, but not for the agent to the left.
It is possible, in principle, for an agent whose probability measure P corresponds to a smaller belief in x_0 than that of the ignorant person to have a value of I^+ anywhere in the range [−∞, −log P_0(A)] when S is true (i.e., when x_0 ∈ A). One can think of a case in which the agent believes in S with certainty (P(A) = 1) because supp(P) ⊂ A; but this belief in S is held for the wrong reason if, for instance, the agent does not believe in the true world, i.e., if (10) holds, corresponding to the left part of Figure 2. Another, less extreme situation occurs when the agent has a higher belief in A than the ignorant person but has lost some (but not all) confidence in the true world relative to the ignorant person; in this case, the agent has not acquired new knowledge about the true world compared to the ignorant person, although he still has learned about S and has some knowledge about the true world.

A Formal Definition of Knowledge Acquisition
Knowledge acquisition is formulated using tools from statistical estimation theory. Loosely speaking, the agent acquires knowledge, based on data D, if the posterior distribution P becomes more concentrated around x_0, compared to that of an ignorant person. By this we mean that each closed ball centered at x_0 has a probability that is at least as large under P as under P_0. Closed balls require, in turn, the concept of a metric or distance, that is, a function d : X × X → [0, ∞). Examples of metrics include the Euclidean distance d(x, y) = |x − y| when X ⊂ R^q, and the discrete metric d(x, y) = 1(x ≠ y) when X is finite. Equipped with a metric on X, knowledge acquisition is now defined:

Definition 2 (Knowledge acquisition and full knowledge).
Let B_ε(x_0) = {x ∈ X : d(x, x_0) ≤ ε} be the closed ball of radius ε that is centered at x_0 with respect to some metric d. We say that an agent has acquired knowledge about S (conditionally on observed D) if learning has occurred according to Definition 1, and, in order for this learning to be justified, the following two properties are satisfied for all ε > 0:

P(B_ε(x_0)) > 0 (25)

and

P(B_ε(x_0)) ≥ P_0(B_ε(x_0)), (26)

with strict inequality in (26) for at least one ε > 0. Full knowledge about S requires that (9) holds; i.e., that the agent with certainty believes that the true world x_0 is true. The agent is expected to acquire knowledge about S if learning is expected to occur, according to Definition 1, and if (25) and (26) hold with P̄ instead of P. The agent is expected to acquire full knowledge about S if (9) holds with P̄ instead of P.
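Conditions (25) and (26) can be checked mechanically on a finite world set. A sketch (the function name and the numerical tolerance are ours; it checks only the justification part of Definition 2, not the learning condition of Definition 1), using the discrete metric and the approximate posterior from Illustration 2:

```python
TOL = 1e-12

def justified_concentration(worlds, dist, x0, P, P0):
    """Check conditions (25)-(26) on a finite set of worlds: every closed
    ball B_eps(x0) must satisfy P(B_eps) >= P0(B_eps), with strict
    inequality for at least one eps, and x0 must lie in the support of P."""
    strict = False
    for eps in sorted({dist(x, x0) for x in worlds}):
        ball = [x for x in worlds if dist(x, x0) <= eps]
        pb = sum(P[x] for x in ball)
        p0b = sum(P0[x] for x in ball)
        if pb < p0b - TOL:            # (26) violated
            return False
        if pb > p0b + TOL:
            strict = True
    return strict and P[x0] > 0       # (25): x0 in the support of P

worlds = ['x1', 'x2', 'x3']
discrete = lambda x, y: 0.0 if x == y else 1.0   # discrete metric
P0 = {x: 1 / 3 for x in worlds}                  # ignorant person
P = {'x1': 0.068, 'x2': 0.466, 'x3': 0.466}      # approx. posterior, d = 8
```

With x_0 = x_2, the ball of radius 0 has P-mass 0.466 > 1/3, so the concentration conditions hold; a true-world-excluding measure as in (10) fails the check immediately.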
Several remarks are in order.

Remark 6.
Property (25) ensures that x_0 is in the support of P ([39], p. 20). When P_0 is the uniform distribution (2), (25) follows from (26). Property (26) is equivalent to I^+(B_ε(x_0)) ≥ 0 when P_0(B_ε(x_0)) > 0. In this case, the requirement that (26) is satisfied with strict inequality for some ε = ε* > 0 is equivalent to learning the proposition S_{ε*}: "The distance of a world to the true world x_0 is less than or equal to ε*," corresponding to a truth function

f_ε(x) = 1(d(x, x_0) ≤ ε). (27)

Since the agent does not know x_0, neither f_{ε*} nor A_{ε*} = {x ∈ X; f_{ε*}(x) = 1} is known to him, even if he is able to discern between all possible worlds. If f_{ε*} differs from the original truth function f, learning of S_{ε*} can be viewed as meta-learning. Note also that A_0 = {x_0} corresponds to the set in (1).

Remark 7.
Suppose the truth function used to define learning and knowledge acquisition satisfies (27), i.e., f = f_ε for some ε ≥ 0. Then (25) and (26) are sufficient for knowledge acquisition, since they imply that learning of S = S_ε has occurred, according to Definition 1. Although knowledge acquisition in general requires more than learning, the two concepts are thus equivalent for a truth function (1). Indeed, in this case it is not possible to learn whether S = S_0 is true or not for the wrong reason.

Remark 8.
Recall from Definition 1 that an agent has fully learnt S when

P(A) = 1 if S is true, and P(A) = 0 if S is false. (28)

For a rational agent, the lower part of (28) should hold when data D falsifies S. In general, (28) is a necessary but not sufficient condition for full knowledge. Indeed, it follows from (9) that, for a person to have full knowledge, P(B) = 1(x_0 ∈ B) must hold for all B ∈ F, not only for the set A of worlds for which S is true.

Remark 9.
Suppose a distance measure d(P, Q) between probability distributions on (X, F) is defined. This gives rise to a different definition of knowledge acquisition, whereby the agent acquires knowledge if he has learnt about S and additionally the distance d(P, δ_{x_0}) between the posterior and a point mass at the true world is smaller than d(P_0, δ_{x_0}). One choice is the Wasserstein distance d(P, Q) = min E[d(X_1, X_2)], where the minimum is taken over all random vectors (X_1, X_2) whose marginals have distributions P and Q, respectively. Note in particular that the KL choice of distance yields d(Q, δ_{x_0}) = −log Q({x_0}). The corresponding notion of knowledge acquisition is weaker than in Definition 2, requiring (25) and (26) to hold only for ε = 0.
Illustration 4 (Continuation of Illustration 1). To check whether learning or knowledge acquisition has occurred, according to Definitions 1 and 2, for the student who takes the math home assignment, x_0 must be known. The reader may think of an instructor with full information (an F-optimal measure according to (9)) who checks whether a pupil has learned and acquired knowledge or not. However, in Illustration 1 it is the teacher who is the pupil, and who learns and acquires knowledge about the skills of a math student. In this context, the instructor is a supervisor of the teacher who knows whether the math student is able to add (x_0 = x_2) or not, and in the latter case whether the student gets help (x_0 = x_3) or not (x_0 = x_1). Whereas the instructor's σ-algebra is F, the teacher's σ-algebra G in (12) does not make it possible to discern between x_2 and x_3. Suppose x_0 = x_2. Then, no matter how many questions the home exam has, as long as the teacher does not get information from the instructor on whether the student solved the home exam without help or not, the teacher learns that S is true (since the student scores well), but he will never acquire full knowledge that the student knows how to add, since P(x_0) = P(x_2) = P(x_3) ≤ 0.5 < 1.

Learning and Knowledge Acquisition Processes
The previous two sections dealt with learning and knowledge acquisition of a static belief P, corresponding to an agent who is able to discern between worlds according to one sub-σ-algebra G of F , and who has access to one data set D. The setting is now extended to consider an agent who is exposed to an increasing amount of information about (or discernment between) the possible worlds in X , and increasingly larger data sets.

The Process of Discernment and Data Collection
Mathematically, an increased ability to discern between possible worlds is expressed as a sequence of σ-algebras

G_1 ⊆ G_2 ⊆ · · · ⊆ G_n ⊆ F. (29)

Typically, G_k is generated by a random variable Y_k whose domain is X, for k = 1, . . . , n.
The σ-algebras in (29) are associated with increasingly larger data sets D_1, . . . , D_n, with D_k ∈ D_k. Let dQ_k(x, D_k) = dP_0(x)L(D_k | x)dD_k refer to the joint distribution of the parameter and data in step k, such that the likelihood function x → L(D_k | x) is G_k-measurable. This implies that an agent who interprets data D_k according to this likelihood function has beliefs (represented by the posterior probability measure P_k(·) = Q_k(· | D_k)) that correspond to not being able to discern events outside of G_k better than an ignorant person.
Mathematically, this is phrased as a requirement that holds for all F-measurable functions g : X → R and σ-algebras G'_k such that G_k ⊆ G'_k ⊆ F, for k = 1, . . . , n. The collection of pairs (D_1, G_1), . . . , (D_n, G_n) is referred to as a discernment and data collection process, and I+_k denotes the active information after k steps of this process. Let P̄_k(·) = E_{x_0}[P_k(· | D_k)] refer to the expected degrees of belief after k steps of the information and data collection process, when data D_k ∼ L(· | x_0) vary according to what one expects in the true world, and let Ī+_k denote the corresponding active information. In the following sections we will use the sequences I+_1, . . . , I+_n and P_1, . . . , P_n of AINs and posterior beliefs in order to define different notions of learning and knowledge acquisition.
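A small numerical sketch (with hypothetical worlds and likelihood values) of why a G_k-measurable likelihood cannot improve discernment beyond G_k: when x → L(D_k | x) is constant on each cell of the partition generating G_k, the posterior preserves the prior's relative weights within every cell.

```python
# Sketch: a likelihood that is measurable with respect to a coarse
# partition cannot improve discernment within its cells. Worlds
# w1..w4 are hypothetical; G_k is generated by the partition
# {{w1, w2}, {w3, w4}}, so L(D | x) is constant on each cell.

prior = {"w1": 0.4, "w2": 0.1, "w3": 0.3, "w4": 0.2}
lik   = {"w1": 2.0, "w2": 2.0, "w3": 0.5, "w4": 0.5}   # constant per cell

Z = sum(prior[w] * lik[w] for w in prior)               # normalizing constant
post = {w: prior[w] * lik[w] / Z for w in prior}

# Within a cell, posterior odds equal prior odds: the agent does not
# discern w1 from w2 any better than the ignorant person does.
print(round(post["w1"] / post["w2"], 6), round(prior["w1"] / prior["w2"], 6))
```

The mass may shift between cells (here towards {w1, w2}), which is exactly the discernment that G_k permits.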

Strong Learning and Knowledge Acquisition
Definition 3 (Strong learning). The probability measures P_1, . . . , P_n, obtained from the discernment and data collection process, represent a learning process in the strong sense (conditionally on observed D_1, . . . , D_n) if the AIN sequence satisfies (33); that is, if I+_1(A), . . . , I+_n(A) is non-decreasing when S is true (non-increasing when S is false), with at least one strict inequality. Learning is expected to occur, in the strong sense, if (33) holds with Ī+_1(A), . . . , Ī+_n(A) instead of I+_1(A), . . . , I+_n(A).
Definition 4 (Strong knowledge acquisition). With B_ε(x_0) as in Definition 2, the learning process is knowledge acquiring in the strong sense (conditionally on observed D_1, . . . , D_n) if, in addition to (33), the learning process is justified, so that for all ε > 0 the sequence

P_0(B_ε(x_0)), P_1(B_ε(x_0)), . . . , P_n(B_ε(x_0)) is non-decreasing, (34)

with strict inequality for at least one step of (34) and for at least one ε > 0. Knowledge acquisition is expected to occur, in the strong sense, if learning is expected to occur in the strong sense, according to Definition 3, and additionally (34) holds with P̄_1, . . . , P̄_n instead of P_1, . . . , P_n.
Illustration 5 (Continuation of Illustration 1). Assume the teacher of the math student has a discernment and data collection process (G_1, D_1), (G_2, D_2), where in the first step, G_1 = G and D_1 | x_i ∼ Bin(10, π_i) are obtained from a home assignment with 10 questions (as described in Section 3.2). Suppose the student knows how to add (x_0 = x_2). It can be seen that (35) holds whenever 7 ≤ d_1 ≤ 10. Assume that in a second step the teacher receives information Z_2 ∈ {0, 1} from the instructor on whether the student used external help (Z_2 = 1) or not (Z_2 = 0) during the exam. Let d_2 = (d_1, z_2) refer to the observed data after step 2, so that the likelihood takes an extended form after the second step. If the instructor correctly reports that the student did not use external help (z_2 = 0), it follows that (36) holds. We deduce from (35) and (36) that (37) holds, which suggests that knowledge acquisition has occurred if the categorical space metric d(x_i, x_j) = 1(x_i ≠ x_j) is used on X. However, since P_2(A) < P_1(A), neither learning nor knowledge acquisition in the strong sense has occurred. The reason is that the information from the instructor (that the student has not cheated) makes the teacher less certain as to whether the student is able to score well on the test. On the other hand, if we change the proposition to S: "The student knows how to add," with A = {x_2}, then strong learning and knowledge acquisition have occurred because of (37), since P_k(A) = P_k(x_0) for k = 0, 1, 2.

Weak Learning and Knowledge Acquisition
Learning and knowledge acquisition are often fluctuating processes, and the requirements of Definition 3 are sometimes too strict. Accordingly, weaker versions of learning and knowledge acquisition are introduced.
Definition 5 (Weak learning). Learning in the weak sense has occurred at time n (conditionally on observed D_n) if

I+_n(A) > 0 when S is true, and I+_n(A) < 0 when S is false. (38)

Learning is expected to occur in the weak sense if (38) holds with Ī+_n instead of I+_n.
Definition 6 (Weak knowledge acquisition). Knowledge acquisition in the weak sense occurs (conditionally on observed D_n) if, in addition to the weak learning condition (38), this learning is justified, in the sense that for all ε > 0, P_n(B_ε(x_0)) > 0 and

P_n(B_ε(x_0)) ≥ P_0(B_ε(x_0)), (39)

with strict inequality for at least one ε > 0. Knowledge acquisition is expected to occur in the weak sense if weak learning occurs according to Definition 5 and (39) holds with P̄_n instead of P_n.
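The difference between the strong and weak notions can be sketched as follows (assuming, for illustration, that strong learning of a true S requires a non-decreasing active information sequence, while weak learning only constrains the final value; the AIN values below are hypothetical):

```python
# Sketch: strong vs. weak learning of a true proposition S from a
# sequence of active information values I+_1, ..., I+_n (hypothetical
# numbers), assuming strong learning requires a non-decreasing AIN
# sequence while weak learning only constrains the final value.

def strong_learning(ain):
    """Non-decreasing AIN sequence, ending strictly above the
    ignorant person's level I+ = 0."""
    seq = [0.0] + list(ain)
    monotone = all(a <= b for a, b in zip(seq, seq[1:]))
    return monotone and seq[-1] > 0

def weak_learning(ain):
    """Only the AIN at time n must be positive."""
    return ain[-1] > 0

fluctuating = [0.2, -0.1, 0.3, 0.25, 0.4]   # oscillating beliefs
print(strong_learning(fluctuating), weak_learning(fluctuating))  # False True
```

A fluctuating belief sequence can thus satisfy the weak requirement at time n while violating the strong one.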

Asymptotics
Strong and weak learning (or strong and weak knowledge acquisition) are equivalent for n = 1. The larger n is, the more restrictive strong learning becomes in comparison with weak learning. However, for large n, neither strong nor weak learning (knowledge acquisition) is an entirely satisfactory notion. For this reason, in this section we introduce asymptotic versions of learning and knowledge acquisition, for an agent whose discernment between worlds and collected data sets increase over a long period of time.

Asymptotic Learning and Knowledge Acquisition
In order to define asymptotic learning and knowledge acquisition, as the number of steps n of the discernment and data collection process tends to infinity, we first need to introduce AIN versions of limits. Define

I+_{lim inf}(B) = lim inf_{k→∞} I+_k(B) and I+_{lim sup}(B) = lim sup_{k→∞} I+_k(B), (40)

and when the two limits of (40) agree, we refer to the common value as I+_{lim}(B). Define also

Ī+_{lim inf}(B) = lim inf_{k→∞} Ī+_k(B) and Ī+_{lim sup}(B) = lim sup_{k→∞} Ī+_k(B), (42)

with Ī+_{lim}(B) the common value whenever the two limits of (42) agree. Since I+_{lim}(B) only exists when I+_{lim inf}(B) = I+_{lim sup}(B), and I+_{lim inf}(B) ≤ I+_{lim sup}(B), the following definitions of asymptotic learning and knowledge acquisition are natural:

Definition 7 (Asymptotic learning). Learning occurs asymptotically (conditionally on the observed data {D_k}∞_{k=1}) if (44) holds. Full learning occurs asymptotically (conditionally on {D_k}∞_{k=1}) if (45) holds. Learning is expected to occur asymptotically if (44) holds with Ī+_{lim sup} and Ī+_{lim inf} instead of I+_{lim sup} and I+_{lim inf}, respectively. Full learning is expected to occur asymptotically if (45) holds with Ī+_{lim} instead of I+_{lim}.

Definition 8 (Asymptotic knowledge acquisition).
Knowledge acquisition occurs asymptotically (conditionally on {D_k}∞_{k=1}) if, in addition to the asymptotic learning condition (44), this asymptotic learning is justified, in the sense that (46) holds for every ε > 0, with strict inequality for at least one ε > 0. Full knowledge acquisition occurs asymptotically (conditionally on {D_k}∞_{k=1}) if (45) holds and (47) is satisfied for all ε > 0. If learning is expected to occur asymptotically according to Definition 7, and if (46) holds with Ī+_{lim inf} instead of I+_{lim inf}, then knowledge acquisition is expected to occur asymptotically. Full knowledge acquisition is expected to occur asymptotically if full learning is expected to occur asymptotically according to Definition 7, and if (47) holds with Ī+_{lim} instead of I+_{lim}.

Bayesian Asymptotic Theory
In this subsection we will use Bayesian asymptotic theory in order to quantify, and give conditions for, when asymptotic learning and knowledge acquisition occur. Let Ω be a large space that incorporates prior beliefs and data for all k = 1, 2, . . .. Define X_k : Ω → X as a random variable whose distribution corresponds to the agent's posterior beliefs, based on data set D_k, which itself varies according to another random variable D_k : Ω → D_k with distribution D_k ∼ L(· | x_0). Let Pr_{x_0} be a probability measure on subsets of Ω that induces the distributions X_k | D_k ∼ P_k and X_k ∼ P̄_k, respectively. The following proposition is a consequence of Definitions 7 and 8:

Proposition 1. Suppose full learning is expected to occur asymptotically, in the sense of (45), with Ī+_{lim} instead of I+_{lim}. Then (48) holds as k → ∞. In particular, the type I and II errors of the hypothesis test (18) and (19), with threshold I = log[p/P_0(A)] for some 0 < p < 1, satisfy (49), respectively, as k → ∞. If full knowledge acquisition occurs asymptotically, in the sense of (47), then (50) holds as k → ∞, with →p referring to convergence in probability. If full knowledge acquisition is expected to occur asymptotically, in the sense of Definition 8, then (51) holds as k → ∞.

Condition (50) is closely related to the notion of posterior consistency [40]. For our model, the latter concept is usually defined as in (52),

Remark 10. Full asymptotic knowledge acquisition (50) can be compared with posterior consistency (52), where the probability in (52) refers to variations in the data sequence {D_k}∞_{k=1} when x_0 holds. Thus, posterior consistency (52) means that full asymptotic knowledge acquisition (50) holds with probability 1. Let L(X) refer to the distribution of the random variable X. Then (52) is equivalent to

L(X_k | D_k) −→a.s. δ_{x_0} (53)

as k → ∞, with −→a.s. referring to almost sure weak convergence with respect to variations in the data sequence {D_k}∞_{k=1} when x = x_0. On the other hand, it follows from Definition 8 that full knowledge acquisition being expected to occur asymptotically is equivalent to

L(X_k | D_k) −→p δ_{x_0} (54)

as k → ∞, which is a weaker concept than posterior consistency, since almost sure weak convergence implies weak convergence in probability. However, sometimes (54), rather than (52) and (53), is used as a definition of posterior consistency.
Remark 11. It is sometimes possible to sharpen (54) and obtain the rate at which the posterior distribution converges to δ_{x_0}. The posterior distribution is said to contract at rate ε_k → 0 towards δ_{x_0} as k → ∞ (see for instance [41]) if, for every sequence M_k → ∞, (55) holds when {D_k}∞_{k=1} varies according to what one expects in the true world x_0. Since convergence in probability is equivalent to convergence in mean for bounded random variables, it can be seen from (55) and (56) that a contraction of the posterior towards δ_{x_0} at rate ε_k is equivalent to expecting full knowledge acquisition asymptotically at rate ε_k.
It follows from Proposition 1 and Remarks 10 and 11 that Bayesian asymptotic theory can be used, within our frequentist/Bayesian framework, to give sufficient conditions for asymptotic learning and knowledge acquisition to occur. Suppose, for instance, that D_k = (Z_1, . . . , Z_k) is a sample of k independent and identically distributed random variables Z_l with distribution Z_l ∼ F(· | x_0) that belongs to the statistical model {F(· | x); x ∈ X}. The likelihood function is then a product of the likelihoods of all observations Z_l. For such a model, a number of authors [40,[42][43][44][45]] have provided sufficient conditions for posterior consistency (52) and (53) to occur. It follows from Remark 10 that these conditions also imply the weaker concept (54) of full, expected knowledge acquisition occurring asymptotically. Suppose (57) holds with a parameter space X ⊂ R^q that is a subset of Euclidean space of dimension q. It is then possible to obtain the rate (56) at which knowledge acquisition is expected to occur. The first step is to use the Bernstein-von Mises theorem, which under appropriate conditions (see for instance [46]) approximates the posterior distribution P_k = L(X_k | D_k) by a normal distribution centered around the maximum likelihood (ML) estimator x̂ of x_0. More specifically, this theorem provides weak convergence (59), as k → ∞, of a re-scaled version of the distribution of X_k | D_k, when {D_k}∞_{k=1} varies according to what one expects when x = x_0. The limiting distribution is a q-dimensional normal distribution with mean 0 and a covariance matrix that equals the inverse of the Fisher information matrix J(x_0), evaluated at the true world parameter x_0. On the other hand, the standard asymptotic theory of maximum likelihood estimation (see for instance [47]) implies

√k (x̂ − x_0) −→ N(0, J(x_0)^{-1}) (60)

as k → ∞, with weak convergence referring to variations in the data sequence {D_k}∞_{k=1} when x = x_0.
Combining Equations (59) and (60), we arrive at the following result:

Theorem 1. Suppose the data sequence consists of independent and identically distributed random variables {Z_l}∞_{l=1}, and that the Bernstein-von Mises theorem (59) and asymptotic normality (60) of the ML estimator hold. Then X_k converges weakly towards x_0 at rate 1/√k, in the sense of (61), as k → ∞. In particular, full knowledge acquisition is expected to occur asymptotically at rate 1/√k.

Proof. Let s = (s_1, . . . , s_q) ∈ R^q and t = (t_1, . . . , t_q) ∈ R^q, and denote the distribution function of N(0, 2J(x_0)^{-1}) by G. Combining (59) and (60), and making use of the fact that the convolution of two independent N(0, J(x_0)^{-1}) variables is distributed as N(0, 2J(x_0)^{-1}), we find that (62) holds as k → ∞, with s ≤ t referring to s_j ≤ t_j for j = 1, . . . , q. Since (62) holds for any t ∈ R^q, Equation (61) follows. Moreover, in view of (56), Equation (61) implies that full knowledge acquisition is expected to occur asymptotically at rate 1/√k.
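The 1/√k rate of Theorem 1 can be sanity-checked numerically for the Bernoulli model (a sketch; the sample sizes are illustrative). With a uniform prior and m heads in k tosses, the posterior is Beta(m + 1, k − m + 1), whose standard deviation shrinks like 1/√k:

```python
import math

# Sketch: posterior contraction at rate 1/sqrt(k) for coin tossing.
# With a uniform prior and m heads in k tosses, the posterior is
# Beta(a, b) with a = m + 1, b = k - m + 1, whose standard deviation
# sqrt(a*b / ((a + b)**2 * (a + b + 1))) decays like 1/sqrt(k).

def posterior_sd(k, m):
    a, b = m + 1, k - m + 1
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Keep the empirical frequency fixed at 1/2 and let k grow:
# sqrt(k) * sd stabilizes, so sd itself shrinks at rate 1/sqrt(k).
for k in [100, 400, 1600]:
    print(k, round(posterior_sd(k, k // 2) * math.sqrt(k), 3))
```

The rescaled standard deviation stabilizes near √(x_0(1 − x_0)) = 0.5, in agreement with the inverse Fisher information of the Bernoulli model at x_0 = 1/2.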
The conditions of Theorem 1 typically require that data, and the agent's interpretation of data, are unbiased. When these conditions fail (cf. Remark 2), there is no guarantee that knowledge acquisition is expected to occur asymptotically as k → ∞.

Examples
Example 1 (Coin tossing). Let x_0 ∈ X = [0, 1] be the probability of heads when a certain coin is tossed. An agent wants to find out whether the proposition S: the coin is symmetric with margin ε > 0 is true or not. This corresponds to a truth function f(x) = 1(x ∈ A), with A = [0.5 − ε, 0.5 + ε], that is known to the agent. Suppose the coin is tossed a large number of times (n = ∞), and let D_k = (Z_1, . . . , Z_k) ∈ D_k = {0, 1}^k be a binary sequence of length k that represents the first k tosses, with tails and heads corresponding to Z_l = 0 and Z_l = 1, respectively. The number of heads M_k = Σ^k_{l=1} Z_l ∼ Bin(k, x_0) after k tosses is then a sufficient statistic for estimating x_0 based on data D_k. Even though {D_k} is an increasing sequence of data sets, we put Y_k(x) = x and G_k = F = B([0, 1]), the Borel σ-algebra on [0, 1], for k = 1, 2, . . .. Let P_0 be the uniform prior distribution on [0, 1]. Since the uniform distribution is a beta distribution, and beta distributions are conjugate priors to binomial distributions, it is well known [17] that the posterior distribution belongs to the beta family as well. Consequently, if X_k is a random variable that reflects the agent's degree of belief in the probability of heads after k tosses, it follows that his belief in a symmetric coin, if M_k = m, is

P_k(A) = ∫_A p_k(x) dx, where p_k(x) = x^m (1 − x)^{k−m} / B(m + 1, k − m + 1) (63)

is the posterior density function of the parameter x, whereas B(a, b) is the beta function and x → L(m | x) the likelihood function. From this it follows that the AIN after k coin tosses with m heads and k − m tails equals

I+_k(A) = log[P_k(A)/P_0(A)] = log[P_k(A)/(2ε)]. (64)

Since data are random, P_k(A) (and hence also I+_k(A)) will fluctuate randomly up and down with probability one (see Figure 3); for this reason, {P_k}∞_{k=1} does not represent a learning process in the strong sense of Definition 3. On the other hand, it follows by the strong law of large numbers that M_k/k −→a.s. x_0 as k → ∞, and from properties of the beta distribution, this implies that full learning and knowledge acquisition occur asymptotically according to Definitions 7 and 8, with probability 1. In view of Remark 10, we also have posterior consistency (52) and (53). There is thus no strong learning, because the belief oscillates; however, there is weak learning after a few coin tosses. In particular, when the number of coin tosses is 1000, there is weak learning, since P_1000(A) > P_0(A) and I+_1000(A) > 0.
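The quantities of this example can be computed numerically (a sketch; P_k(A) is evaluated by midpoint integration of the Beta(m + 1, k − m + 1) density, and the toss outcomes are illustrative):

```python
import math

# Sketch: belief in a symmetric coin, A = [0.5 - eps, 0.5 + eps],
# after k tosses with m heads, under a uniform prior on [0, 1].
# The posterior is Beta(m + 1, k - m + 1); P_k(A) is computed by
# midpoint integration of its density (grid size is illustrative).

def belief_in_symmetry(k, m, eps, grid=50_000):
    a, b = m + 1, k - m + 1
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    lo, hi = 0.5 - eps, 0.5 + eps
    h = (hi - lo) / grid
    total = 0.0
    for i in range(grid):
        x = lo + (i + 0.5) * h
        total += math.exp((a - 1) * math.log(x)
                          + (b - 1) * math.log(1 - x) - log_beta)
    return total * h

eps = 0.05
P0_A = 2 * eps                               # ignorant belief P_0(A)
Pk_A = belief_in_symmetry(1000, 503, eps)    # e.g. 503 heads in 1000 tosses
ain = math.log(Pk_A / P0_A)                  # active information I+_k(A)
print(round(Pk_A, 3), round(ain, 3))
```

For a nearly balanced outcome such as 503 heads in 1000 tosses, the posterior puts almost all of its mass inside A, so the active information is large and positive.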
By analyzing P̄_k instead of P_k, we may also assess whether learning and knowledge acquisition are expected to occur. The expected degree of belief in a symmetric coin after k tosses is P̄_k(A) = ∫_A p̄_k(x) dx, where p̄_k is the expected posterior density function of x after k tosses of the coin. It can be shown that (63) and the weak law of large numbers (M_k/k −→p x_0 as k → ∞, where −→p refers to convergence in probability) lead to uniform convergence as k → ∞ for any ε > 0. The last displayed equations imply P̄_k(A) → 1(x_0 ∈ A) and P̄_k → δ_{x_0} weakly as k → ∞. This and Definitions 7 and 8 imply that full learning and knowledge acquisition are expected to occur asymptotically. This result is also a consequence of posterior consistency, or of Theorem 1. Notice, however, that a purely Bayesian analysis does not allow us to conclude that knowledge acquisition occurs, or is expected to occur, asymptotically.

Example 2 (Historical events). Let x_0 ∈ X = (0, 1] be the (normalized) time of death of an ancestor, and consider the proposition S that the ancestor died during the time interval A = [a, b], where 0 ≤ a < b ≤ 1 are known. If X corresponds to a fairly short time interval of the past, it is reasonable to assume that P_0 has a uniform distribution on (0, 1].
In the first step of the learning process, suppose a radiometric dating D_1 = Z_1 of a burial find from the ancestor appears. If δ = 1/N represents the precision of this dating method, the corresponding σ-algebra is G_1 = σ(Y_1), where Y_1 : X → {1, . . . , N} is defined through Y_1(x) = [x/δ] + 1, and where [x/δ] is the integer part of x/δ. Due to (3), it follows that P_1 has a density function that is constant on each interval ((i − 1)δ, iδ], for some non-negative probability weights p_{1i} ≥ 0 that sum to 1. Since p_{1i} = p_{1i}(D_1), this measure is constructed from the radiometric dating data D_1 of the burial find from the ancestor. The G_1-optimal probability measure is obtained from (8) as p_{1i} = 1(i = i_0), with i_0 = [x_0/δ] + 1. It corresponds to dating the time of death of the ancestor correctly, given the accuracy of this dating method. On the other hand, if the radiometric dating equipment has a systematic error of −δ, a truth-excluding probability measure (10) is obtained, with

p_{1i} = 1(i = i_0 − 1). (66)

In the second step of the learning process, suppose data D_2 = (Z_1, Z_2) is extended to include a piece of text Z_2 from a book where the time of death of the ancestor can be found. This extra source of information increases the σ-algebra to G_2 = F = B([0, 1]), and if the contents of the book are reliable, P_2 = δ_{x_0} is F-optimal. It follows from Definition 3 that strong learning has occurred if Na = i_a and Nb = i_b are integers and (67) holds. Figure 4 illustrates another scenario, where not only strong learning but also strong knowledge acquisition occurs. Suppose now that (66) holds, with i_a + 1 ≤ i_0 − 1 ≤ i_b. If P_2 = δ_{x_0}, the strong learning condition (67) is satisfied, and the weak knowledge acquisition requirement of Definition 6 holds as well. Strong knowledge acquisition has not occurred though, since p_{1 i_0} = 0 means that Equation (34) of Definition 4 (with n = 2) is violated for sufficiently small ε > 0. Note in particular that these conclusions about knowledge acquisition cannot be drawn from a purely Bayesian analysis.
Figure 4. Posterior densities p_1(x) and p_2(x) after one and two steps of the discernment and data collection process of Example 2, when S is true (x_0 ∈ [a, b]). Since p_1 is measurable with respect to G_1, it is piecewise constant with step length δ. Note that strong learning and knowledge acquisition occur.
Assume now that the contents of the book are not reliable. A probability measure P 2 on [0, 1] may be chosen so that it incorporates data Z 1 from the radiometric dating and data Z 2 from the book. This probability measure will also include information about the way the text of the book is believed to be unreliable. If the agent trusts Z 2 too much, it may happen that strong learning does not occur.

Example 3 (Future events).
A youth camp with certain outdoor activities is planned for a weekend. Let X = (0, 1]^2 denote the set of possible temperatures x = (x_1, x_2) of the two days for which the camp is planned, each normalized within a range 0 ≤ x_i ≤ 1. The outdoor activities are only possible within a certain sub-range 0 < a ≤ x_1, x_2 ≤ b < 1 of temperatures. The proposition S: it is possible to have the outdoor activities corresponds to a truth function f(x) = 1(x ∈ [a, b]^2) and A = [a, b]^2. The leaders have to travel to the camp five days before it starts and then make a decision on whether to bring equipment for the outdoor activities or for some other indoor activities. In the first step they consult weather forecast data D_1 = Z_1, with σ-algebra G_1 = σ(Y_1), the σ-algebra generated by Y_1, where δ_1 and δ_2 > δ_1 represent the maximal possible accuracy of weather forecasts five and six days ahead, respectively, N_i = 1/δ_i, and Y_1(x) = ([x_1/δ_1] + 1, [x_2/δ_2] + 1). Let P_0 be the uniform distribution on [0, 1]^2. Due to (3), P_1 has a density function that is constant over each rectangular region R_ij, for some non-negative probability weights p_{1ij} = p_{1ij}(D_1) ≥ 0 that sum to 1, where R_ij corresponds to the ith temperature interval the first day of the camp and the jth temperature interval the second day. Consequently, the accuracy G_1 of weather forecast data forces p_1 to be constant over each R_ij. A G_1-optimal measure assigns full weight 1 to the rectangle R_ij with i = [x_01/δ_1] + 1 and j = [x_02/δ_2] + 1, where x_0 = (x_01, x_02) represents the actual temperatures of the two days. Observe that the G_1-optimal measure is restricted to measurements that are accurate up to δ_1 and δ_2; it cannot do better than assigning the temperatures to the intervals of sizes δ_1 and δ_2 to which the actual temperatures belong, and it cannot say what the exact temperatures are. An exact prediction requires an F-optimal measure.
In a second step, in order to get some additional information, the leaders of the camp consult a prophet. Let P_2 refer to the probability measure based on the weather forecast Z_1 and the message Z_2 of the prophet, so that D_2 = (Z_1, Z_2) and G_2 = F. If the prophet always speaks the truth, and if the leaders of the camp rely on his message, they will make use of the F-optimal measure P_2 = δ_{x_0}, corresponding to weak (and full) learning, and a full amount of knowledge. In general, the camp leaders' prediction in step k is correct with the probability given in (68). If this probability is less than 1 for k = 2, the reason is either that the prophet does not always speak the truth or that the leaders do not rely solely on the message of the prophet. In particular, it follows from Definition 3 that strong learning has occurred if (69) holds. Suppose the weather forecast and the message of the prophet are biased, but they still correctly predict whether outdoor activities are possible or not. Then neither weak nor strong knowledge acquisition occurs, in spite of the fact that the strong learning condition (69) holds. Note in particular that such a conclusion is not possible with a purely Bayesian analysis. Another scenario, wherein neither (weak or strong) learning nor knowledge acquisition takes place, is depicted in Figure 5.

Figure 5. Posterior densities p_1(x_1, x_2) and p_2(x_1, x_2) after one and two steps of data collection for Example 3. Since x_01 < a, it is not possible to have outdoor activities the first day of the camp. The weather forecast density p_1 is supported and piecewise constant on the four rectangles with width δ_1 and height δ_2, corresponding to σ-algebra G_1. The true temperatures (x_01, x_02) are within the support of p_1. On the other hand, the prophet incorrectly predicts that outdoor activities are possible both days; p_2 is supported on the ellipse. In this case, neither (weak or strong) learning nor knowledge acquisition takes place.

Example 4 (Replication of studies). Some researchers want to find the prevalence of the physical symptoms of a certain disease. Let X = [0, 1]^2 refer to the possible set of values x = (x_1, x_2) for the prevalence of the symptoms, obtained from two different laboratories. The first value x_1 corresponds to the prevalence obtained in Laboratory 1, whereas the second value x_2 is obtained when Laboratory 2 tries to replicate the study of Laboratory 1. The board members of the company to which the two laboratories belong want to find out whether the two estimates are consistent, within some tolerance level 0 < ε < 1. In that case, the second study is regarded as a replication of the first one. The proposition S: the second study replicates the first one corresponds to a truth function f(x) = 1(|x_2 − x_1| ≤ ε) and A = {x ∈ X : |x_2 − x_1| ≤ ε}. The true value x_0 = (x_01, x_02) represents the actual prevalences of the symptoms, obtained from the two laboratories under ideal conditions. Importantly, it may still be the case that x_01 ≠ x_02, if either the prevalence of the symptoms changes between the two studies and/or the two laboratories estimate the prevalences within two different subpopulations.
Let D_2 be a data set by which Laboratory 2 receives all needed data from Laboratory 1 in order to set up its study properly (so that, for instance, affection status is defined in the same way in the two laboratories). We will assume Y_2(x_1, x_2) = (x_1, x_2), so that the corresponding σ-algebra G_2 = B(X) corresponds to full discernment, with B(X) being the Borel σ-algebra on X, whereas B_0 = B((0, 1]) is the Borel σ-algebra on the unit interval (see Remark 1). If P_2 is the probability measure obtained from D_2, the probability of concluding that the second study replicated the first is P_2(A) (71). In particular, when each laboratory makes use of data from all individuals in its subpopulation (which is either the same or not for the two laboratories), the F-optimal probability measure (9) corresponds to P_2 = δ_{x_0} (72). Now consider another scenario, where Laboratory 2 only gets partial information from Laboratory 1. This corresponds to a data set D_1 with the same sampled individuals as in D_2, but Laboratory 2 has incomplete information from Laboratory 1 regarding the details of how the first study was set up. For this reason, they make use of a coarser σ-algebra, by which it is only possible to quantify prevalence with precision δ. If this σ-algebra is referred to as B_δ ⊂ B_0, the resulting discernment σ-algebra G_1 is given in (73). The corresponding loss of information is measured through a probability measure P_1 that has the same marginal distribution as P_2 for all events B that are discernible from G_1, i.e., P_1(B) = P_2(B) for all B ∈ G_1. Hence, it follows from (30) and (73) that (74) holds, where N = 1/δ, p_{1j}(x_1) = P_2(X_2 ∈ R_j | X_1 = x_1), and R_j = ((j − 1)δ, jδ] is the jth possible region for the prevalence estimate of Laboratory 2. In particular, the probability that the second study replicates the first one is given by (75). If both laboratories perform a screening and collect data from all individuals in their regions, so that (72) holds, then P_1 is a G_1-optimal measure according to (8), with p_{1j}(x_1) = 1(j = j_0) and j_0 = [x_02/δ] + 1.
Making use of Definition 3, we notice that a sufficient condition for strong learning to occur is that P_0 has a uniform distribution on X (so that P_0(A) = 2ε − ε^2) and that (72) and (75) hold. With full information transfer between the two laboratories, the replication probabilities (71) and (72) based on data D_2 only depend on ε, whereas the corresponding replication probabilities (74) and (75), under incomplete information transfer between the laboratories and data D_1, also depend on δ. In particular, P_1(A) will always be less than 1 when 2ε < δ, even when (75) holds and x_01 = x_02. Moreover, δ sets the limit in terms of how much knowledge can be obtained from the two studies under incomplete information transfer, since P_1(B_ε(x_0)) < 1 for all 0 < ε < δ.
Note that this last conclusion cannot be obtained from a Bayesian analysis, since a true pair x 0 of prevalences does not belong to a purely Bayesian framework.
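The uniform-prior replication probability P_0(A) = 2ε − ε^2 can be verified with a short computation (a sketch; the area of the band {|x_2 − x_1| ≤ ε} in the unit square is 1 − (1 − ε)^2, checked here by midpoint integration):

```python
# Sketch: P_0(A) for the replication region A = {(x1, x2): |x2 - x1| <= eps}
# under a uniform prior on the unit square. The area of the band equals
# 1 - (1 - eps)**2 = 2*eps - eps**2; checked by midpoint integration.

def band_area(eps, grid=50_000):
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        x1 = (i + 0.5) * h
        # length of {x2 in [0, 1]: |x2 - x1| <= eps} for this x1
        total += min(1.0, x1 + eps) - max(0.0, x1 - eps)
    return total * h

eps = 0.1
print(round(band_area(eps), 4), round(2 * eps - eps ** 2, 4))
```

The two printed values agree, confirming the closed form used in the sufficient condition for strong learning above.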
Example 5 (Unmeasured confounding and causal inference). This example illustrates unmeasured confounding and causal inference. Let q = n and X = {0, 1} n . An individual is assigned a binary vector x = (x 1 , . . . , x n ) of length n, where x n ∈ {0, 1} codes for whether that person will have symptoms within five years (x n = 1) or not (x n = 0) that are associated with a certain mental disorder. The first component x 1 ∈ {0, 1} refers to the individual's binary exposure, whereas the other variables x k ∈ {0, 1}, k = 2, . . . , n − 1, are binary confounders. The truth function f (x) = x n corresponds to symptom status, whereas A = {x ∈ X ; x n = 1} represents the vectors x of all individuals in the population with symptoms. Consider the proposition S : Adam will have the symptoms within five years, and let x 0 = (x 01 , . . . , x 0n ) be the vector associated with Adam. We will introduce a sequence of probability measures P 0 , P 1 , . . . , P n , where P 0 represents the distribution of X = (X 1 , . . . , X n ) in the whole population, whereas P k corresponds to the conditional distribution of X ∼ P 0 , given that its first k covariates D k = (Z 1 , . . . , Z k ) = (x 01 , . . . , x 0k ) ∈ D k = {0, 1} k have been observed, with values equal to those of Adam's first k covariates. Since the conditional distribution D k | x 0 is non-random, it follows that P k = P 0 (· | X 1 = x 01 , . . . , X k = x 0k ) for k = 0, 1, . . . , n − 1, whereas P n = δ x 0 for k = n. According to Definition 5, this implies that weak learning occurs with probability 1, and in particular that weak learning is expected to occur. If Y k (x 1 , . . . , x n ) = (x 1 , . . . , x k ), we have that G k = σ(Y k ) for k = 0, . . . , n. Note, in particular, that P k is G k -optimal, corresponding to error-free measurement of Adam's first k covariates. In order to specify the null distribution P 0 , we assume that a logistic regression model [48]

P 0 (X n = 1 | x 1 , . . . , x n−1 ) = g(x 1 , . . . , x n−1 ) = 1/[1 + exp(−β 0 − β 1 x 1 − . . . − β n−1 x n−1 )]

holds for the probability of having the symptoms within five years, conditionally on the n − 1 covariates (one exposure and n − 2 confounders). It is also assumed that the regression parameters β 0 , . . . , β n−1 are known, so that g is known as well. It follows from Equations (76) and (78) that the probabilities

P k (A) = g k (x 01 , . . . , x 0k ) = P 0 (X n = 1 | X 1 = x 01 , . . . , X k = x 0k )

can be interpreted as increasingly better predictions of Adam's symptom status five years ahead, for k = 0, 1, . . . , n − 1, whereas P n (A) = f (x 0 ) = x 0n represents full knowledge of S. In particular, P 0 (A) is the prevalence of the symptoms in the whole population, whereas P 1 (A) = g 1 (x 01 ) is Adam's predicted probability of having the symptoms when his exposure x 01 is known but none of his confounders are measured. Suppose x 2 , . . . , x n−1 are sufficient for confounding control, and that the exposure and the confounders (in principle) can be assigned. Let x 0 = (x 01 , . . . , x 0n ) represent a hypothetical individual for which all covariates are assigned. Under a so-called conditional exchangeability condition [16], it is possible to use a slightly different definition P̃ k of the probability measures in order to compute the counterfactual probability h k (x 01 , . . . , x 0k ) of the potential outcome X n = 1, under the scenario that the first k covariates were set to x 01 , . . . , x 0k . In particular, it is of interest to know how much the unknown causal risk ratio effect h 1 (1)/h 1 (0) of the exposure maximally differs from the known risk ratio g 1 (1)/g 1 (0) [49][50][51][52]. Note in particular that the corresponding logged quantities can be expressed in terms of the active information

I + = log[E P 0 (P 0 (X n = 1 | x 01 , X 2 , . . . , X n−1 ))/P 0 (X n = 1)].
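As a minimal numerical sketch of Example 5, the following code estimates P k (A) by Monte Carlo under an assumed logistic model, and computes the active information as log[P k (A)/P 0 (A)]. The regression coefficients, Adam's covariate vector, and the covariate distribution (independent Bernoulli(0.5)) are illustrative assumptions, not values taken from the example.

```python
import math
import random

random.seed(1)

# Hypothetical instance of Example 5 with n = 5: one exposure and three
# binary confounders. All numerical choices below are for illustration.
n = 5
beta0 = -1.0
betas = [0.8, 0.5, -0.4, 0.3]          # beta_1, ..., beta_{n-1}

def g(covs):
    """Logistic model: P0(X_n = 1 | x_1, ..., x_{n-1})."""
    z = beta0 + sum(b * x for b, x in zip(betas, covs))
    return 1.0 / (1.0 + math.exp(-z))

def P_k(x0, k, draws=50_000):
    """Monte Carlo estimate of P_k(A): Adam's first k covariates are fixed;
    the remaining ones are averaged over the assumed Bernoulli(0.5)
    population distribution."""
    total = 0.0
    for _ in range(draws):
        covs = list(x0[:k]) + [random.randint(0, 1) for _ in range(n - 1 - k)]
        total += g(covs)
    return total / draws

x0 = (1, 0, 1, 1)                       # Adam's covariates (hypothetical)
p = [P_k(x0, k) for k in range(n)]      # P_0(A), ..., P_{n-1}(A)
active_info = [math.log(pk / p[0]) for pk in p]   # log[P_k(A) / P_0(A)]
```

As more of Adam's covariates are measured, P k (A) moves from the population prevalence P 0 (A) toward his individual risk g(x 0 ), and the active information quantifies each step of this learning.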

Discussion
In this paper, we studied an agent's learning and knowledge acquisition within a mathematical framework of possible worlds. Learning is interpreted as an increased degree of true belief, whereas knowledge acquisition additionally requires that the belief is justified, corresponding to an increased belief in the correct world. The theory is put into a framework that involves elements of frequentism and Bayesianism, with possible worlds corresponding to the parameters of a statistical model, where only one parameter value is true, whereas the agent's beliefs are obtained from a posterior distribution. We formulated learning as a hypothesis test within this framework, whereas knowledge acquisition corresponds to consistency of posterior distributions. Importantly, we argue that a hybrid frequentist/Bayesian approach is needed in order to model mathematically the way in which philosophers distinguish learning from knowledge acquisition.
Some applications of our theory were provided in the examples of Section 8. Apart from those, we argue that our framework has quite general implications for machine learning, in particular, supervised learning. A typical task of machine learning is to obtain a predictor of a binary outcome variable Y = f (x 0 ), when only incomplete information X about x 0 is available from training data. The performance of a machine learning algorithm is typically assessed in terms of prediction accuracy, that is, how well f (X) approximates Y, with less focus on the closeness between X and x 0 . In our terminology, the purpose of machine learning is learning rather than knowledge acquisition. This is often a disadvantage, since knowledge acquisition provides deeper insights than learning. For instance, full knowledge acquisition may fail even asymptotically, when data are unbiased and interpreted correctly by the agent, if the agent lacks discernment between the possible worlds in X in the limit k → ∞.
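The point can be sketched with two invented, observationally equivalent worlds: both predict the outcome equally well, so prediction accuracy (learning) is unaffected, while no amount of such data identifies the true world x 0 (knowledge acquisition fails).

```python
import random

random.seed(0)

# Two hypothetical worlds inducing the same predictive distribution for Y.
true_world = (0.5, 0.8)     # x0 (assumed)
other_world = (0.8, 0.5)    # observationally equivalent rival world

def predictive_prob(world):
    """P(Y = 1) induced by a world (a, b); here simply a * b."""
    a, b = world
    return a * b

# Simulated training data generated under the true world.
data = [random.random() < predictive_prob(true_world) for _ in range(10_000)]
freq = sum(data) / len(data)

# Both worlds fit the data equally well, so the data cannot discern x0
# from its rival, however large the sample.
equally_good = predictive_prob(true_world) == predictive_prob(other_world)
```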
On the other hand, it makes no sense to go beyond learning in game theory, where the purpose is to find the optimal strategy (an instance of knowledge how). In more detail, let x ∈ X = {0, . . . , M − 1} refer to the strategy x of a player among a finite set of M possible strategies. The optimal strategy x 0 is the one that maximizes an expected reward function R(x) for the actions taken with strategy x, D refers to data from previous games that a player makes use of to estimate R(·), and G represents the player's maximal possible discernment between strategies. Since the objective is to find the optimal strategy, it is natural to use a truth function with the associated set A = A 0 = {x 0 } of true worlds corresponding to the upper row of (1). It follows from Remark 7 that learning and knowledge acquisition are equivalent for game theory whenever (80) is used. Various algorithms, such as reinforcement learning [53] and sequential sampling models [54,55], could be used by a player in order to generate his beliefs P about which strategy is the best.

Many extensions of our work are possible. A first extension would be to generalize the framework of Theorem 1 and Example 1, where data {D k = (Z 1 , . . . , Z k )} n k=1 are collected sequentially according to a Markov process with increasing state space, without requiring that {Z l } n l=1 are independent and identically distributed. We will mention two related models for which this framework applies. For both of these models, a student's mastery of q skills (which represent knowledge how rather than knowledge that) is of interest. More specifically, x = (x 1 , . . . , x q ) is a binary sequence of length q, with x i = 1 or 0 depending on whether the student has acquired skill i or not, whereas D k corresponds to the exercises given to the student up to time k, together with the student's answers to these exercises. It is also known which skills are required to solve each type of exercise.
The first model is Bayesian knowledge tracing (BKT) [56], which has recently been analyzed using recurrent neural networks [1]. In BKT, a tutor trains the student to learn the q skills, so that the student's learning profile changes over time. At each time point, the tutor is free to choose the exercise at time k based on the previous exercises and on what the student learnt up to time k − 1. The goal of the tutoring is to reach a state x 0 = (1, . . . , 1) where the student has learned all skills. The most restrictive truth function (80) monitors whether the student has learned all skills or not, so that P k (A) is the probability that the student has learned all skills at time k. In view of Remark 7, there is no distinction between learning and knowledge acquisition for such a truth function. A less restrictive truth function f (x) = x i focuses on whether the student has learned skill i or not, so that P k (A) is the probability that the student has learned skill i by time k. The second model, the Bayesian version of Diagnostic Classification Models (DCMs) [57], can be viewed as an extension of Illustration 1. The purpose of DCMs is not to train the student (as for knowledge tracing), but rather to diagnose the student's (or respondent's) current vector x 0 = (x 01 , . . . , x 0q ), where x 0i = 1 or 0 depending on whether this particular student masters skill (or attribute) i or not. The exercises of DCMs are usually referred to as items. Assuming the truth function (80), P k (A) is the probability that the diagnostic test has, by time k, identified which attributes the student masters. Note in particular that the student's attribute mastery profile x 0 is fixed; it is rather the instructor who learns about x 0 when the student is being tested on new items.
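A minimal single-skill BKT update can be sketched as follows; the guess, slip, and transition probabilities are illustrative placeholders rather than values from [56].

```python
def bkt_update(p_learned, correct, p_guess=0.2, p_slip=0.1, p_transit=0.15):
    """One BKT step for a single skill: Bayes' rule applied to the observed
    answer, followed by the learning transition (not learned -> learned)."""
    if correct:
        num = p_learned * (1 - p_slip)
        den = num + (1 - p_learned) * p_guess
    else:
        num = p_learned * p_slip
        den = num + (1 - p_learned) * (1 - p_guess)
    posterior = num / den                     # P(skill mastered | answer)
    return posterior + (1 - posterior) * p_transit

# Tracing the mastery probability for one skill over observed answers.
p = 0.1                                       # prior mastery probability
for correct in [True, True, False, True, True]:
    p = bkt_update(p, correct)
```

Each observed answer updates the belief by Bayes' rule (correct answers raise it, modulo guessing and slipping), after which the transition term models the chance that tutoring at this step taught the skill.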
A second extension would be to consider opinion making and consensus formation [58] for a whole group of N agents that are connected according to some social network. In this context, G k represents the maximal degree of discernment between possible worlds that can be achieved after k time steps, based on external data (available to all agents) and information from other agents (which varies between agents and depends on properties of the social network). It is of interest in this context to study the dynamics of {P ki (A)} N i=1 over time, where P ki (A) represents the belief of agent (or individual) i in proposition S after k time steps. This can be accomplished using a dynamic Bayesian network [59] with N nodes i = 1, . . . , N that represent individuals, associating each node i with a distribution P ki over the set of possible worlds X , corresponding to the beliefs of agent i at time k. A particularly interesting example in this vein would be to explore the degree to which social media and social networks can influence learning and knowledge acquisition.
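A DeGroot-style averaging model, a simple special case of such belief dynamics, already exhibits consensus formation; the 3-agent network and trust weights below are invented for illustration.

```python
# Row-stochastic trust matrix W: W[i][j] is the weight agent i places on
# the belief of agent j. The network and weights are illustrative.
W = [
    [0.6, 0.3, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
beliefs = [0.9, 0.5, 0.1]   # initial beliefs P_0i(A) of agents i = 1, 2, 3

# Each time step k replaces every belief by a weighted average of the
# beliefs of the agent's neighbours.
for k in range(50):
    beliefs = [sum(W[i][j] * beliefs[j] for j in range(3)) for i in range(3)]
```

Since W is row-stochastic and the network is connected, the beliefs P ki (A) converge to a common consensus value as k grows.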
The third possible extension is related to consensus formation, but with a more explicit focus on how N decentralized agents make a collective decision. In order to illustrate this, we first describe a related model of cognition in bacterial populations. Marshall [60] has concluded that "the direction of causation in biology is cognition → code → chemicals". Cognition is observed when there is a discernment and data collection process that either optimizes code or improves the probability of a given chemical outcome. Accordingly, the strong learning process of Definition 3 can be used to model how biological cognition is attained (or at least is expected to be attained). For instance, in quorum sensing, bacteria emit a chemical signal to ascertain the number of neighboring bacteria [61]; when a critical density is reached, the population performs certain functions as a unit (Table 1 of [62] presents several examples of bacterial functions partially controlled by quorum sensing). The proposition under consideration here is S: "the function is performed by at least a fraction ε of bacteria", where ε represents a critical density above which the bacteria act as a unit. The parameter x = (x 1 , . . . , x N ) is a binary sequence reflecting the way in which a population of N = q bacteria acts, so that x i = 1 if bacterium i performs the function, whereas f (x) = 1(x ∈ A) = 1(∑ i x i ≥ εN). For collective decisions, x i rather represents a local decision of agent i, whereas f (x) corresponds to the global decision of all agents. Learning about S at time k = 1, 2, . . . is described by P k (A), the probability that the population acts as a unit at time k. There is a phase transition at time k if the probabilities P 1 (A), . . . , P k−1 (A) of the population acting as a unit are essentially null, whereas P k (A) becomes positive (and hence I + k (A) gets large) when discernment ability and data are extended from (G k−1 , D k−1 ) to (G k , D k ). This is closely related to the fine-tuning of biological systems [33,35], with f being a specificity function and A a set of highly specified states; fine-tuning after k steps of an algorithm that models the evolution of the system corresponds to I + k (A) being large. As for the direction of causation from cognition to code, Kolmogorov complexity, which measures the complexity of an outcome as the length of the shortest code that produces it, can be used in place of, or jointly with, active information to measure learning [63].
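If the bacteria act independently, each performing the function with a signal-dependent probability p k , then P k (A) is a binomial tail probability, and the phase transition appears as p k crosses ε. The population size, threshold, and sequence of probabilities below are illustrative assumptions.

```python
import math

def binom_tail(N, p, m):
    """P(sum_i X_i >= m) for X_i iid Bernoulli(p)."""
    return sum(math.comb(N, j) * p**j * (1 - p)**(N - j)
               for j in range(m, N + 1))

# Assumed per-bacterium probabilities p_k of performing the function at
# successive time steps k, as the quorum-sensing signal accumulates.
N, eps = 100, 0.5
m = math.ceil(eps * N)
p_k = [0.1, 0.3, 0.45, 0.55, 0.7]
P_A = [binom_tail(N, p, m) for p in p_k]   # P_k(A) for k = 1, ..., 5
```

P k (A) stays essentially null while p k is below ε and jumps toward 1 once p k exceeds ε, which is the phase transition described above.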
A fourth theoretical extension is to consider the case |X | = ∞. In this case, instead of the (discrete or continuous) uniform distribution given by (2), it will be necessary to consider more general maximum entropy distributions P 0 , subject to some restrictions, in order to measure learning and knowledge acquisition [20,[64][65][66].
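On a finite support the same principle can be sketched numerically: maximizing entropy subject to a mean constraint yields an exponentially tilted distribution, which reduces to the uniform distribution (2) when the constraint is vacuous. The support and target means below are arbitrary choices for illustration.

```python
import math

def maxent_mean(support, target_mean, lo=-30.0, hi=30.0):
    """Maximum entropy distribution on a finite support subject to a mean
    constraint: p(x) is proportional to exp(lam * x), with lam found by
    bisection so that the resulting mean matches the target."""
    support = list(support)

    def mean_for(lam):
        w = [math.exp(lam * x) for x in support]
        z = sum(w)
        return sum(x * wi for x, wi in zip(support, w)) / z

    for _ in range(200):                 # mean_for is increasing in lam
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * x) for x in support]
    z = sum(w)
    return [wi / z for wi in w]

# With the mean constrained to the support midpoint, the maximum entropy
# distribution is uniform; a larger target mean tilts it to the right.
p_uniform = maxent_mean(range(5), 2.0)
p_tilted = maxent_mean(range(5), 3.0)
```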
A fifth extension is to consider models where the data sets D k are not nested. This is of interest, for instance, in Example 5, when non-nested subsets of confounders are used to predict Adam's disease status. For such scenarios, it might be preferable to use information-based model selection criteria (such as maximizing AIN) in order to quantify learning [67], rather than sequentially testing various pairs of nested hypotheses in order to assess whether learning has occurred in each step k (corresponding to strong learning of Definition 3).

A sixth extension would be to compare the proposed Bayesian/frequentist notions of learning and knowledge acquisition with purely frequentist counterparts. Since learning corresponds to choosing between the two hypotheses in (18), we may consider a test that rejects the null hypothesis when the log likelihood ratio is small enough, or equivalently, when

log L(D; H 0 ) − log L(D; H 1 ) ≤ t, (81)

for some appropriately chosen threshold t. The frequentist notion of learning is then formulated in terms of error probabilities of type I and II, analogously to (20), but for the LR-test (81) rather than the Bayesian/frequentist test (19) with test statistic AIN, or the purely Bayesian approach that relies on posterior odds (21). A frequentist version of knowledge acquisition corresponds to using data D in order to produce a one-dimensional class of confidence regions CR for x 0 , with a nominal coverage probability of CR that varies. In order to quantify how much knowledge is acquired, it is possible to use the steepness of a curve that plots the actual coverage probability P(x 0 ∈ CR) as a function of the volume |CR|. However, a disadvantage of the frequentist versions of learning and knowledge acquisition is that they do not involve degrees of belief, the philosophical starting point of this article. This is related to the critique of frequentist hypothesis testing offered in [68].
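A minimal numerical sketch of such a frequentist LR-test, for two simple hypotheses about a Bernoulli parameter; the hypothesized values, data, and threshold t are invented for illustration.

```python
import math

def log_lr(data, theta0=0.5, theta1=0.7):
    """log L(data; H0) - log L(data; H1) for iid Bernoulli observations,
    with H0: theta = theta0 and H1: theta = theta1 (illustrative values)."""
    ll0 = sum(math.log(theta0 if z else 1 - theta0) for z in data)
    ll1 = sum(math.log(theta1 if z else 1 - theta1) for z in data)
    return ll0 - ll1

data = [True] * 140 + [False] * 60   # 140 successes in 200 trials
t = -2.0                             # illustrative threshold
reject_H0 = log_lr(data) <= t        # reject H0 when the log-LR is small
```

With an empirical success rate of 0.7, the log likelihood ratio is strongly negative and the null hypothesis is rejected, i.e., learning is declared in the frequentist sense.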
Since no prior probabilities are allowed within a frequentist setting, important notions such as the false report probability (FRP) and the true report probability (TRP) are not computable, leading to many non-replicated findings.

A seventh extension is to consider multiple propositions S 1 , . . . , S m , as in [69,70]. For each possible world x ∈ X , we let f : X → {0, 1} m be a truth function such that f (x) = ( f 1 (x), . . . , f m (x)), with f i (x) = 1 (0) if S i is true (false) in world x. It is then of interest to develop a theory of learning and knowledge acquisition for these m propositions. To this end, for each y = (y 1 , . . . , y m ) ∈ {0, 1} m , let A y = {x ∈ X ; f (x) = y} refer to the set of worlds for which the truth value of S i is y i for i = 1, . . . , m. Learning is then a matter of determining which A y is true (the one for which x 0 ∈ A y ), whereas justified true beliefs in S 1 , . . . , S m amount to finding x 0 as well. Learning of statements such as S i ∨ S j and S i ∧ S j can be addressed using the m = 1 theory of this paper, since they correspond to binary-valued truth functions f̃ (