Active Data Selection and Information Seeking

: Bayesian inference typically focuses upon two issues. The first is estimating the parameters of some model from data, and the second is quantifying the evidence for alternative hypotheses— formulated as alternative models. This paper focuses upon a third issue. Our interest is in the selection of data—either through sampling subsets of data from a large dataset or through optimising experimental design—based upon the models we have of how those data are generated. Optimising data-selection ensures we can achieve good inference with fewer data, saving on computational and experimental costs. This paper aims to unpack the principles of active sampling of data by drawing from neurobiological research on animal exploration and from the theory of optimal experimental design. We offer an overview of the salient points from these fields and illustrate their application in simple toy examples, ranging from function approximation with basis sets to inference about processes that evolve over time. Finally, we consider how this approach to data selection could be applied to the design of (Bayes-adaptive) clinical trials.


Introduction
When we look at the world around us, we are implicitly engaging in a form of active data sampling (also known as active sensing or active inference [1][2][3][4][5]).Despite the abundance of visual data available to us at any one moment, our visual systems are adapted to select (or foveate) only a small portion of these data.The advantage of this is that our brains can scale the processes that underwrite perceptual inference to large sensory datasets.All that is needed is an efficient means of sequentially selecting those data to best optimise our inferences about their causes [6,7].
This paper suggests that the same process of active sampling can be (and often is) used in situations in which there is a cost associated with collecting new data or analysing a very large dataset.Figure 1 illustrates the basic idea, akin to perception-action cycles in biology [8], of alternating between drawing inferences from the data we have available and selecting new data based on these inferences.This cycle is implicitly part of the scientific process, in which our pre-existing understanding is used to motivate experiments whose results update our understanding to motivate future experiments, and so on.
Within the schematic shown in Figure 1, we highlight the key quantities in this cyclical process.They include the causes of data (which may include model parameters or hidden states) (θ) and the data (y).Bayesian inference involves taking our prior beliefs about causes p(θ), combining these with the likelihood (p(y|θ)) of observed data, and arriving at a posterior belief (p(θ|y)) as to the causes given the data.The priors and likelihoods together comprise what is known as a generative model, which can be represented in a number of ways, including as the Bayesian network shown in the centre of the graphic.Here, an arrow from one variable to another indicates a conditional probability distribution of the latter given the former.The active sampling cycle.This graphic illustrates the basic idea behind this paper.Analogous with action-perception cycles of the sort found in biological systems, it shows reciprocal interactions between the process of sampling data (in biology, through acting upon the world to solicit new sensations) and drawing inferences about these data (akin to perceptual inference).
Careful data selection is especially important when we consider the problems associated with very large datasets of the sort that are now ubiquitous in machine learning and artificial intelligence settings.Computations with such datasets can be costly in terms of the hardware and computing power required, and their energy consumption raises important sustainability questions [9][10][11][12].Taking inspiration from the approach evinced by natural selection-that of sequentially selecting small amounts of sensory data-may help to alleviate the costs associated with big-data analysis.
To optimise data selection, we first need to identify an appropriate optimality criterion.In what follows, we base this upon the idea of expected information gain-a measure used in optimisation of experimental design [13,14], feature selection [15], and accounts of biological information-seeking and curiosity-driven behaviour [16,17].The idea behind this measure is that we form beliefs about hidden states or parameters in our model of how data are generated.Expected information gain is the degree to which we anticipate changing our beliefs under a given experimental design or data-sampling strategy.The implication is that optimisation of beliefs-and of data selection-work in tandem, as in Figure 1, and that both depend upon our model of how data are generated.
In what follows, we consider the form this model, and therefore data-selection, might take in different settings, starting with abstract function-approximation and progressing to a more realistic example based upon the idea of an adaptive clinical trial.Before we move to these examples, we unpack the basic theory behind active data selection.We outline the basic principles behind Bayesian inference, the role of generative models, and the formulation of expected information gain.Our overall aim is to provide an intuitive overview of the principles that underwrite active data selection, and to illustrate this with some simple examples.

Bayesian Inference, Generative Models, and Expected Information Gain
Bayesian inference is the process of inverting a model of how data (y) are generated to obtain two things [18].The first is the probability of observed data under our modelsometimes referred to as marginal likelihood.This is a ubiquitous measure of the evidence our data affords the hypothesis expressed in our model.Second is the probability of the random variables (θ) in the model given our data-known as a posterior probability.Analogous with action-perception cycles of the sort found in biological systems, it shows reciprocal interactions between the process of sampling data (in biology, through acting upon the world to solicit new sensations) and drawing inferences about these data (akin to perceptual inference).
Careful data selection is especially important when we consider the problems associated with very large datasets of the sort that are now ubiquitous in machine learning and artificial intelligence settings.Computations with such datasets can be costly in terms of the hardware and computing power required, and their energy consumption raises important sustainability questions [9][10][11][12].Taking inspiration from the approach evinced by natural selection-that of sequentially selecting small amounts of sensory data-may help to alleviate the costs associated with big-data analysis.
To optimise data selection, we first need to identify an appropriate optimality criterion.In what follows, we base this upon the idea of expected information gain-a measure used in optimisation of experimental design [13,14], feature selection [15], and accounts of biological information-seeking and curiosity-driven behaviour [16,17].The idea behind this measure is that we form beliefs about hidden states or parameters in our model of how data are generated.Expected information gain is the degree to which we anticipate changing our beliefs under a given experimental design or data-sampling strategy.The implication is that optimisation of beliefs-and of data selection-work in tandem, as in Figure 1, and that both depend upon our model of how data are generated.
In what follows, we consider the form this model, and therefore data-selection, might take in different settings, starting with abstract function-approximation and progressing to a more realistic example based upon the idea of an adaptive clinical trial.Before we move to these examples, we unpack the basic theory behind active data selection.We outline the basic principles behind Bayesian inference, the role of generative models, and the formulation of expected information gain.Our overall aim is to provide an intuitive overview of the principles that underwrite active data selection, and to illustrate this with some simple examples.

Bayesian Inference, Generative Models, and Expected Information Gain
Bayesian inference is the process of inverting a model of how data (y) are generated to obtain two things [18].The first is the probability of observed data under our modelsometimes referred to as marginal likelihood.This is a ubiquitous measure of the evidence our data affords the hypothesis expressed in our model.Second is the probability of the random variables (θ) in the model given our data-known as a posterior probability.These two things can be obtained by specifying the a priori plausible distributions of model variables and the likelihood of generating patterns of data given the values those variables might take.These prior and likelihood beliefs form our generative model.The relationship between the generative model and the associated marginal likelihood and posterior probability is given by Bayes' theorem, which can be expressed as: Equation ( 1) shows the generative model, its inversion, and the relationship between a model and its marginal likelihood.The second line shows that the marginal likelihood is obtained directly from the prior and likelihood beliefs simply by taking the expectation (or average) of the latter, assuming our θ variables are distributed according to the prior (also known as marginalisation).This leaves only one unknown in the first line-the posteriorwhich can then be obtained from rearrangement of the other three terms.In practice, θ may include many different parameters or hidden states, making exact computation impractical [19].While we will touch upon the topic of approximate inference briefly in one of the examples, this is of limited importance for the primary message of this paper.However, the question of how we deal with complex models with multiple parameters is important to address.
Bayesian networks, and related graphical notations, offer a visualisation of a generative model.They tell us which variables depend upon which other variables.This is useful in that it suggests efficient message passing schemes that may be used to perform an inference (see [20,21] for overviews).Such representations have broad applicability, ranging from clinical reasoning [22] to environmental conservation [23].For our purposes, message passing in graphical models is useful in that it helps us to find the quantities required for computing expected information gain.
The information we gain on making an observation can be formalised as the change in our beliefs as a consequence of that observation.A change in probabilities of this sort is quantified using a KL-Divergence (also known as relative entropy)-the average difference between the log probabilities before and after that observation (i.e., between the prior and posterior of Equation ( 1)).The value this KL-Divergence takes, averaged over the data we would predict using our marginal likelihood, is our expected information gain [14]: In Equation (2), we have conditioned our model upon the variable π, which represents a choice we can make in selecting our data.The first line specifies expected information gain directly as the expectation of an information gain.The second line expresses information gain in terms of mutual information that quantifies the degree of conditional dependence between parameters and data.If the parameters were independent of data, the joint distribution would be equal to the product of their marginals, rendering the divergence zero.The third line provides a further interpretation in terms of the difference between two entropies.Essentially, the greater the predictive uncertainty the greater the potential information we can gain, but only if the conditional entropy is sufficiently small that there is a precise mapping from parameters to the data they cause.
The final line of Equation (2) re-expresses the penultimate line by explicitly separating the joint distributions into their constituent singleton (i.e., marginal) and conditional distributions.This gives a sense of a recursive structure, in which the key constituents of information gain are themselves built up of expectations of expectations (of expectations as the model becomes more complex).This provides us with a method for quantifying the information gain about parameters in graphical models.Such models are typically expressed using a series of nodes (or edges, in factor graphs) representing variables linked together by the factors of their associated probability distributions.Typically, causative variables are placed towards the top of the graph and measurable data at the bottom.Pairs of nodes are linked if there is a conditional dependence between them.In other words, two nodes are depicted as linked if there is a (conditional probability) factor of the model in which the variable represented by one of the two nodes is in the conditioning set (the 'parent') and the variable represented by the other node is the support (the 'child') of that factor.
Using the notation pa( ) to represent parents of (i.e., the things something depends upon), and ch( ) to represent children of (i.e., the things that depend upon it), we can express the expected information gain in Equation ( 2)-about a parameter θ i -in terms of messages (µ) that are passed along the edges of the factor graph: The Λ operator here is introduced to indicate either summation or integrationdepending upon whether we are dealing with continuous or discrete variables-with respect to the variables indicated by the subscript.The recursive definitions of the messages (µ) in Equation (3) are exactly the definitions that underwrite belief-propagation inference schemes, including the sum-product algorithm [24].Superscripts and subscripts on the messages indicate the node or variable from which the message comes, and to which it is sent, respectively.In what follows, we consider the nature of the requisite messages.

A Worked Example
In Figure 2, we illustrate a simple worked example based on Equation (3) using a Bayesian network of normally distributed variables.The network includes three nodes.These are: the choice (π) we make about how to sample our data, the parameter (θ) we seek information about, and our data (y).The Bayesian network is first transformed into a factor graph-with explicit representation of factors of the joint probability distribution as boxes-and then to a (normal) Forney factor graph [25] that omits the variable nodes (circles).The Forney factor graph formulation is perhaps the easiest to use when visualising the passing of messages along the edges of the graph.The messages can then be used to construct the two entropies we need to compute expected information gain.
The model in this toy example has the following factors: The model in this toy example has the following factors: ( ) (0,1) In effect, this model amplifies or attenuates the amplitude of the predicted data depending upon a periodic function of our data-sampling policy, π.The upper-left panel of Figure 2 depicts the data sampled from this model for several possible values of θ.Computing the relevant messages for the information gain, we have (omitting normalising constants for simplicity and using ∅ to denote the empty set) The directions of these messages are illustrated in the bottom-right panel of Figure 2. We next substitute these into the first line of Equation (3) to find an expression for the information gain about the parameter θ.Once all terms that are constant with respect to π are eliminated, we are left with: Equation ( 6) is a special case of the third row of Table 1, which highlights analytical expressions for expected information gain for a few common model structures.This function is plotted in the lower-left panel of Figure 2. As we might intuit, the most informative places to sample data align with those in which differences in θ lead to large differences in the predicted data-i.e., in which our choice of π maximises the sensitivity with which y depends upon θ.Given that a periodic (cosine) function is used for the effect of how we In effect, this model amplifies or attenuates the amplitude of the predicted data depending upon a periodic function of our data-sampling policy, π.The upper-left panel of Figure 2 depicts the data sampled from this model for several possible values of θ.Computing the relevant messages for the information gain, we have (omitting normalising constants for simplicity and using ∅ to denote the empty set) The directions of these messages are illustrated in the bottom-right panel of Figure 2. We next substitute these into the first line of Equation (3) to find an expression for the information gain about the parameter θ.Once all terms that are constant with respect to π are eliminated, we are left with: Equation ( 6) is a special case of the third row of Table 1, which highlights analytical expressions for expected information gain for a few common model structures.This function is plotted in the lower-left panel of Figure 2. As we might intuit, the most informative places to sample data align with those in which differences in θ lead to large differences in the predicted data-i.e., in which our choice of π maximises the sensitivity with which y depends upon θ.Given that a periodic (cosine) function is used for the effect of how we chose to sample (π) the data (y), there are multiple peaks which coincide with the peaks and troughs of the periodic function.
* Cat indicates a categorical distribution, Dir a Dirichlet distribution, and ψ indicates a digamma function.See Appendix B of [26] for details of the approximation used in the expected information gain in the second row.

Function Approximation
We next turn to a generic supervised learning problem-that of trying to approximate some function based upon known inputs and the observable outcomes they (stochastically) cause.Our generative model for this is straightforward: The matrix (Φ) comprises a set of concatenated row vectors (ϕ i ).The columns of Φ are elements of some basis set.For the simulations that follow, these are Gaussian radial basis functions.In the MATLAB demo routines that accompany this paper, alternative basis sets (including cosine and polynomial) can be chosen.We can see this model as an approximation of an unknown function whose (discretised) input is the row index of Φ and whose output is the expected value of y corresponding to that row: Our choice of π is used to test alternative values of the input (x) to see the output (y) that the unknown function (f ) generates.Here, π indexes the discrete intervals of x implicit in the columns of Φ. Figure 3 illustrates a depiction of this model as a Bayesian network and a visual representation of the data-generating process.In this simulation, we move beyond our worked example in Figure 2 and consider multiple samples.In this setting, we replace our prior beliefs after each observation with our posterior beliefs, allowing a gradual refinement of our approximated function and the future choices on which this depends.
The posterior probability of the parameters following an input and the marginal probability of the data are: The middle-right panel of Figure 3 shows the predictive (marginal) probability distribution after 28 random choices for π.The lower-left panel shows the posterior probabilities of the parameters.The specific choices are shown in the lower-right panel.Even with these random choices, the 28 samples provide a reasonably good approximation to the shape of the function, as can be seen comparing the middle-left (large number of samples drawn from the function) and right panels.However, a closer look at the choices made reveals some clear inefficiencies.For instance, choices 2 and 4 and choices 6 and 7 are very close to one another (circled in red).Intuitively, we would expect that for any (smooth) function, it will almost always be preferable to sample locations far away from one another, as some uncertainty will have been resolved by samples nearby.This is where the information gain becomes relevant.Substituting Equation ( 7) into Equation (3), we have Figure 4 shows the same set-up as in Figure 3, but now samples are drawn from a distribution whose log is proportional to the information gain in Equation (10).During each iteration, a sample location is chosen from this distribution (a large proportionality constant makes this almost deterministic; i.e., the location with the highest information gain is almost always chosen).A datapoint is then sampled from the generative model under this choice.This is then used to update priors to posteriors using Equation ( 9).The process is then repeated up to 28 iterations.Unlike the random sampling, it is not until late in the simulation that we revisit a previously sampled location.The implication is that we can achieve better inferences with Unlike the random sampling, it is not until late in the simulation that we revisit a previously sampled location.The implication is that we can achieve better inferences with the same, or perhaps even a smaller, number of samples when we select these samples in a smart way.This raises the question as to how many samples we should collect.
An answer to this question can be drawn from behavioural neuroscience and the so-called exploitation-exploration dilemma [27,28].This deals with the situation in which certain parts of our environment may be particularly aversive or attractive-perhaps due to the presence of food or predators-but we do not know where these locations are to begin with.Initially, we must explore, seeking information about our environment, and then at some stage switch to exploiting the knowledge acquired, to ensure we occupy preferred locations.If all locations were equally preferrable, we could continue exploring forever, developing ever more precise beliefs about our world.The same is true when sampling data or performing experiments if there is no cost to performing this sampling.However, once we acknowledge the computational and resource costs to sampling, we conclude that if it were not for the potential information gain, it would be preferable to stop acquiring new data.
To accommodate this cost of sampling, we can simply augment the expected information gain with a prior belief about our propensity to stop sampling: Equation ( 11) treats the information gain as a vector of all the possible sampling locations and concatenates this with a zero element, which reflects the information gained if we were to stop sampling.The relative preference for sampling or not is then expressed in the C vector, which assigns a prior (in the form of a log probability) to each choice of sampling location.In neurobiology, this expression is sometimes referred to as an expected free energy-reflecting its analogous form (KL-Divergence and log marginal) to the variational free energy functionals used in approximate inferences [29][30][31].It allows us to combine information-seeking and cost-aversive imperatives into the same objective function.Figure 5 shows what happens when we select our data according to G[π], where the preference for stopping sampling takes the value 1/4 (i.e., C = [0,. ..,0,1/4] includes zeros for all but the 'stop sampling' option, which is assigned a value of 1/4).This means that when the expected information gain is sufficiently low the chance of deciding to stop sampling increases.
In Figure 5, we see that the same first 15 choices are made as in Figure 4.However, at this point, a decision is made to stop sampling.This is because sufficient uncertainty has been resolved that the cost of sampling is greater than the potential information gain by sampling further.Depending upon how costly the sampling is assumed to be, this decision to stop will move earlier or later, and the quality of the function approximation will vary.
A reasonable question to ask at this stage is why bother with the full informationseeking objective?It is clear from Figures 4 and 5 that all we need to do is choose to sample from the regions with greatest predicted variance, sequentially reducing that variance until it is minimal throughout the domain of the function being approximated [32].This suggests that the predictive entropy from Equation ( 2) is sufficient on its own (c.f., maximum entropy sampling [33]).However, Figure 6 offers a rebuttal to this with a process whose measurement noise increases in variance from the centre of the function domain.This means the amount of unresolvable uncertainty is heterogeneous throughout the domain of potential sampling.Were we to ignore this, the uncertainty left to resolve will always be greater than the cost of sampling.Using the full expected information gain tells us that, although there is uncertainty left, there is no information to gain, and allows for earlier termination of our sampling procedure.Furthermore, it leads to a methodical sampling strategy: sampling starts from the minimally ambiguous central location and fans out as local uncertainty is reduced until the capacity to resolve further uncertainty drops below the cost of acquiring further samples.This avoidance of sampling in ambiguous locations is sometimes referred to as a 'streetlight effect' [34] (see [35] for further discussion with models comprising categorical distributions), which occurs only in the presence of a full expected information gain objective.The streetlight effect is the tendency to search where data are generated in a minimally ambiguous way-i.e., under a streetlamp compared to searching elsewhere on a darkened street.Using the same format as in Figure 3, we here see the effect of equipping sampling with a cost.In the absence of any potential information gain, there is a preference for not sampling.The result of this is that the process of gathering data terminates after a smaller number of acquired samples.This termination occurs once the potential information gain falls below the cost of acquiring further samples.This means that the quality of the function approximation will depend upon the cost associated with sampling.In this example, a reasonable approximation to the function has been attained-inferior to that of Figures 3 and 4 but still capturing a broad outline of the function.Using the same format as in Figure 3, we here see the effect of equipping sampling with a cost.In the absence of any potential information gain, there is a preference for not sampling.The result of this is that the process of gathering data terminates after a smaller number of acquired samples.This termination occurs once the potential information gain falls below the cost of acquiring further samples.This means that the quality of the function approximation will depend upon the cost associated with sampling.In this example, a reasonable approximation to the function has been attained-inferior to that of Figures 3 and 4 but still capturing a broad outline of the function.

Figure 6.
Function approximation with unresolvable uncertainty.This figure deals with the situation in which the variance associated with the likelihood distribution is heterogeneous along the xaxis.Specifically, it increases quadratically with distance from zero.The result of this is that less information gain is available at the extremes of the x-axis.This leads to termination of sampling despite the relatively larger uncertainty about the function in these extremes.This is optimal in this context as there is a limit to the amount of additional uncertainty that can be resolved.

Dynamic Processes
In this section, we consider processes that evolve in time.For example, while the approach in Section 3 might be appropriate for mapping out the heights of a mountain range, it would not be suitable for measuring (for example) weather conditions or tectonic activity in different locations across that terrain, as these will change with time (although see [36,37] for closely related approaches to placing static sensors for environmental Figure 6.Function approximation with unresolvable uncertainty.This figure deals with the situation in which the variance associated with the likelihood distribution is heterogeneous along the xaxis.Specifically, it increases quadratically with distance from zero.The result of this is that less information gain is available at the extremes of the x-axis.This leads to termination of sampling despite the relatively larger uncertainty about the function in these extremes.This is optimal in this context as there is a limit to the amount of additional uncertainty that can be resolved.

Dynamic Processes
In this section, we consider processes that evolve in time.For example, while the approach in Section 3 might be appropriate for mapping out the heights of a mountain range, it would not be suitable for measuring (for example) weather conditions or tectonic activity in different locations across that terrain, as these will change with time (although see [36,37] for closely related approaches to placing static sensors for environmental monitoring).To move beyond a temporal snapshot, we must extend our models to include dynamics.
There are many ways to do this.One could set up a hidden Markov model, whose states probabilistically evolve according to discrete transition probabilities [38].An alternative would be to formulate a model in terms of a differential equation that determines the time evolution of hidden states generating data [39].A final option is to extend the model we used in Section 4 to include a temporal dimension.Specifically, in place of our onedimensional set of basis functions, we now two-dimensional basis functions with both temporal and spatial components.While we employ the third of these methods in this paper, each of these three approaches is perfectly valid.P(y, θ|π) = P(y|θ, π)P(θ) Equation ( 12) can be interpreted similarly to Equation (8), in which the expectation of the data is treated as a function approximation, which now includes a time argument.
However, the form of Equation ( 12) means we can use the same inferential machinery, and the same information gain calculations, as we did for the static inference examples.Figure 7 shows a graphical representation of the matrices involved in generating our data and the inferences obtained after sampling.Every eight time steps, a sample is chosen to maximise the expected information gain, and this is used to update posterior beliefs exactly as previously.As can be seen from the lower-right panel in Figure 7, an accurate inference has been obtained of the evolution of our function over time.The choices made when sampling provide a relatively even coverage of the spatial dimension.Note the implicit inhibition of return, which prevents resampling near recently sampled locations.However, once sufficient time has elapsed, it is worth returning to previously sampled locations as they may have changed since previously being sampled.
Figure 8 shows several snapshots of the inferences drawn at different time points during the simulation reported in Figure 7.This is designed to provide some intuition as to the filling in of predictions as more observations are made.Note that because this is a dynamic process-with a degree of smoothness both in space and time-predictions based upon the current data can be used to inform predictions about nearby spatial locations and to both predict and postdict the values of the function at different points in time.
In this and the previous section, we have demonstrated the way in which smart or optimal sampling may be used to select data in a manner that balances the cost of sampling or of performing further experiments against the information gained from those samples or experiments.Each of these examples has relied upon relatively simple, and analytically comfortable, linear Gaussian systems.Next, we address active sampling in a situation where analytical solutions are no longer possible.7.This is designed to provide some intuition as to the filling in of predictions as more observations are made.Note that because this is a dynamic process-with a degree of smoothness both in space and time-predictions based upon the current data can be used to inform predictions about nearby spatial locations and to both predict and postdict the values of the function at different points in time.In this and the previous section, we have demonstrated the way in which smart or optimal sampling may be used to select data in a manner that balances the cost of sampling or of performing further experiments against the information gained from those samples or experiments.Each of these examples has relied upon relatively simple, and analytically comfortable, linear Gaussian systems.Next, we address active sampling in a situation where analytical solutions are no longer possible.

Clinical Trials
In our final example, we demonstrate the potential utility of the ideas outlined above in the context of a more concrete example.Adaptive Bayesian clinical trials [40] have been increasingly popular in efficiently answering questions about medical interventions in rapidly evolving healthcare situations.A prominent example of this was the use of an adaptive Bayesian methodology to assess interventions during the 2014 West African Ebola outbreak [41].A key argument made in favour of this design was that in a rapidly developing epidemic it is necessary to be able to efficiently compare several possible treatments and to be able to adapt as new information emerges [42].For instance, we may wish to stop one treatment arm, or perhaps reallocate participants to another treatment arm, if there is early evidence of the inferiority or superiority of this treatment, respectively.These benefits are not restricted to the management of epidemics but may apply more broadly.See, for example [43][44][45][46][47][48][49].
We have two reasons for choosing a clinical trial design as our final example.The first is that they are common hypothesis-driven experimental designs with resource limitations.The second is that they offer us an opportunity to go beyond the illustrative linear models used above and force us to deal with more expressive model structures that require pragmatic approximations to be able to apply the methods outlined above efficiently.Our hope is to demonstrate that use of expected information gain does not restrict us only to the simplest model classes but has broad applicability in real-world problems.
The active sampling approach advocated in this paper offers two main opportunities to augment adaptive trial designs.First, it allows us to adapt the design (e.g., inclusion criteria and assignment to a treatment arm) to maximise the information we obtain about treatment efficacy.Second, it allows us to balance this information gain against various costs, namely, stopping the trial when the potential information gain no longer exceeds the cost of acquiring further data.Alternatively, it might include a preference for good clinical outcomes and aversion for bad outcomes-prompting adjustments of the allocations between treatment arms to maximise benefit as we infer more about the effect of the intervention.This blurs the line between clinical trial and public health intervention and can be seen as analogous to animal behaviour that is never fully exploitative or explorative but is a balance between the two.
Our setup is as follows: for each new cohort of participants, we decide upon the randomisation ratio to adopt and the time(s) at which we follow them up.To simplify for illustration purposes, we assume each cohort is followed up only once, noting that the same principles can be applied to multiple follow-up points.Our generative model can be expressed as follows: In this model, the data (y) are binary values reflecting survival up to the measurement time or death before this point.The probability of survival to a given time is the product of the survival probabilities for all previous time steps.Our probability for survival for a given time step is a logistic function that determines the effect of time since enrolment in the trial, the effect of demographic (here, sex), and the treatment effect.As in our previous examples, the time evolution is modelled using a Gaussian basis set.Patient demographics (d) are sampled from a categorical distribution.There are two choices that can be made.The first is the time at which follow up is performed (subscript τ).The second is the randomization ratio (subscript r).There are assumed to be three options for this ratio.These are 1/3, 1/2, or 2/3 allocated to the treatment group.The allocation is represented by the symbol v.We assume these decisions are taken before each cohort, where each cohort consists of eight participants.The parameterisation of the treatment effect (i.e., the final element of θ) can be interpreted as a hazard ratio, as might be estimated in a Kaplan-Meier [50] analysis, and is consistent with Cox proportional hazards assumptions [51].
This model differs from those in previous sections in that it includes highly nonlinear functions and is not based upon conjugate priors.This makes analytical solutions to the inference problem intractable.In place of this, we employ a variational approximation to arrive at posterior estimates.This involves a Newton optimization scheme (also known as Variational Laplace) applied to maximise a lower bound on the log marginal likelihood.The relevant gradient expression for this scheme is as follows: The Hessian is as follows: The variational optimization scheme then takes the form (where superscripts indicate iteration number) Figure 9 shows a simulation of this design with random sampling of both follow-up time and randomization ratio.The variational inference scheme here terminates following 16 iterations.The upper part of the figure shows the form of the generative model as a Bayesian network.The process used to simulate data is summarised in the plots of survival against time for each of the combinations of treatment versus placebo and male versus female.In this example, the treatment is harmful, leading to a lower survival in the treatment compared to the placebo group.This pattern is captured by the inference scheme following 16 cohorts and a relatively even sampling throughout time and randomization patterns.
Next, we wish to employ the information gain prior, to deliberately choose follow-up times and randomization ratios to maximise our information about the trajectories of each patient group.As above, the first step in doing so is to identify the form of the messages required for Equation (3).
We can simplify the estimation of these messages using a local quadratic approximation for the continuous parameters.This technique is sometimes referred to as Variational Laplace and is implicit in Equation ( 17) [52,53].
For the messages in the first term of Equation ( 18), we employ a quadratic expansion of the entropy around the posterior mode of the parameters: The second term of Equation ( 18) has the following form when the ρ function has been expanded to the second order:  Please see Equations ( 15) and ( 16) for the forms of the relevant gradients and Hessians.Figure 10 illustrates the same setup as in Figure 9 but now using the expected information gain from Equation (18) to guide sampling of data.Due to the approximations made to ensure tractability, we reduce the temperature (i.e., α from Figure 4) to four to acknowledge that there is some uncertainty as to whether the best choices are those consistent with the maximal value of Equation (18).There are some notable differences between the choices made in Figure 10 compared to Figure 9.The most obvious of these is that the follow-up times selected have been moved later once optimal sampling is employed.This makes intuitive sense as a later follow-up time is informative about the survival probabilities at all prior times, whereas an earlier follow-up time is not informative about survival probabilities at later times.This has led to a marginal improvement in estimation of the hazard functions.In both cases, the number of people assigned to each treatment group appears approximately equivalent, as would be expected under random sampling, with an expectation of 50% assignment to each group overall.Using the same format as Figure 9, this figure shows the result of active sampling that maximises (subject to some random noise) the expected information gain about the parameters.This leads to a slightly better predicted survival curve, which appears to be due to a tendency to select later follow-up times.Later times tend to have greater potential information gain as survival to a later time is informative about previous survival probabilities.
Our final simulation now incorporates the preferences discussed in previous sections.Here the preferences are expressed in terms of a prior probability that survival is more likely than death.This must be nuanced as a preference for observing survival alone would favour observations at the point of enrolment where the probability of survival is  9, this figure shows the result of active sampling that maximises (subject to some random noise) the expected information gain about the parameters.This leads to a slightly better predicted survival curve, which appears to be due to a tendency to select later follow-up times.Later times tend to have greater potential information gain as survival to a later time is informative about previous survival probabilities.
Our final simulation now incorporates the preferences discussed in previous sections.
Here the preferences are expressed in terms of a prior probability that survival is more likely than death.This must be nuanced as a preference for observing survival alone would favour observations at the point of enrolment where the probability of survival is 100%.To finesse this problem, we specify a preference for survival that increases exponentially in time.In other words, it is preferable to observe somebody having survived until the end of the potential follow-up period compared to having observed their survival shortly after enrolment.As shown in Figure 11, this preference for long-term survival results in longer follow-up times and, interestingly, once an inference that the treatment is potentially harmful has been made, there is a gradual shift towards randomising more people towards the placebo group.This implies a similar reasoning to when clinical trials are terminated early when a clear benefit of membership of one of the two treatment arms has been identified.100%.To finesse this problem, we specify a preference for survival that increases exponentially in time.In other words, it is preferable to observe somebody having survived until the end of the potential follow-up period compared to having observed their survival shortly after enrolment.As shown in Figure 11, this preference for long-term survival results in longer follow-up times and, interestingly, once an inference that the treatment is potentially harmful has been made, there is a gradual shift towards randomising more people towards the placebo group.This implies a similar reasoning to when clinical trials are terminated early when a clear benefit of membership of one of the two treatment arms has been identified.Sampling with preference for long-term survival.Again, using the format of Figure 9, this figure demonstrates the effect of preferring long-term survival.As might be expected, this bias is the follow-up time towards the end of the trial period.Perhaps more interestingly, although initially randomising to both treatment arms, once it becomes apparent that long-term survival is compromised in the treatment group, these preferences bias randomization towards the placebo group.

Discussion
This paper's focus has been on illustrating how we might make use of informationseeking objectives-augmented with costs or preferences-to choose the best data to optimise our inferences.While we have highlighted specific examples of active sampling or optimal Bayesian design, we have not provided a systematic analysis of the benefits of this approach here-instead favouring demonstrations of application to different sorts of models and problems.We have considered the problem of function approximation when there is a cost to data acquisition, or to the computational demands when sampling data from large datasets.We extended this to consider dynamic inference when the data obtained from the same experiment, address, or abstract location may change as a function of the time at which they are acquired.Finally, we illustrated the use of these methods even when the problem is not reducible to a general linear model.
It is worth noting that several of our examples would yield identical results had we adopted a data-sampling approach based upon maximum entropy sampling [54,55].However, the key differences emerge when the variance around predicted outcomes is inhomogeneous.We illustrated this both in the static inference example in Figure 6 and also in our clinical trial example.Note that the covariance associated with the likelihood distribution is a function of the Hessian of the log likelihood.Equation (15) shows that this is not constant in the nonlinear setting considered in our final example.
There are several technical points worth considering for how we might advance the concepts reviewed in this paper.These are broadly: (1) refinement of the active selection process, (2) empirical evaluation of active versus alternative sampling methods, and (3) identifying the appropriate cost functions.Sophisticated inference is a method that might address the first of these [56].This deals with the situation in which there is a dependence between the information gained from one choice and the potential information that can be gained following future decisions.In addition to this, it can help prioritise those sources of information that minimise cost in the future.Essentially this involves solving a recursive planning problem, like the approach used to solve Bellman optimality problems.The utility of sophisticated recursions is obvious when we consider how we would approach a clinical trial with multiple follow-up points, as the timing of the second follow-up will depend upon the information gain we anticipate from the first.
The issue of empirical validation is one that can be addressed through two routes.The first is in selecting a limited number of samples from a large dataset through either an active sampling approach or through some alternative data selection strategy.By using the posterior probability distributions under each sampling strategy as priors and comparing the marginal likelihood for the two sets of priors-assessed against the remainder of the dataset-we could evaluate the sampling strategy that leads to the best model.The second route to assessing the utility of active sampling is through comparing the energy costs and computing time required to achieve a reasonable approximation of a function using a limited dataset compared to that required to take account of a larger and more comprehensive dataset.
The identification of appropriate cost functions may be a trickier problem as these will vary from application to application and with the resources available.This has particular significance for clinical trial design, where the relative preference for different outcomes clearly has important ethical connotations.One approach would be to reverse-engineer the implicit cost functions employed in previous experimental designs (for example, the cost functions that would lead to a given early stopping criteria).An alternative is to find a way to synthesise and formalise views from the relevant stakeholders.For an interesting example of this in the context of determining outcomes for cystic fibrosis trials, see [57,58].

Conclusions
In this paper, we have sought to demonstrate that an appeal to the action-perception cycles seen in biology may be helpful in addressing the difficulties of managing very large datasets or in acquiring costly experimental data.The key ideas involved an appeal to foveal-like sampling of small portions of the total available data to minimise computational cost, the ways in which we can select the most informative data, and the way in which we can trade this off directly against the costs of sampling further datapoints.Drawing from problems of exploitation and exploration-and making use of the same expected information gain measures that seem to be used in biological cognition-we have seen that it is possible to trade off the benefits of uncertainty resolution against its costs.A possible area of application for this is in the realm of Bayes-adaptive clinical trial design.Healthcare research inherently balances desired health outcomes with the resolution of uncertainty about how to achieve these.However, the broader imperative to balance the costs and benefits of how we sample data is likely to become more acute in an era of very large datasets and energy-hungry models.

Figure 1 .
Figure 1.The active sampling cycle.This graphic illustrates the basic idea behind this paper.Analogous with action-perception cycles of the sort found in biological systems, it shows reciprocal interactions between the process of sampling data (in biology, through acting upon the world to solicit new sensations) and drawing inferences about these data (akin to perceptual inference).

Figure 1 .
Figure 1.The active sampling cycle.This graphic illustrates the basic idea behind this paper.Analogous with action-perception cycles of the sort found in biological systems, it shows reciprocal interactions between the process of sampling data (in biology, through acting upon the world to solicit new sensations) and drawing inferences about these data (akin to perceptual inference).

Figure 2 .
Figure 2. A worked example.This figure offers a worked example of how we compute expected information gain.It draws from the message passing perspective set out in Equations (3) and (4) which detail the analytic computations that lead us to expected information gain as a function of our choices, π, for this generative model.The colours in the upper left plot represent different values of ϴ.The red arrows in the lower right factor graph indicate the directions of messages being passed.

Figure 2 .
Figure 2. A worked example.This figure offers a worked example of how we compute expected information gain.It draws from the message passing perspective set out in Equations (3) and (4) which detail the analytic computations that lead us to expected information gain as a function of our choices, π, for this generative model.The colours in the upper left plot represent different values of θ.The red arrows in the lower right factor graph indicate the directions of messages being passed.

26 Figure 3 .
Figure 3. Function approximation with random sampling.The upper part of this figure illustrates the generative model schematically in terms of a Bayesian network.Here, a set of 16 parameters with zero-centred normally distributed priors weight the columns of a matrix of Gaussian basis functions (only three are shown in the schematic).The action we choose when generating our data selects which row of the matrix is used to generate that outcome.The middle-left panel shows the result of taking 200 or so samples from the underlying function that we seek to approximate.The middle-right panel shows the predictive distribution following the final (28th) sequential sampleshown as a blue dot-in terms of mode and 90% credible intervals.The lower-left panel shows the posterior parameter modes and 90% credible intervals, while the lower-right panel shows the sequence of choices (from top to bottom) aligned with the x-axis of the plot above.The red circle samples illustrate redundant or inefficient sampling under this random scheme.

Figure 3 .
Figure 3. Function approximation with random sampling.The upper part of this figure illustrates the generative model schematically in terms of a Bayesian network.Here, a set of 16 parameters with zero-centred normally distributed priors weight the columns of a matrix of Gaussian basis functions (only three are shown in the schematic).The action we choose when generating our data selects which row of the matrix is used to generate that outcome.The middle-left panel shows the result of taking 200 or so samples from the underlying function that we seek to approximate.The middle-right panel shows the predictive distribution following the final (28th) sequential sample-shown as a blue dot-in terms of mode and 90% credible intervals.The lower-left panel shows the posterior parameter modes and 90% credible intervals, while the lower-right panel shows the sequence of choices (from top to bottom) aligned with the x-axis of the plot above.The red circle samples illustrate redundant or inefficient sampling under this random scheme.

26 Figure 4 .
Figure 4. Function approximation with intelligent sampling.This figure uses the same format as Figure 3, but now each choice is sampled to maximise anticipated information gain.Note the more accurate approximation to the region between −100 and −50 in the middle-right panel.The α (i.e., precision or inverse temperature) parameter here is set to 64 to ensure almost deterministic sampling of the most informative locations.

Figure 4 .
Figure 4. Function approximation with intelligent sampling.This figure uses the same format as Figure 3, but now each choice is sampled to maximise anticipated information gain.Note the more accurate approximation to the region between −100 and −50 in the middle-right panel.The α (i.e., precision or inverse temperature) parameter here is set to 64 to ensure almost deterministic sampling of the most informative locations.

Algorithms 2024 , 26 Figure 5 .
Figure 5. Function approximation with cost of sampling.Using the same format as in Figure3, we here see the effect of equipping sampling with a cost.In the absence of any potential information gain, there is a preference for not sampling.The result of this is that the process of gathering data terminates after a smaller number of acquired samples.This termination occurs once the potential information gain falls below the cost of acquiring further samples.This means that the quality of the function approximation will depend upon the cost associated with sampling.In this example, a reasonable approximation to the function has been attained-inferior to that of Figures3 and 4but still capturing a broad outline of the function.

Figure 5 .
Figure 5. Function approximation with cost of sampling.Using the same format as in Figure3, we here see the effect of equipping sampling with a cost.In the absence of any potential information gain, there is a preference for not sampling.The result of this is that the process of gathering data terminates after a smaller number of acquired samples.This termination occurs once the potential information gain falls below the cost of acquiring further samples.This means that the quality of the function approximation will depend upon the cost associated with sampling.In this example, a reasonable approximation to the function has been attained-inferior to that of Figures3 and 4but still capturing a broad outline of the function.

Figure 7 .
Figure 7. Dynamic information seeking.This figure deals with the situation in which the function we are trying to map out changes with time.The generative model is summarised in the upper part of this figure as the product of a matrix of coefficients with two matrices of basis functions.The data generated now depend both upon our choice and upon time.The plot of sample data shows data we could have sampled from the function at the final time point by taking the final row of the plot in the lower left and generating samples from this for each possible sampling location.This is to afford a comparison between a relatively exhaustive sampling of the data and the inferred shape of the function in the prediction plot based upon only one sample from this time point.The prediction plot shows the predicted data at this time point.The plot of posterior modes shows the estimated coefficients with 90% credible intervals at the final time point.The plot of choices is like that shown in previous figures.However, each choice occurs eight time steps after the previous choice.The lower two plots show both the time evolution of the underlying function generating the data and the inferred (expected) values of this function at each time point.We see that the shape of this

Figure 7 .
Figure 7. Dynamic information seeking.This figure deals with the situation in which the function we are trying to map out changes with time.The generative model is summarised in the upper part of this figure as the product of a matrix of coefficients with two matrices of basis functions.The data generated now depend both upon our choice and upon time.The plot of sample data shows data we could have sampled from the function at the final time point by taking the final row of the plot in the lower left and generating samples from this for each possible sampling location.This is to afford a comparison between a relatively exhaustive sampling of the data and the inferred shape of the function in the prediction plot based upon only one sample from this time point.The prediction plot shows the predicted data at this time point.The plot of posterior modes shows the estimated coefficients with 90% credible intervals at the final time point.The plot of choices is like that shown in previous figures.However, each choice occurs eight time steps after the previous choice.The lower two plots show both the time evolution of the underlying function generating the data and the inferred (expected) values of this function at each time point.We see that the shape of this function at different time points, despite only the sparse sampling of the data, is well captured by this scheme.

Figure 8
Figure8shows several snapshots of the inferences drawn at different time points during the simulation reported in Figure7.This is designed to provide some intuition as to the filling in of predictions as more observations are made.Note that because this is a dynamic process-with a degree of smoothness both in space and time-predictions based upon the current data can be used to inform predictions about nearby spatial locations and to both predict and postdict the values of the function at different points in time.

Figure 8 .
Figure 8. Trajectory of beliefs.This figure captures several snapshots of the time-evolving inference process summarised in Figure 7.The first row shows the inferences made following the first data sample.The second row shows the inferences made at time step 50.The third shows the inferences made by the end of the time period.Note the improvement in predictive precision in the second and third rows, corresponding to the use of prior information from previous time steps.

Figure 8 .
Figure 8. Trajectory of beliefs.This figure captures several snapshots of the time-evolving inference process summarised in Figure 7.The first row shows inferences made following the first data sample.The second row shows the inferences made at time step 50.The third shows the inferences made by the end of the time period.Note the improvement in predictive precision in the second and third rows, corresponding to the use of prior information from previous time steps.

Figure 9 .
Figure 9. Random sampling for a Randomised Control Trial.The upper part of this figure shows the form of the generative model as a Bayesian network with the factors shown below.The underlying dynamics of the simulated trial are shown in the four plots on the middle left.These show the expected survival curves for each combination of demographics and treatment group.The letter T indicates the treatment group, the letter P indicates the placebo group, the letter F indicates females, and the letter M indicates males.The middle-right plot shows predicted survival from the time of enrolment (the units of the time axis are arbitrary).The thin red and blue lines give the true survival curves based upon the underlying generative process.The predictions are shown as thicker lineswith accompanying 90% credible intervals.Below this is a plot of choices which takes the same format as in previous figures but now also reports the choice randomization ratio.This is shown by the size of the markers for each choice.There are three sizes of marker representing, from small to large, 2/3 allocated to a treatment group, 1/2 allocated to a treatment group, and 1/3 allocated to a treatment group.The lower-right plot shows the cumulative number of people allocated to the treatment (blue) and placebo (red) groups.Finally, the lower-left plot shows the posterior estimates of the parameters, with parameter five representing the effect of demographic on survival and parameter six representing the effect of treatment on survival.

Figure 9 .
Figure 9. Random sampling for a Randomised Control Trial.The upper part of this figure shows the form of the generative model as a Bayesian network with the factors shown below.The underlying dynamics of the simulated trial are shown in the four plots on the middle left.These show the expected survival curves for each combination of demographics and treatment group.The letter T indicates the treatment group, the letter P indicates the placebo group, the letter F indicates females, and the letter M indicates males.The middle-right plot shows predicted survival from the time of enrolment (the units of the time axis are arbitrary).The thin red and blue lines give the true survival curves based upon the underlying generative process.The predictions are shown as thicker lines with accompanying 90% credible intervals.Below this is a plot of choices which takes the same format as in previous figures but now also reports the choice randomization ratio.This is shown by the size of the markers for each choice.There are three sizes of marker representing, from small to large, 2/3 allocated to a treatment group, 1/2 allocated to a treatment group, and 1/3 allocated to a treatment group.The lower-right plot shows the cumulative number of people allocated to the treatment (blue) and placebo (red) groups.Finally, the lower-left plot shows the posterior estimates of the parameters, with parameter five representing the effect of demographic on survival and parameter six representing the effect of treatment on survival.

Algorithms 2024 , 26 Figure 10 .
Figure10.Intelligent sampling for a Randomised Control Trial.Using the same format as Figure9, this figure shows the result of active sampling that maximises (subject to some random noise) the expected information gain about the parameters.This leads to a slightly better predicted survival curve, which appears to be due to a tendency to select later follow-up times.Later times tend to have greater potential information gain as survival to a later time is informative about previous survival probabilities.

Figure 10 .
Figure10.Intelligent sampling for a Randomised Control Trial.Using the same format as Figure9, this figure shows the result of active sampling that maximises (subject to some random noise) the expected information gain about the parameters.This leads to a slightly better predicted survival curve, which appears to be due to a tendency to select later follow-up times.Later times tend to have greater potential information gain as survival to a later time is informative about previous survival probabilities.

Figure 11 .
Figure 11.Sampling with preference for long-term survival.Again, using the format of Figure9, this figure demonstrates the effect of preferring long-term survival.As might be expected, this bias is the follow-up time towards the end of the trial period.Perhaps more interestingly, although initially randomising to both treatment arms, once it becomes apparent that long-term survival is compromised in the treatment group, these preferences bias randomization towards the placebo group.

Figure 11 .
Figure 11.Sampling with preference for long-term survival.Again, using the format of Figure9, this figure demonstrates the effect of preferring long-term survival.As might be expected, this bias is the follow-up time towards the end of the trial period.Perhaps more interestingly, although initially randomising to both treatment arms, once it becomes apparent that long-term survival is compromised in the treatment group, these preferences bias randomization towards the placebo group.