A Bayesian Framework for Information-Theoretic Probing

Pimentel et al. (2020) recently analysed probing from an information-theoretic perspective. They argue that probing should be seen as approximating a mutual information. This led to the rather unintuitive conclusion that representations encode exactly the same information about a target task as the original sentences. The mutual information, however, assumes the true probability distribution of a pair of random variables is known, leading to unintuitive results in settings where it is not. This paper proposes a new framework to measure what we term Bayesian mutual information, which analyses information from the perspective of Bayesian agents, allowing for more intuitive findings in scenarios with finite data. For instance, under Bayesian MI, data can add information, processing can help, and information can hurt, making it more intuitive for machine learning applications. Finally, we apply our framework to probing, where we believe Bayesian mutual information naturally operationalises ease of extraction by explicitly limiting the available background knowledge used to solve a task.


Introduction
Pimentel et al. (2020) recently undertook an information-theoretic analysis of probing. They argue that probing may be viewed as approximating the mutual information between a linguistic property (e.g., part-of-speech tags) and a contextual representation (e.g., BERT's). Counter-intuitively, however, due to the data-processing inequality, contextual representations contain exactly the same information about any task as the original sentence, under mild conditions. When viewed under this lens, the goal of probing is not inherently clear. One limitation of Pimentel et al.'s analysis is that it focuses on the mutual information (MI): to be of practical use, their argument requires that a probe match the true distribution according to which the data were generated, in the limit of infinite training data. In contrast, our paper formulates an information-theoretic framework that is compatible with both model misspecification and finite data.
In his seminal work, Shannon (1948) occupied himself with the limits of communication. Indeed, mutual information can be described as the theoretical limit (or upper bound) on how much information can be extracted from one random variable about another. However, this limit is only achievable when one has full knowledge of these random variables, including the true probability distribution according to which they are distributed. In practice, we will not have access to such information, and it may be difficult to approximate. It follows that any system with imperfect knowledge of a random variable's true distribution will only be able to extract a portion of this information.
With this in mind, we propose and motivate an agent-based framework for measuring information. We term our quantity Bayesian mutual information and show that it generalises Shannon's MI, holistically accounting for uncertainty within the Bayesian paradigm: it measures the amount of information a rational agent could extract from a random variable under partial knowledge of the true distribution. In addition to the definition, our paper provides many useful theoretical results. For instance, we prove that conditioning does not necessarily reduce Bayesian entropy and that Bayesian mutual information does not obey the data-processing inequality. We argue that these properties make our Bayesian framework ideal for an analysis of learned representations.
In the empirical portion of our paper, we investigate both part-of-speech tagging and dependency arc labelling. Moreover, our information-theoretic measure holistically captures a notion of ease of extraction, limiting the amount of data available to solve the task. In line with intuition, Bayesian MI shows that high-dimensional representations, such as BERT's, actually hurt performance in the very low-resource scenario, making less information available to a Bayesian agent than a simple categorical distribution. This is because, when little data is available, these agents overfit to the evidence under weak priors. In the high-resource scenario of English, ALBERT dominates the curves, making more information available than other contextualised embedders. In short, Bayesian mutual information reconciles the probing literature with its frequently posed question: how much information can be extracted from these representations?

Background: Information Theory
Information theory (Shannon, 1948; Cover and Thomas, 2006) provides us with a number of tools to analyse data and their associated probability distributions, among which are the entropy and the mutual information. These are traditionally defined according to a "true" probability distribution, i.e. p(x) or p(x, y), which may not be known, but which dictates the behaviour of the random variables X and Y. The atomic unit of information theory is the surprisal, which is defined as follows:

    s(x) := log (1 / p(x))

Arguably, the most important information-theoretic definition is its expected value, termed the entropy:

    H(X) := E_{p(x)} [ log (1 / p(x)) ]

Finally, another important concept is the mutual information (MI) between two random variables:

    I(X; Y) := H(X) − H(X | Y)

Unfortunately, information theory has a few properties which do not conform to our intuitions about the mechanics of information in machine learning:

(i) Data Does Not Add Information: The entropy is defined according to a source distribution p(x). So, if multiple instances of X are sampled i.i.d. from p(x), access to a set d_N = {x^(1), ..., x^(N)} of such instances cannot provide any information, i.e. H(X | D_N) = H(X).
(ii) Conditioning Reduces Entropy: Another basic result from information theory is that conditioning cannot increase entropy, only reduce it, i.e. H(X | Y ) ≤ H(X). This implies datapoints can never be misleading, which is not true in practice.
(iii) Data Processing Does Not Help: The data processing inequality states that processing some random variable with a function f (·) can never increase how informative it is, but only reduce its information content, i.e. I(X; f (Y )) ≤ I(X; Y ).
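These classical properties are easy to verify numerically for Shannon's quantities. The following Python sketch (the helper names and the toy joint distribution are ours, purely illustrative) checks that conditioning cannot increase entropy on a small dependent pair:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def joint_flat(joint):
    return [p for row in joint for p in row]

def conditional_entropy(joint):
    """H(X | Y) = H(X, Y) - H(Y); rows index x, columns index y."""
    py = [sum(col) for col in zip(*joint)]
    return entropy(joint_flat(joint)) - entropy(py)

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return entropy(px) + entropy(py) - entropy(joint_flat(joint))

# A dependent pair: X is a noisy copy of Y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
h_x = entropy([sum(row) for row in joint])
h_x_given_y = conditional_entropy(joint)
mi = mutual_information(joint)
```

Property (iii) follows similarly: any deterministic f(Y) merges columns of the joint, which can only lower I(X; f(Y)).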

Background: Belief Entropy
A related question that arises is how to measure information in scenarios where the true distribution is not known. For instance, what is the surprisal of a learning agent with a belief p_θ(x) who encounters an instance x? The straightforward answer would be to use eq. (1); nonetheless, this agent does not know the true distribution p(x). This agent's surprisal is usually taken according to its belief:

    s_θ(x) := log (1 / p_θ(x))

Similarly, this agent's entropy has historically been defined exclusively according to this belief distribution (Gallistel and King, 2011):

    H_θ(X) := E_{p_θ(x)} [ log (1 / p_θ(x)) ]

We term this the belief entropy. We can further extend this to a belief mutual information:

    I_θ(X; Y) := H_θ(X) − H_θ(X | Y)

We note this definition is not grounded in the true distribution in any form. In fact, about the belief mutual information, Gallistel and King (2011) state: "the subjectivity that it implies is deeply unsettling [...] the amount of information actually communicated is not an objective function of the signal from which the subject obtained it".
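The subjectivity Gallistel and King point to can be made concrete: an agent with a confidently wrong belief assigns itself a low belief entropy, while the truth-grounded cross-entropy of the same belief is high. A minimal sketch (our toy numbers):

```python
import math

def belief_entropy(belief):
    """H_theta(X): the expected belief surprisal under the belief itself."""
    return -sum(q * math.log(q) for q in belief if q > 0)

def cross_entropy(true_p, belief):
    """The expected belief surprisal under the TRUE distribution."""
    return -sum(p * math.log(q) for p, q in zip(true_p, belief) if p > 0)

true_p = [0.5, 0.5]     # the coin is actually fair
belief = [0.9, 0.1]     # the agent is confident the coin is biased

h_belief = belief_entropy(belief)           # small: the agent feels certain
h_grounded = cross_entropy(true_p, belief)  # large: reality keeps surprising it
```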

A Bayesian Approach to Information
The primary motivation for this paper is developing a series of tools that help us overcome the limitations of traditional information theory as applied to machine learning. Specifically, probing representations requires a data-dependent information theory. We thus formulate analogues of surprisal, entropy, and MI in terms of Bayesian agents, using a framework heavily inspired by Bayesian experimental design (Lindley, 1956). We then prove this framework does not suffer the same infelicities as standard information theory in this context.

An Agent-Based Information Theory
Our discussions will focus on Bayesian agents, so we start by formally defining them.
Definition 1. A Bayesian agent is a parameterised probability distribution p_θ(x | θ) (or a set of such distributions) together with a prior p_θ(θ). (In the case that the agent has more than one distribution, we still assume only a single prior, without loss of generality; separate priors for each distribution are a special case in which the parameters are partitioned, e.g. for θ = [φ; ψ] we could define p_θ(θ) = p_θ(φ) · p_θ(ψ).) Given data d_N = {x^(1), ..., x^(N)}, the Bayesian posterior over θ is

    p_θ(θ | d_N) ∝ p_θ(θ) ∏_{n=1}^{N} p_θ(x^(n) | θ)

Analogously, the Bayesian belief is defined as the following posterior predictive distribution:

    p_θ(x | d_N) := ∫ p_θ(x | θ) p_θ(θ | d_N) dθ

Upon encountering an instance x, and after seeing a collection of data d_N, this agent's posterior-predictive Bayesian surprisal will be

    s_θ(x | d_N) := log (1 / p_θ(x | d_N))

where D_N is a data-valued random variable; for notational succinctness, we omit this random variable in the rest of the paper. We further define the posterior-predictive Bayesian entropy:

    H_θ(X | d_N) := E_{p(x)} [ log (1 / p_θ(x | d_N)) ]
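As a concrete special case of Definition 1, consider a Bernoulli belief with a conjugate Beta prior, for which the posterior predictive has a standard closed form. The sketch below (our toy data) computes the Bayesian surprisal and entropy; by Gibbs' inequality the Bayesian entropy, being a cross-entropy, upper-bounds the true entropy:

```python
import math

def posterior_predictive(data, alpha=1.0, beta=1.0):
    """p_theta(x = 1 | d_N) for a Bernoulli belief with a Beta(alpha, beta) prior."""
    return (sum(data) + alpha) / (len(data) + alpha + beta)

def bayesian_surprisal(x, data):
    p1 = posterior_predictive(data)
    return -math.log(p1 if x == 1 else 1.0 - p1)

def bayesian_entropy(true_p1, data):
    """H_theta(X | d_N): the expected Bayesian surprisal under the TRUE distribution."""
    return (true_p1 * bayesian_surprisal(1, data)
            + (1.0 - true_p1) * bayesian_surprisal(0, data))

true_p1 = 0.5                 # the source is a fair coin
data = [1, 1, 1, 0]           # a small, slightly unlucky sample
h_bayes = bayesian_entropy(true_p1, data)
h_true = math.log(2)          # the true entropy H(X)
```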
As can be readily seen, the Bayesian entropy is the expected value of the Bayesian surprisal, with this expectation taken over the true distribution. In this sense, the Bayesian entropy is a cross-entropy rather than a standard entropy. (We put this in contrast to the belief entropy of eq. (5), which takes this expectation over the belief itself; in practice, instances are encountered with the true frequency. This distinction has been explicitly noted before, e.g. by Bartlett, 1953.)

Bayesian Mutual Information
Defining Bayesian mutual information within our framework requires a bit more care. First, in contrast to surprisal and entropy, mutual information is a functional of two random variables; we will name the second random variable Y. To talk about mutual information, we will consider a Bayesian agent with a collection of at least two beliefs, e.g. {p_θ(x), p_θ(x | y)}. The second belief is conditional, but otherwise follows Definition 1.
Definition 2. Given a collection of data d_N = {(x^(1), y^(1)), ..., (x^(N), y^(N))}, and a Bayesian agent with a pair of beliefs p_θ(x | θ) and p_θ(x | y, θ) and a prior p_θ(θ), the Bayesian mutual information (Bayesian MI) is defined as

    I_θ(X; Y | d_N) := H_θ(X | d_N) − H_θ(X | Y, d_N)

There is an important distinction between the Bayesian and the Shannon MI: the Bayesian MI decomposes as the difference between two cross-entropies, as opposed to two entropies. (Cross mutual information (XMI) has been used in several previous works, e.g. Pimentel et al., 2019, 2020b; Bugliarello et al., 2020; McAllester and Stratos, 2020; Torroba Hennigen et al., 2020; Fernandes et al., 2021; O'Connor and Andreas, 2021. In those works, though, it was usually interpreted as a computational approximation to the true MI, or to V-information (Xu et al., 2020), which is discussed later in the paper. In this work, we highlight the Bayesian MI's, and the XMI's, relevance as a generalisation of Shannon's MI.)
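This definition can be evaluated exactly whenever the true joint is known. A small sketch (helper names ours) computes both cross-entropies under a toy joint, checking that the Bayesian MI reduces to the Shannon MI when the beliefs match the truth, and to zero when the conditional belief is uninformative:

```python
import math

def bayesian_mi(true_joint, pred_x, pred_x_given_y):
    """I_theta(X; Y | d_N) = H_theta(X | d_N) - H_theta(X | Y, d_N).

    Both terms are cross-entropies: expectations of belief surprisals under the
    TRUE joint (rows index x, columns index y); pred_x_given_y[y][x] is the
    agent's conditional belief."""
    nx, ny = len(true_joint), len(true_joint[0])
    px = [sum(row) for row in true_joint]
    h_x = -sum(px[x] * math.log(pred_x[x]) for x in range(nx) if px[x] > 0)
    h_x_y = -sum(true_joint[x][y] * math.log(pred_x_given_y[y][x])
                 for x in range(nx) for y in range(ny) if true_joint[x][y] > 0)
    return h_x - h_x_y

joint = [[0.4, 0.1],
         [0.1, 0.4]]
true_px = [0.5, 0.5]
true_cond = [[0.8, 0.2], [0.2, 0.8]]      # the true p(x | y), for y = 0, 1

# With beliefs equal to the truth, the Bayesian MI reduces to the Shannon MI.
mi_exact = bayesian_mi(joint, true_px, true_cond)
# With an uninformative conditional belief, no information is extracted.
mi_uniform = bayesian_mi(joint, true_px, [[0.5, 0.5], [0.5, 0.5]])
```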

An Illustrative Example
For the sake of argument, we assume two independent categorical random variables X and Y , both with c classes and uniformly distributed.
We further assume a Bayesian agent with two categorical beliefs {p_θ(x) = Cat(θ), p_θ(x | y) = Cat(θ + y)}, where y is assumed to be encoded as a one-hot vector, and a Dirichlet prior p_θ(θ) = Dir(α) with concentration parameters α = 1. Note that this (biased) agent believes that y and x are more likely than chance to share a class. Given no data, i.e. given d_0, this agent's prior predictive distribution p_θ(x | d_0) is uniform, while p_θ(x | y, d_0) places extra probability mass on the class encoded by y. In this example:

(i) Mutual Information. We have I(X; Y) = 0, because X and Y are independent by construction.
(ii) Belief Mutual Information. The belief MI is positive, since the agent's uncertainty about X is reduced by knowledge of Y: the prior predictive p_θ(x) is uniform, while the conditional distribution p_θ(x | y) is not, which reduces the belief entropy. This means the belief MI is strictly greater than zero.

(iii) Bayesian Mutual Information. Finally, the Bayesian MI is negative. Since X is uniformly distributed, the unconditional Bayesian entropy is tight, i.e. H_θ(X | d_0) = H(X), but the conditional one is not, i.e. H_θ(X | Y, d_0) > H(X | Y) = H(X).
We thus have I_θ(X; Y | d_0) < 0. This entails that, in this specific example, an agent's predictive power over X is lower when given Y.
This illustrates an important aspect of Bayesian MI: it is grounded in the true distribution.
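The example can be reproduced numerically for c = 2. Note two assumptions on our part: we normalise the conditional belief as Cat((θ + y) / 2) so that it remains a distribution, and we give the agent a uniform belief over Y when computing the belief MI:

```python
import math

c = 2               # number of classes; X and Y are independent and uniform
log = math.log

# Prior predictive beliefs of the (biased) agent at d_0. With theta ~ Dir(1) and
# the conditional belief normalised as Cat((theta + y) / 2), the prior predictive
# puts mass (1/c + 1) / 2 on the class encoded by y.
p_x = [1.0 / c] * c                              # p_theta(x | d_0): uniform
def p_x_given_y(x, y):                           # p_theta(x | y, d_0)
    return (1.0 / c + (1.0 if x == y else 0.0)) / 2.0

# (i) True MI: zero, since X and Y are independent by construction.
true_mi = 0.0

# (ii) Belief MI: both expectations taken under the agent's own beliefs
# (assuming a uniform belief over Y).
h_belief_x = log(c)
h_belief_x_y = -sum((1.0 / c) * p_x_given_y(x, y) * log(p_x_given_y(x, y))
                    for x in range(c) for y in range(c))
belief_mi = h_belief_x - h_belief_x_y

# (iii) Bayesian MI: expectations taken under the TRUE (uniform, independent) joint.
h_bayes_x = -sum((1.0 / c) * log(p_x[x]) for x in range(c))       # = log c, tight
h_bayes_x_y = -sum((1.0 / (c * c)) * log(p_x_given_y(x, y))
                   for x in range(c) for y in range(c))
bayes_mi = h_bayes_x - h_bayes_x_y
```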

Theoretical Properties
We now prove a few relevant theoretical properties about our framework. We show that Bayesian MI is symmetric if and only if the agent's beliefs respect Bayes' rule. Then, we discuss why it does not respect the data-processing inequality, and its connection to mutual information and to V-information (Xu et al., 2020).

When is Bayesian MI Symmetric?
It is a well-known result that Shannon's MI is symmetric, i.e.

    I(X; Y) = I(Y; X)

This means that the knowledge one can extract from random variable Y about X is the same as the knowledge one can extract from X about Y. This is not true in general for Bayesian MI; as we will show, information-theoretic symmetry and Bayes' rule are tightly related. As such, we consider in this section a Bayesian agent with a set of beliefs {p_θ(x | θ), p_θ(y | θ), p_θ(x | y, θ), p_θ(y | x, θ)}; the following theorem characterises when we have symmetry.

Theorem 1. An agent's Bayesian mutual information is symmetric, i.e.

    I_θ(X; Y | d_N) = I_θ(Y; X | d_N)

for all distributions p(x, y) if and only if the Bayesian agent is consistent.

Proof. See App. D.

No Data-Processing Inequality
Another classical result from information theory is the data-processing inequality. This theorem states that processing a random variable can never add information, only reduce it:

    I(X; f(Y)) ≤ I(X; Y)

Although theoretically sound, this theorem is very unintuitive from a practical perspective: effectively, processing noisy data can make it more useful. In fact, representation learning is a subfield of machine learning devoted precisely to finding functions which extract more "informative" representations from some input. One such example is BERT (Devlin et al., 2019), a large pre-trained language model which produces contextualised representations from sentential inputs. These representations provably contain the exact same information about any task as the original sentence (Pimentel et al., 2020b); in practice, though, they are much more useful for downstream models.
The data-processing inequality does not hold for Bayesian information, making it a more intuitive information-theoretic measure for probing; pre-trained representation extraction functions can increase the MI from a Bayesian agent's perspective.
Theorem 2. The data-processing inequality does not hold for Bayesian information; i.e., there exist true distributions p(x, y), Bayesian agents, and functions f(·) such that

    I_θ(X; f(Y) | d_N) > I_θ(X; Y | d_N)
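A concrete violation is easy to construct by hand. Below, Y is uniform over eight values, X = Y mod 2 is its parity, and f(y) = y mod 2 "processes" Y; after the same eight observations, a count-based agent extracts more Bayesian MI from the processed f(Y) than from the raw Y, because the processed conditioning cells accumulate evidence faster. The dataset and agent are our illustrative choices:

```python
import math

# The agent keeps one Dirichlet(1)-smoothed categorical belief over X per
# observed conditioning value (a special case of Definition 1).
def predictive(counts, x, num_classes=2, alpha=1.0):
    return (counts.get(x, 0) + alpha) / (sum(counts.values()) + alpha * num_classes)

# A small fixed dataset: each y in 0..7 observed once, with its true x = y % 2.
data = [(y % 2, y) for y in range(8)]

def h_cond(cond):
    """H_theta(X | cond(Y), d_8) under the true distribution (y uniform, x = y % 2)."""
    counts = {}
    for x, y in data:
        cell = counts.setdefault(cond(y), {})
        cell[x] = cell.get(x, 0) + 1
    return sum((1 / 8) * -math.log(predictive(counts[cond(y)], y % 2))
               for y in range(8))

h_x = -math.log(predictive({0: 4, 1: 4}, 0))       # = log 2: the belief is uniform
bmi_raw = h_x - h_cond(lambda y: y)                # conditioning on raw Y
bmi_processed = h_x - h_cond(lambda y: y % 2)      # conditioning on f(Y)
```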

Relation to Mutual Information
The relationship between the Bayesian mutual information and Shannon's MI is relevant for our discussion. As mentioned in the introduction, Shannon was concerned with the limits of communication when he defined his measure. We now put forward an intuitive theorem about Bayesian information: it is upper-bounded by the true MI under a weak assumption about the agent's beliefs.

Theorem 3. Assume the agent's belief p_θ(x | d_N) has a smaller Kullback-Leibler (KL) divergence from the true p(x) than the marginal of its conditional beliefs over y, i.e.

    KL( p(x) || p_θ(x | d_N) ) ≤ KL( p(x) || E_{p(y)} [ p_θ(x | y, d_N) ] )

Then, I_θ(X; Y | d_N) ≤ I(X; Y).
In other words, the information any agent can extract from a random variable Y about another variable X is upper-bounded by the true MI. We now define a well-formed belief, which we will use to analyse the Bayesian MI's convergence:

Definition 3. We say the belief of a Bayesian agent is well-formed if and only if the true distribution is a possible belief, i.e.

    p(·) ∈ { p_θ(· | θ) | p_θ(θ) > 0 }
Given this definition, we prove that the Bayesian mutual information converges to the true MI under well-defined conditions.

Theorem 4. Assume a Bayesian agent's set of beliefs and prior are well-formed and meet the conditions of the Bernstein-von Mises Theorem (pg. 339, Bickel and Doksum, 2001). (In the case where θ is discrete and finite, the only requirement is p_θ(θ) > 0 for all values of θ; Freedman, 1963.) Then,

    lim_{N→∞} I_θ(X; Y | d_N) = I(X; Y)

Proof. See App. G.
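This convergence can be observed in simulation. The sketch below (our toy joint and agent) traces the Bayesian MI of a Dirichlet(1)-smoothed categorical agent as its dataset grows: it starts at exactly zero and approaches the true MI:

```python
import math
import random

random.seed(0)

# True joint over (x, y), both binary; its Shannon MI is roughly 0.19 nats.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def true_mi():
    px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
    py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}
    return sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in joint.items())

def bayesian_mi(sample):
    """Bayesian MI of a Dirichlet(1)-smoothed categorical agent after `sample`,
    computed exactly under the true joint (a difference of cross-entropies)."""
    n = len(sample)
    cx = {x: sum(1 for xx, _ in sample if xx == x) for x in (0, 1)}
    cxy = {(x, y): sum(1 for pair in sample if pair == (x, y))
           for x in (0, 1) for y in (0, 1)}
    ny = {y: cxy[(0, y)] + cxy[(1, y)] for y in (0, 1)}
    h_x = -sum(p * math.log((cx[x] + 1) / (n + 2)) for (x, y), p in joint.items())
    h_x_y = -sum(p * math.log((cxy[(x, y)] + 1) / (ny[y] + 2))
                 for (x, y), p in joint.items())
    return h_x - h_x_y

pairs = random.choices(list(joint), weights=list(joint.values()), k=50_000)
curve = {n: bayesian_mi(pairs[:n]) for n in (0, 10, 100, 50_000)}
```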

Relation to Variational Information
Variational (V-) information (Xu et al., 2020) is a recent generalisation of mutual information. It extends MI to the case where a fixed family of distributions is considered, in which the true distribution may or may not be contained.
Definition 4. Suppose random variable X is distributed according to p(x). Let V be a variational family of distributions. The V-entropy is then defined as

    H_V(X) := inf_{q ∈ V} E_{p(x)} [ log (1 / q(x)) ]

and the V-information is defined as

    I_V(Y → X) := H_V(X) − H_V(X | Y)

The V-information, however, is not a function of the observed data d_N; thus, it does not meet our desiderata. Nonetheless, we can prove a straightforward relationship between the Bayesian and V informations, which we state below.
Theorem 5. Assume a Bayesian agent's beliefs and prior meet the conditions of Kleijn and van der Vaart (2012), who extend the Bernstein-von Mises Theorem to beliefs which are not well-formed. Further, let V = {p_θ(· | θ) | p_θ(θ) > 0}. Then,

    lim_{N→∞} I_θ(X; Y | d_N) = I_V(Y → X)

Proof. See App. H.
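For finite variational families, the V-quantities of Definition 4 reduce to minima over cross-entropies, which makes the contrast with richer families easy to compute. A sketch (the families below are our illustrative choices; Xu et al.'s definition allows general predictive families, here approximated by picking one member of V per value of y):

```python
import math

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def v_entropy(p, family):
    """H_V(X): the infimum over q in V of the cross-entropy between p and q."""
    return min(cross_entropy(p, q) for q in family)

def v_information(joint, family):
    """I_V(Y -> X) = H_V(X) - H_V(X | Y), picking one q in V per value of y."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_v_x = v_entropy(px, family)
    h_v_x_y = sum(py[y] * v_entropy([joint[x][y] / py[y] for x in range(len(px))],
                                    family)
                  for y in range(len(py)) if py[y] > 0)
    return h_v_x - h_v_x_y

joint = [[0.4, 0.1],
         [0.1, 0.4]]

# A family rich enough to contain the true conditionals...
rich = [[t / 10, 1 - t / 10] for t in range(1, 10)]
# ...and a restricted family that cannot express them.
restricted = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]

iv_rich = v_information(joint, rich)
iv_restricted = v_information(joint, restricted)
```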

A Framework for Incremental Probing
The proposed Bayesian framework for information allows us to take into account the amount of data we have for probing. Crucially, previous work (Pimentel et al., 2020b) failed to adequately account for the observation of data. In doing so, they only analysed the limiting behaviour of information, under which the probing enterprise is not fully sensible: given unlimited data and computation, there is no point in using pre-trained functions. Indeed, the higher-level motivation of this work is to find an information-theoretic framework which serves machine learning, and under which the goal of probing is inherently clear. To that end, we propose a relatively simple experimental design: we compute the Bayesian mutual information, which is a function of the amount of data, to create several learning curves.
Notation. We define a sentence-level random variable S, with instances s taken from V*, the Kleene closure of a potentially infinite vocabulary V. We further define a representation-valued random variable R and a task-valued random variable T, with instances r ∈ R^d and t ∈ T respectively, where T is the set of possible values for the analysed task (e.g. the set of parts of speech in a language).

Probes as Bayesian Agents
The overall trend in NLP is to train supervised probabilistic models on task-specific data. We believe probabilistic probes should analogously be modelled this way, leading to results compatible with our empirical intuitions. We thus define a probe agent as a Bayesian agent with the pair of beliefs {p_θ(t | θ), p_θ(t | r, θ)} and a prior p_θ(θ). Any prior p_θ(θ) could be chosen for our probing agents; nonetheless, we have no a priori knowledge of how the representations should impact our prediction task. As such, our priors are chosen such that the initial distributions p_θ(t | d_0) and p_θ(t | r, d_0) are identical. A logical conclusion is that the prior Bayesian MI should be zero:

    I_θ(T; R | d_0) = 0

On the opposite extreme, i.e. given unlimited data, a well-formed belief will likely converge to the true distribution, yielding the same results as Pimentel et al. (2020b). Complementarily, an ill-formed belief will converge to the V-information:

    lim_{N→∞} I_θ(T; R | d_N) = I_V(R → T)

The novelty of our framework lies in the explicit analysis of information under finite data. Bayesian agents are used here to measure a notion of information directly related to ease of extraction, i.e. how much information could be extracted from the representations by a naïve agent with no a priori knowledge about the task itself. In other words, we ask the question: given a specific dataset d_N, how much information do the representations yield about this task? This value is only a portion of the true MI, being upper-bounded by it.

Why Bayesian MI and not Bayesian entropy?
We focus our analysis on the amount of information a Bayesian agent can extract from the representations about the task. However, we could just as easily analyse the Bayesian entropy instead. We believe, though, that the Bayesian MI is an inherently more intuitive value than the entropy. This is because the mutual information puts the Bayesian entropy in perspective against a trivial baseline: how much uncertainty would there be without the representations? Furthermore, it has a much more interpretable value: with no data its value is zero, while in the limit it converges to the true mutual information. In this paper, we are concerned with its trajectory, i.e., how fast the Bayesian MI goes up.

Ease of Extraction and Previous Work
Generally speaking, the goal of probing is to test whether a set of contextual representations encodes a certain linguistic property (Adi et al., 2017; Belinkov et al., 2017; Tenney et al., 2019; Liu et al., 2021, inter alia). Most work in this field claims that, when performing this analysis, we should prefer simple models as probes (Alain and Bengio, 2016; Hewitt and Liang, 2019; Voita and Titov, 2020). This is in line with Pimentel et al.'s results: using a complex probe (complex enough to ensure it is well-formed) with infinite data, we would estimate I(S; T), a value which does not meaningfully inform us about the representations themselves. Defining model complexity, though, is not trivial (for a longer discussion, see Pimentel et al., 2020a). For this reason, many works limit themselves to studying only linearly encoded information (e.g. Alain and Bengio, 2016; Hewitt and Manning, 2019; Hall Maudslay et al., 2020) or a subset of neurons at a time (e.g. Torroba Hennigen et al., 2020; Mu and Andreas, 2020; Durrani et al., 2020). However, restricting our analysis this way seems arbitrary.
A few recent papers have tried to deal with probe complexity in a more nuanced way. Hewitt and Liang (2019) argue for the use of selectivity to control for probe complexity. Voita and Titov (2020) and Whitney et al. (2020) use, respectively, minimum description length (MDL) and surplus description length (SDL) to measure the size (in bits) of the probe model. Pimentel et al. (2020a) argue probe complexity and accuracy should be seen as a Pareto trade-off, and propose new metrics to measure probe complexity. All of these papers define ease of extraction in terms of properties of the probe, e.g., its complexity and size.
We argue here for an opposing view of ease of extraction: instead of focusing on the complexity of the probes, we should define it according to the complexity of the task. We further operationalise this notion by explicitly limiting the amount of data available to solve the task.

Probe
Our experiments focus on Bayesian agents with multi-layer perceptron (MLP) beliefs: a conditional belief p_θ(t | r, θ), parameterised by an MLP (eq. 28), and an unconditional categorical belief p_θ(t | θ) (eq. 29), where φ are the MLP parameters, ψ ∈ N^{|T|} is a count vector, and θ = [φ; ψ] are the agent's parameters. This agent has a Gaussian prior over the parameters φ (with zero mean and standard deviation σ = 10), and a Dirichlet prior over ψ (with concentration parameter α = 1).

As we show later in the paper, this background knowledge about the task can also be formally defined as a Bayesian mutual information: the information the observed data provide about the model parameters, I_θ(D_N → Θ). Our code is available at https://www.github.com/rycolab/bayesian-mi. We use the pre-trained models made available by the transformers library (Wolf et al., 2019).
As previously discussed, the Gaussian and Dirichlet priors on the parameters cause these models to initially place a uniform distribution on the output classes; as such, they have an initial Bayesian MI of zero. We then expose the probe agent to increasingly larger sets of data from the task. Unfortunately, the posterior of eq. (28) has no closed-form solution, so we approximate it with the maximum-a-posteriori (MAP) estimate p_θ(t | r, θ*), where θ* = argmax_{θ ∈ Θ} p_θ(θ | d_N). We obtain this MAP estimate using the gradient descent method AdamW (Loshchilov and Hutter, 2019) with a cross-entropy loss and L2-norm regularisation. The posterior predictive belief of eq. (29), in turn, has a closed-form solution:

    p_θ(t | d_N) = (count(d_N, t) + α) / (N + α |T|)

where count(d_N, t) is the number of observed instances of class t.

For both analysed tasks, we run 50 experiments with log-linearly increasing data sizes, from 1 instance to the whole language's treebank. For each of these individual experiments, we sample an MLP probe configuration: the probe has 0, 1, or 2 layers (where 0 layers means a linear probe), dropout between 0 and 0.5, and a hidden size from 32 to 1024 (log-distributed). We then use the same architecture to train a probe for each of our analysed representations, plotting their Pareto curves.

These curves convey a few interesting results. The first is the intuitive fact that information is much harder to extract from random embeddings, although with enough training data their results slowly converge to near the fastText ones; this can be seen most clearly in English. This matches our theoretical framework: the true mutual information between the target task and either fastText or random embeddings is the same; thus, if our beliefs are well-formed, the Bayesian MI should converge to this value, although at different speeds.
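The closed-form posterior predictive for the count-based belief is simply the mean of the Dirichlet posterior. A quick Monte Carlo sketch (with hypothetical class counts) confirms this:

```python
import random

random.seed(0)

def posterior_predictive(counts, alpha=1.0):
    """Closed form p_theta(t | d_N) = (count(d_N, t) + alpha) / (N + alpha * |T|)."""
    n, k = sum(counts), len(counts)
    return [(c + alpha) / (n + alpha * k) for c in counts]

def sample_dirichlet(params):
    gammas = [random.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]

counts = [7, 2, 1]          # hypothetical class counts for a 3-class task
closed_form = posterior_predictive(counts)

# The closed form is the mean of the Dirichlet(counts + 1) posterior; check it
# by Monte Carlo over posterior samples of theta.
samples = [sample_dirichlet([c + 1.0 for c in counts]) for _ in range(100_000)]
mc_mean = [sum(s[i] for s in samples) / len(samples) for i in range(len(counts))]
```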
The second result is that ALBERT makes information more easily extractable than either BERT or RoBERTa in English, and that multilingual BERT is roughly as informative as fastText under the finite-data scenarios of the other analysed languages. Finally, the last result goes against one of the claims of Pimentel et al. (2020a), who, in light of their flat Pareto curves for POS tagging, claimed that we needed harder tasks for probing. One only needs harder tasks if one's measure of complexity is not nuanced enough; as we see, even POS tagging is hard under the low-resource scenarios presented in our learning curves.

Fig. 3 presents results for dependency arc labelling. These learning curves also present interesting trends. While the POS tagging curves seem to be on the verge of convergence for English, Basque and Turkish, this is not the case for dependency arc labelling. This implies that, as expected, either dependency arc labelling is an inherently harder task, or the representations encode the necessary information in a harder-to-extract manner. These results also highlight the importance of an information-theoretic measure being able to capture negative information, as evinced in Fig. 4. In the low-data scenario, the BERToid models hurt performance, as opposed to helping. This is because high-dimensional representations, together with a weak prior, allow the agent to easily overfit to the little presented evidence. fastText, on the other hand, does not present the same problem, having a positive Bayesian MI even in the low-data setting.

An Intuitive Decomposition
We now present some basic results about our framework which, although not strictly necessary for the present study, help motivate it. They also serve as a justification for our choice of the cross-entropy when formalising the Bayesian entropy. With this in mind, we analyse information from the perspective of a fully Bayesian agent with a well-formed belief. A classic decomposition of the cross-entropy is the following:

    H(p, q) = H(p) + KL(p || q)

We posit a new interpretation of this equality.

Theorem 6. Let Θ be a parameter-valued random variable. The entropy of a consistent Bayesian agent with well-formed beliefs decomposes as

    H_θ(X | d_N) = H(X) + I_θ(X → Θ | d_N)

Proof. See App. I.
In other words, the cross-entropy is composed of the sum of the entropy itself, i.e. the "true" information the data source provides (or its inherent uncertainty), and how much information the data provide about the distribution itself.
Relation to SDL. The minimum description length (MDL; Voita and Titov, 2020) is a probing metric defined as H_θ(D_N). In its online-coding interpretation, it is rewritten as (Rissanen, 1978; Blier and Ollivier, 2018):

    H_θ(D_N) = Σ_{n=1}^{N} H_θ(X | D_{n−1})

where the cross-entropy of each element X in D_N is computed incrementally, because the parameter θ (which would make them independent) is unknown. The surplus description length (SDL; Whitney et al., 2020) is defined as the difference between a dataset's cross-entropy and its entropy:

    SDL := H_θ(D_N) − H(D_N)

Using Theorem 6, we derive a new interpretation for SDL:

    SDL = I_θ(D_N → Θ | d_0)

where we use prior predictive distributions, as opposed to posterior predictive ones. From this equation, we find that SDL is the information a dataset gives a Bayesian agent about its model parameters.
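The online-coding interpretation of MDL can be verified exactly: the prequential sum of posterior-predictive surprisals telescopes into the negative log marginal likelihood of the dataset. A sketch for a Dirichlet-categorical agent (our toy sequence):

```python
import math

ALPHA, K = 1.0, 3        # Dirichlet concentration and number of classes

def prequential_codelength(data):
    """Online-coding MDL: the sum of posterior-predictive surprisals -log p(x_n | x_<n)."""
    counts = [0] * K
    total = 0.0
    for n, x in enumerate(data):
        total += -math.log((counts[x] + ALPHA) / (n + ALPHA * K))
        counts[x] += 1
    return total

def neg_log_marginal(data):
    """-log p(d_N): the exact Dirichlet-multinomial marginal likelihood, via lgamma."""
    counts = [data.count(t) for t in range(K)]
    out = math.lgamma(len(data) + ALPHA * K) - math.lgamma(ALPHA * K)
    out -= sum(math.lgamma(c + ALPHA) - math.lgamma(ALPHA) for c in counts)
    return out

data = [0, 1, 0, 2, 0, 0, 1, 0]     # a toy class sequence
mdl = prequential_codelength(data)
```

The codelength also grows without bound as the dataset grows, matching the divergence discussed below.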
While closely related to one another, the Bayesian MI, MDL, and SDL converge to different values in the limit of infinite dataset sizes. It is easy to see that MDL goes to infinity as the dataset size grows: H_θ(D_N) grows at least linearly with the data size. The reasons behind SDL also exploding as the data increase are less straightforward, but become clear from its Bayesian MI interpretation: if the parameter space is continuous, and if the Bayesian belief converges in the limit of infinite data (as per Theorem 4), the Bayesian mutual information in eq. (32) will naturally go to infinity. We thus argue that Bayesian mutual information is a better measure for probing than either MDL or SDL; although all are sensitive to the observed dataset size, Bayesian MI is the only one that does not diverge as this size grows.

Conclusion
In this paper we proposed an information-theoretic framework to analyse mutual information from the perspective of a Bayesian agent; we term this Bayesian mutual information. This framework has intuitive properties (at least from a machine learning perspective), which traditional information theory does not, for example: data can be informative, processing can help, and information can hurt.
In the experimental portion of our paper, we use Bayesian mutual information to probe representations for both part-of-speech tagging and dependency arc labelling. We show that ALBERT is the most informative of the analysed representations in English, and that high-dimensional representations can provide negative information in low-data scenarios.

A Ill-formed Beliefs Lose Information
For the sake of argument, we now assume an agent with an ill-formed belief p_θ(t | θ) and a prior p_θ(θ). We will show that such Bayesian agents lose information, meaning that they will not obtain as much information about their optimal parameters as if they had a well-formed belief.
Theorem 7. Assume θ* are the optimal parameters for a Bayesian agent with ill-formed, but consistent, beliefs. The information this agent will receive about its optimal parameters is strictly smaller than the information a well-formed agent would receive.

Proof. This proof follows from the definition of the Bayesian MI, from this Bayesian agent having consistent beliefs (which gives us symmetry), and from the fact that the cross-entropy is an upper bound on the entropy, with equality only when both probability distributions are the same, which is by definition not possible here, since p_θ(t | θ*) ≠ p(t).

B Measures of Information
Several other measures of information have been proposed, among them the H-entropy (DeGroot, 1962), the Rényi entropy (Rényi, 1961; Lenzi et al., 2000), Bayes vulnerability (Alvim et al., 2019), and the determinantal mutual information (DMI; Kong, 2020). None of these take an agent's belief into consideration, and so our analysis is orthogonal to them. The work most similar to ours, in this respect, is Clarkson et al.'s (2005) investigation of how belief impacts information leakage, and its extension, by Hamadou et al. (2010), to the Rényi min-entropy. Importantly, the results obtained by Clarkson et al. can be similarly derived using our framework.

C A Note on Empirical Limitations
Estimating the true MI between two random variables is known to be a hard problem, for which several methods have been proposed (for a detailed review, see McAllester and Stratos, 2020); estimating the Bayesian MI may be equally challenging. Given knowledge of p_θ(·) and access to samples from p(·), the Bayesian MI can be trivially estimated using the Bayesian surprisal's sample mean. On the other hand, in a setting such as active learning, where one (by definition) does not have access to the true distribution p(y | x), only to the belief, the best approximation to the Bayesian MI may indeed be the belief MI (used by Houlsby et al., 2011) or the Bayesian surprise (used by Storck et al., 1995, and Itti and Baldi, 2006, 2009). Finally, approximating the Bayesian MI in the cognitive sciences may be an even harder problem than estimating the true MI, since it would require approximating both the belief p_θ(·) of a specific agent and the true distribution p(·) of an event.
D Proof of Symmetric Bayesian Mutual Information, Theorem 1

Theorem 1. An agent's Bayesian mutual information is symmetric, i.e.

    I_θ(X; Y | d_N) = I_θ(Y; X | d_N)

for all distributions p(x, y) if and only if the Bayesian agent is consistent.
Proof. We will first prove that if the Bayesian MI is symmetric for all true distributions p(x, y), then the Bayesian agent is consistent (the if case). We then prove the inverse proposition (the only if case), completing the proof of this if-and-only-if theorem.

G Proof of the Convergence to the Mutual Information, Theorem 4

Now, we apply the continuous mapping theorem to analyse the convergence of the Bayesian entropy:

    lim_{N→∞} H_θ(X | d_N) = lim_{N→∞} E_{p(x)} [ log (1 / p_θ(x | d_N)) ] =_(1) E_{p(x)} [ log (1 / p(x)) ] = H(X)

where (1) relies on the continuous mapping theorem. A similar convergence applies to H_θ(X | Y, d_N). Finally, we can complete the proof:

    lim_{N→∞} I_θ(X; Y | d_N) = lim_{N→∞} H_θ(X | d_N) − lim_{N→∞} H_θ(X | Y, d_N) = H(X) − H(X | Y) = I(X; Y)

H Proof of the Convergence to V-information, Theorem 5

Theorem 5. Assume a Bayesian agent's beliefs and prior meet the conditions of Kleijn and van der Vaart (2012), who extend the Bernstein-von Mises Theorem to beliefs which are not well-formed. Further, let V = {p_θ(· | θ) | p_θ(θ) > 0}. Then,

    lim_{N→∞} I_θ(X; Y | d_N) = I_V(Y → X)

Proof. Kleijn and van der Vaart (2012) extend the Bernstein-von Mises Theorem to ill-formed beliefs, showing that, under specific conditions on the Bayesian belief and prior, the predictive posterior distribution converges to

    lim_{N→∞} p_θ(x | d_N) = p_θ(x | θ*)

where θ* is the unique set of parameters which minimises the KL-divergence between p_θ(x | θ) and the true distribution p(x), i.e.

    p_θ(x | θ*) = arg inf_{q ∈ V} Σ_{x ∈ X} p(x) log (1 / q(x))

Given this convergence property, we can finish the proof similarly to the one for well-formed beliefs:

    lim_{N→∞} H_θ(X | d_N) =_(1) E_{p(x)} [ log (1 / p_θ(x | θ*)) ] = H_V(X)

where (1) relies on the continuous mapping theorem, and analogously for H_θ(X | Y, d_N). We now conclude the proof:

    lim_{N→∞} I_θ(X; Y | d_N) = H_V(X) − H_V(X | Y) = I_V(Y → X)