This video explains the concepts of free energy, entropy, relatively finite variance and KL divergence in the context of Watanabe's free energy asymptotic formula. It shows how the average log loss and the log density ratio are used to compare two distributions, and how Watanabe's relatively finite variance condition ensures essential uniqueness. The session closes by stating the most general published form of the free energy asymptotic expansion known to the speaker (Theorem 11, Section 5.3 of the Green Book). Regularity and realizability each imply Watanabe's condition.

A picture is drawn of a true distribution lying outside the model, with a region of posterior concentration away from W0. The aim is to understand when a free energy asymptotic formula can describe the free energy, defined as the negative log partition function, on such a region. For this, a condition weaker than realizability, called relatively finite variance, is imposed on the log density ratio f(x, w0, w), the integrand whose average under q(x) compares p(x|w0) with p(x|w). Minimizing the average log loss L is the same as minimizing K, because L and K differ only by the entropy of the true distribution, which does not depend on w.

KL divergence is a measure of the difference between two distributions. The denominator in Watanabe's condition, the integral over x of q(x) times the log ratio of p(x|w0) and p(x|w), is not itself the KL divergence between p(x|w0) and p(x|w), but it can be read as the expected difference in the number of bits needed to encode observations with each of the two distributions, and in the realizable case it reduces to K(w). Watanabe's relatively finite variance condition implies essential uniqueness, meaning that any two optimal parameters w0 and w0' define the same distribution: p(x|w0) = p(x|w0'). Regularity and realizability each imply Watanabe's condition. The theorem stated at the end is the most general published form of the free energy asymptotic formula known to the speaker.

The aim is to understand when a free energy asymptotic formula can be used to describe the free energy, defined as the negative log partition function, when the partition function is an integral over a region V of parameter space rather than over all of W. The goal is to derive this from statements in the Green Book, Mathematical Theory of Bayesian Statistics by Watanabe. In the Gray Book the formula is available only when the true distribution is realizable, but we are also interested in the more general situation where the posterior concentrates locally in a region away from W0. Definitions from the Green Book are used to handle this situation, starting with relatively finite variance.
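For orientation, the objects in this paragraph can be written out as follows. This is a sketch in the standard notation of Watanabe's books, not a transcription of the board: \varphi denotes the prior, X_1, ..., X_n the samples, V a region of parameter space, and \lambda the learning coefficient. The leading-order shape in the last line is recalled from the general theory and is an assumption here, since the session only refers to it as "this formula".

```latex
\begin{align*}
Z_n(V) &= \int_V \varphi(w) \prod_{i=1}^{n} p(X_i \mid w)\, dw, \\
F_n(V) &= -\log Z_n(V), \\
L_n(w) &= -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i \mid w), \\
F_n &\approx n L_n(w_0) + \lambda \log n .
\end{align*}
```

Taking V = W recovers the usual free energy; the question in the session is when the last line still holds for a region V that does not meet W0.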

A picture is drawn of a true distribution q(x) lying outside the model, with a region of posterior concentration away from W0. The log density ratio f(x, w0, w) is the integrand whose average under q(x) compares p(x|w0) with p(x|w). In the Green Book, W0 is the set of parameters that minimize the average log loss L, not the set where K is zero. Minimizing L is the same as minimizing K, because the two differ only by the entropy of the true distribution. A condition weaker than realizability, called relatively finite variance, is then imposed.
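In symbols, the quantities named in this paragraph are the following. The notation S for the entropy of the true distribution is introduced here for illustration; the rest follows the names used in the session.

```latex
\begin{align*}
f(x, w_0, w) &= \log \frac{p(x \mid w_0)}{p(x \mid w)}, \\
L(w) &= -\int q(x) \log p(x \mid w)\, dx, \\
K(w) &= \int q(x) \log \frac{q(x)}{p(x \mid w)}\, dx, \\
S &= -\int q(x) \log q(x)\, dx, \\
L(w) &= S + K(w), \\
\int q(x)\, f(x, w_0, w)\, dx &= L(w) - L(w_0) = K(w) - K(w_0).
\end{align*}
```

Since S does not depend on w, the minimizers of L and K coincide, which is why W0 can be defined through either one.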

A model has relatively finite variance if there exists a constant c0 > 0 such that, for any optimal parameter w0 and any w in W, the integral over q(x) of the log density ratio is at least c0 times the integral over q(x) of its square. Dividing through, this means that 1/c0 is an upper bound for the ratio of the second moment to the mean. The variance is finite relative to the term on the bottom, which plays the role of L but measured from w0 instead of from the true distribution: it equals K(w) - K(w0). This term measures how far p(x|w) is from p(x|w0) as seen from the true distribution.
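Written out, the condition reads as below. This is the form recalled from the Green Book's definition and reconstructed from the spoken description (the board itself is not in the transcript), so treat the exact statement as an assumption to be checked against page 72; W_opt is the name the session adopts for the set of optimal parameters.

```latex
\exists\, c_0 > 0 \ \text{such that for all } w_0 \in W_{\mathrm{opt}},\ w \in W: \quad
\int q(x)\, f(x, w_0, w)\, dx \;\ge\; c_0 \int q(x)\, f(x, w_0, w)^2\, dx,
\qquad \text{equivalently} \qquad
\frac{\int q(x)\, f(x, w_0, w)^2\, dx}{\int q(x)\, f(x, w_0, w)\, dx} \;\le\; \frac{1}{c_0}
\quad \text{whenever the denominator is positive.}
```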

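KL divergence is a measure of the difference between two distributions. The denominator in the relatively finite variance condition is not the KL divergence between p(x|w0) and p(x|w), because the average is taken under the true distribution q(x), but it can still be read as the expected difference in the number of bits needed to encode observations with each of the two distributions. It is the difference of the integrals over x of q(x) log p(x|w0) and q(x) log p(x|w), and the case of interest is w in a small neighbourhood of w0. This quantity provides an information theoretic measure of the distance between the two distributions.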
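The expected code-length difference and the ratio in the relatively finite variance condition can be checked numerically on a toy problem. The sketch below is not from the session: it assumes a truth q = N(0, 1) and a model p(x|w) = N(w, 1), and all function names are hypothetical. It evaluates the two integrals with Gauss-Hermite quadrature, once in a realizable setting (W = [-2, 2], optimal parameter 0) and once in a non-realizable one (W = [1, 3], optimal parameter 1).

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite rule for the weight exp(-x^2 / 2); dividing the weights by
# sqrt(2*pi) turns it into an expectation under the assumed truth q = N(0, 1).
NODES, WEIGHTS = hermegauss(40)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)


def log_density_ratio(x, w0, w):
    """f(x, w0, w) = log p(x|w0) - log p(x|w) for the toy model p(x|w) = N(w, 1)."""
    return -0.5 * (x - w0) ** 2 + 0.5 * (x - w) ** 2


def mean_and_second_moment(w0, w):
    """Return (E_q[f], E_q[f^2]) for the toy truth and model."""
    f = log_density_ratio(NODES, w0, w)
    return float(np.sum(WEIGHTS * f)), float(np.sum(WEIGHTS * f ** 2))


def rfv_ratio(w0, w):
    """E_q[f] / E_q[f^2]; the condition asks for a uniform lower bound c0 > 0."""
    mean, second = mean_and_second_moment(w0, w)
    return mean / second


# Realizable toy case: W = [-2, 2], optimal parameter 0. The ratio is even in w,
# so scanning w > 0 is enough (w = 0 itself gives 0/0 and is skipped).
realizable_c0 = min(rfv_ratio(0.0, w) for w in np.linspace(0.05, 2.0, 40))

# Non-realizable toy case: W = [1, 3], so the closest parameter to the truth is 1.
nonrealizable_c0 = min(rfv_ratio(1.0, w) for w in np.linspace(1.05, 3.0, 40))

# Expected extra code length, in bits, from using p(x|2) instead of p(x|1).
extra_bits = mean_and_second_moment(1.0, 2.0)[0] / np.log(2.0)

print(realizable_c0, nonrealizable_c0, extra_bits)
```

For this toy model the integrals can also be done by hand: the ratio is 1 / (2 + w^2 / 2) in the realizable case, so the smallest value on the grid is 0.25 at w = 2, and in the non-realizable case the smallest value is 0.2 at w = 3. Both are bounded away from zero, which is what the condition asks for on these bounded parameter sets.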

Watanabe's relatively finite variance condition implies essential uniqueness. This does not say that the optimal parameter is unique; it says that any two optimal parameters w0 and w0' give the same distribution, p(x|w0) = p(x|w0'). Watanabe also proves that realizability and regularity each imply relatively finite variance. This is the setup for stating the free energy asymptotic formula in the most general published form known to the speaker.
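The implications in this paragraph can be summarized as a small diagram. The symbol W_opt for the set of optimal parameters follows the name adopted in the session; the diagram itself is only a restatement of the spoken claims.

```latex
\begin{gather*}
\text{realizable} \;\Longrightarrow\; \text{relatively finite variance}
\;\Longleftarrow\; \text{regular}, \\
\text{relatively finite variance} \;\Longrightarrow\;
p(x \mid w_0) = p(x \mid w_0') \quad \text{for all } w_0, w_0' \in W_{\mathrm{opt}} .
\end{gather*}
```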

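Assuming relatively finite variance, p(x|w0) does not depend on the choice of optimal parameter w0, so K(w) can be defined with p(x|w0) in place of q(x); in the realizable case this is the usual K. The (model, truth, prior) triplet is in standard form if there exist functions a(x,w) and b(w), with b(w) positive on a neighbourhood of zero, such that the log density ratio function is a(x,w) times some monomial, and K and the prior are given by corresponding expressions with multi-indices k and h. Standard form is the target of the resolution procedure. Theorem 11 of Section 5.3 of the Green Book (page 153) states that if the model has relatively finite variance and is in standard form, the free energy has an asymptotic expansion starting with n L_n(w0), followed by a random variable, plus a term converging to zero in probability as n goes to infinity.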
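For reference, the shape of these statements as usually written is sketched below. The session only says "given by this" for K and the prior and "minus log some stuff" for the random term, so the explicit monomials, the \lambda \log n and (m - 1) \log\log n terms, and the placeholder R_n for the random variable are recalled from the standard statement and should be checked against pages 137 and 153 of the Green Book. Here \lambda is the learning coefficient and m its multiplicity.

```latex
\begin{align*}
K(w) &= w^{2k} = w_1^{2k_1} \cdots w_d^{2k_d}, \\
f(x, w_0, w) &= a(x, w)\, w^{k}, \\
\varphi(w) &= b(w)\, \lvert w^{h} \rvert, \qquad b(w) > 0, \\
F_n &= n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + R_n + o_p(1).
\end{align*}
```

In this notation k and h are multi-indices, R_n stands for the random variable mentioned in the theorem, and o_p(1) is the term converging to zero in probability.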

Okay, so, working session two. Recall that the aim for the moment is to work out in what situations we can use this free energy asymptotic formula, which roughly speaking says something like this: whether we can use it to describe the free energy, defined as the negative log partition function, where the partition function is the integral over some region of W. The way that Watanabe talks about it, and the way that Edmund was just talking about it, is that the partition function is an integral over all of W, but we can define it alternatively as an integral over some region V. If V happens to contain some singularity, then integrating over V and taking the negative logarithm gives us a function of V, and we want to know when we can use a similar formula to this one to talk about F_n(V). Okay, so that was what I introduced last time; I'll say no more about it.

What I want to do now is clarify some suggestions that Edmund made last time for trying to derive this from statements that are already in the green book. Remember, the green book is Mathematical Theory of Bayesian Statistics by Watanabe. The WBIC paper, which we also talked about last time, has a version of this free energy asymptotic formula. I'll just briefly recall that in the gray book you have this formula (star), but only in the case where the true distribution is realizable. So that's fine: in the picture I drew, this singular set was meant to be W0, and if I restrict my consideration to V, while V still contains a parameter that gives the true distribution, it intersects W0, so we can use the results in the gray book in principle to talk about this asymptotic expansion of the free energy. But we're also interested in a more general situation where we're just talking about a local concentration of the posterior which is away from W0, as does happen in practice, and we want to know under what conditions we can get something like the formula (star) for the negative logarithm of the integral of the posterior over a region like that, say V'.

In the case where n goes to infinity, of course, you'd expect that to go to zero: the posterior can't concentrate away from W0, as long as it's not empty, as n goes to infinity. But for finite n we are interested in that. Okay, so I'm just going to state some definitions from the green book which we're going to use, and the first one is relatively finite variance. So, from the green book:

The notation is the standard notation, so p and q and w and x I'll just take as understood. Consider, for a pair w0 and w in the space of parameters, the log density ratio (that's just a name) f(x, w0, w), which is the thing you'd write down in order to integrate over x. So if p(x|w0) were the true distribution, this is the density, meaning the thing that you integrate to give K(w): K(w) is the integral of q(x) f(x, w0, w) dx. So it's a density for the KL divergence. But I'm not assuming that; in this formula for the log density ratio, w0 is just any other point of W. I'm thinking about it as the thing I'm using to measure the density.

Okay, maybe a picture helps here. I'm assuming there is a true distribution, but I'm not assuming it's realizable. So how should I draw this picture? Here's q(x), and I'm considering now multiple points. Well, maybe I should draw it a bit better than that. For the reasons I just elaborated, we want to consider the case where the true distribution is not realizable, because that's what's happening here: the region doesn't intersect W0. If we consider a phase, that is, a region of posterior concentration away from W0, then the non-realizable case becomes important. The green book addresses this to a greater degree than the gray book, and I think it's more or less at the same level as the WBIC paper, and maybe it's been cleaned up terminologically.

Okay, so suppose the true distribution is out here, and this is the picture Watanabe draws. In the green book, W0 doesn't mean where K is zero; it means those parameters that minimize the average log loss. Maybe I'll just put up the definition of L here. Let's remember L: L is the average of the L_n, right? So that's the average log loss; it differs by the entropy of the true distribution from K. So, minus the integral of q(x) log p(x|w) dx. Okay, so minimizing this is the same as minimizing K.

As I've drawn this picture, to the right are these points w0 and w0'. How are you supposed to read W itself? Well, you could read it as the area to the right, I guess, which I'll draw in green. So this stuff over here is all W, and w0 and w0' are the closest that you can get to the true distribution within W. The realizable case would be where that distance here is zero for one of these parameters, and therefore for both. But in between, we can impose a condition which is a bit weaker than realizable, which is what Watanabe calls relatively finite

variance. So we say... I forget, Edmund, the precise terminology here: is it that the true distribution has relatively finite variance for the model, or that the triple has relatively finite variance? I can't quite remember. I think it's just, actually it's just the model, isn't it? Not even the prior. Oh, that's right, yeah, there's no role for the prior. Relatively finite variance of the log density ratio function, literally. Yeah, so it depends on p and q, so let's just say the pair has relatively finite variance. Wait, not even q? Well, q is in L, right? I don't know how we would make sense of W0 otherwise. Ah, it doesn't know that, it's just any... yeah, okay, that's right. Good, glad I asked.

Should we start calling the W0 that you wrote something like W optimal, or W opt, or something? Yeah, I agree, I like that suggestion. Let's call it W optimal.

Okay, so we say the model has relatively finite variance if (and I don't think this will need explanation) there exists a constant such that for any pair... so this is the integral over q(x) of this log density ratio... satisfies this.

I have a question. You define the log density ratio for any pair w0, w, but in the book w0 is in W optimal. Does that make a difference?

Oh yeah, you're right, sorry. In all of this w0 should be in W optimal. Thank you. I guess the error doesn't matter for what I just said, but for the statement that wants to interpret this I guess it's necessary. Right: for w0 in W optimal and w in W, this condition.

Okay, so I don't claim to really understand what this means, but the way to read it is to divide through by the variance on the right-hand side. So, relatively finite variance: what does the finiteness mean? It means that 1/c0 is an upper bound for this term here (I'm on the second board now). So the variance is finite relative to... I mean, this term on the bottom is kind of a substitute for L, right, but where you use w0 instead of the true distribution. I'll update the picture in a second. So how do I think about the denominator? Let's not worry about the numerator for a moment. I've got my picture: here's q(x), here's my w0, and suppose I pick a w here. Remember, everything to the right of that line is meant to be interpreted as W. So think about this distance here. If I was to measure the difference in the KL divergence

so D_KL from this to this, that's the integral of log... okay. But what is this term here? Well, by definition that's the integral over q(x) of the log of that density ratio, which is this here. So this quantity here is not the KL divergence between w and w0, right, but it's this quantity, using the actual true distribution.

We're pointing out that if it is realizable it reduces to K(w).

That's right, yeah. So in the realizable case, this is actually the distribution: p(x|w0) is equal to q(x). I guess I'm drawing this picture in a weird way, right? Over here I've got parameters and over here I've got distributions. I don't know how to fix the picture exactly. Okay, I could get rid of the label.

It might be worth considering, since we are considering integrating in different regions of W anyway, that we might just draw a bigger W and then have q be somewhere, and then we are restricting ourselves to a subregion to talk about that.

I think that's good, yeah. So let's just call this point out here v, and we're assuming here that p(x|v) is q(x). I think that's reasonable. As you say, we're sort of assuming that the true distribution is realizable, but in a larger class of parameters than we're currently considering. Yeah, that's a good idea.

Okay, so this quantity in the denominator is not the KL divergence, but let's think about it as nonetheless a measure, which it is, of the difference between these two distributions p(x|w0) and p(x|w). It's the expected difference in the number of bits... so how do we think about this information theoretically? The q(x) log p(x|w0) term gives me the expected number of bits I need to encode measurements that I receive, using the distribution p(x|w0), and likewise for the other term. And so it's the difference: it has the information theoretic content of the difference in the number of bits I need to encode observations with these two distributions. That's a reasonable measure of the distance in information theoretic terms.

Okay, but consider w as being close to w0. I want you to think about a little neighborhood here around w0 and around w0'. So we're taking a point w here, and relative to that distance between w and w0 we're looking at the variance of this log density ratio.

I don't actually know how to think about this ratio, to be honest. I don't know if you have a good way of thinking about it, Edmund. It's something where I understand how it appears technically, but I don't really know how to think about it.

Neither do I. I think it's a clever definition, maybe not completely obvious, because it isn't how Watanabe set it up originally.

Right, okay. So anyway, relatively finite variance: what it says is that this ratio is bounded by a constant independent of w0, w0', etc. Maybe there are many of these points going off. The way I'm drawing them they look like they're getting further and further away, but you should view them as being clustered around v, maybe, and the behavior is simultaneously bounded by a single constant c0. That's relatively finite variance. Now, what Watanabe shows is that if you have this condition... well, I guess the first thing to say is that realizability implies it. This definition is on page 72, by the way. So he proves that relatively finite variance is implied by realizability.

What happened, Dan?

Yeah, I got disconnected. Can you hear me now?

Yeah, I can hear you.

That's weird. I still can't see anything. I guess I can keep talking while it's reconnecting. Sorry about that, everyone, I got disconnected for some reason. How long was I talking into a blank microphone for?

Like 10 seconds.

Okay, no problem then. All right, so relatively finite variance is implied by realizability, and it's also implied by regularity. And relatively finite variance implies what Watanabe calls essential uniqueness, which is not that the parameter is unique (the one that determines the closest point to v; that would be something like regularity, I guess). Essential uniqueness means that the distributions are the same: it says that p(x|w0) is p(x|w0') for all w0 and w0' in W0... in W opt, I hope I didn't do that anywhere else. That is, taking this and this and this parameter, they all give you the same distribution, which is not automatic otherwise.

Okay, that's the setup for stating the theorem that Edmund was referring to last time. So maybe I only have time to state that, which is the free energy asymptotic formula, but in the most general form that I know is published; maybe Edmund can correct me. So let's state the theorem. This is, let's see, page 137.

I might write something on the other board.

Oh please, go ahead.

Yeah, this isn't meant to be me giving a lecture, so feel free; anybody should derail this and interrupt at any point. Okay, so I guess I need one more definition. So this isn't the theorem, this is the definition: this is standard form, which is I guess what Edmund is kind of assuming. Assuming relatively finite variance, it's well defined to write down p(x|w0) for w0 in W opt, because it's independent of the choice of w0. So that means I can write this down, and I can define K(w) to be this thing here. In the realizable case that's what we're used to as being K(w), but otherwise it's something new, because this numerator here is not the same as q(x) in the non-realizable case. So just to flag that this K is now something more general than what we're usually discussing.

Okay, so we say the (model, truth, prior) triplet is in standard form (this is the target of the resolution procedure) if functions a(x, w), b(w) exist with b(w) greater than zero... well, I guess the origin is supposed to be in W... if there exists b(w) positive on a neighborhood of zero, and the log density ratio function is given by a(x, w) times some monomial, and K is given by this, and phi, the prior, is given by this. So k and h are multi-indices. Okay, so this exists by resolution of singularities, but we say it's in standard form if that holds. Now let me try and state the theorem. I have to find it, bear with me. Theorem 11. I forget which section.

5.3.

Oh cool, yeah, thanks Evan. Where should I state this? Maybe I'll state it at the bottom of the board; I can get rid of this "exists".

Oh, you can erase my things. I just want to point out that one interpretation of that is the difference of the KL: it is equal to K(w) minus K(w0).

That's a good way to think about it, yeah. Okay, so Theorem 11 (what did I say, Section 5.3) of the green book, page 153, says that if the above holds, that is, the model has relatively finite variance and is in standard form (being in standard form I guess we fix by resolution, but relatively finite variance is an actual hypothesis), then the free energy, defined as before, has an asymptotic expansion: F_n is n L_n(w0) (again, that's independent of w0 in W opt), minus, minus log of some stuff (this is the random variable), plus something converging to zero in probability as n goes to infinity. Okay, so that's the most general form of the asymptotic expansion of the free