The Bayes free energy is the negative logarithm of the marginal likelihood: the integral over parameters of the likelihood of the data times the prior. KL divergence is a measure of the difference between two probability distributions and is related to Watanabe's empirical entropy and log loss. Watanabe proves the free energy formula in the realizable case, and also states a more general fact: the formula remains true under certain conditions even when the true distribution cannot be realized by the model. The WBIC paper covers (part of) the unrealizable case, and its hypotheses need to be understood to ensure the statement applies.

The free energy is approximately an error term plus λ log n, as explained in the green book and the widely applicable Bayesian information criterion (WBIC) paper from 2013. KL divergence is a measure of the difference between two probability distributions, and is related to Watanabe's empirical entropy and log loss. The Bayes free energy is the negative logarithm of the integral of the likelihood of the data times the prior; e^{-F_n(V)} is the integral of the posterior over a region V. Asymptotically the free energy equals n L_n(w_0) + λ log n + O_p(log log n), where O_p(1) denotes a sequence of random variables bounded in probability, i.e. of constant order.
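As a sanity check, the leading-order behaviour F_n ≈ n L_n(w_0) + λ log n can be verified numerically for a simple *regular* model, where the RLCT is λ = d/2 = 1/2. The model, prior, and integration grid below are illustrative assumptions, not anything from the discussion:

```python
# Toy numerical check of F_n ≈ n·L_n(w0) + λ·log n for a regular 1-d model,
# where λ = 1/2. Model: p(x|w) = N(w, 1); truth: q = N(0, 1); uniform prior.
# Everything here (model, prior, grid) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
xs = rng.normal(0.0, 1.0, size=n)

def L_n(w):
    # log loss (negative log likelihood per sample) at parameter w
    return 0.5 * np.log(2 * np.pi) + 0.5 * np.mean((xs - w) ** 2)

# F_n = -log Z_n with Z_n = ∫ exp(-n L_n(w)) φ(w) dw, φ uniform on [-3, 3]
grid = np.linspace(-3.0, 3.0, 6001)
dx = grid[1] - grid[0]
log_phi = -np.log(6.0)
log_integrand = np.array([-n * L_n(w) + log_phi for w in grid])
c = log_integrand.max()                       # log-sum-exp shift for stability
F_n = -(c + np.log(np.sum(np.exp(log_integrand - c)) * dx))

approx = n * L_n(0.0) + 0.5 * np.log(n)       # w0 = 0, λ = 1/2
print(F_n, approx)                            # agree up to an O_p(1) constant
```

The two numbers differ only by a constant-order term, consistent with the O_p(1) remainder hidden in the formula.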

Watanabe proves the free energy formula in the realizable case, where the true distribution can be realized by the model, and states a more general fact: the formula is still true under certain conditions even if the true distribution cannot be realized. Posterior concentration in two phases is discussed; concentration away from the set of true parameters disappears as the sample size increases. To compare the free energy or posterior behaviour near different points, one can restrict the integral, taking Z_n(V) = ∫_V p(D_n | w) φ(w) dw, and define F_n(V) = −log Z_n(V). The WBIC paper covers the unrealizable case, and its hypotheses need to be understood to ensure that the statement is true.

The free energy is approximately an error term plus λ log n. The proof of this formula is found in the gray book and the widely applicable Bayesian information criterion (WBIC) paper from 2013. Watanabe's Mathematical Theory of Bayesian Statistics (the green book, 2018) is a more recent text which also contains the formula. Main Theorem 6.2 in the gray book gives the realizable version.

KL divergence is a measure of the difference between two probability distributions, defined as the expectation under the first of the log ratio of their densities; empirically, K_n(w) = L_n(w) − S_n, the log loss minus Watanabe's empirical entropy. Bayes' rule states that the posterior of the parameters given the data is the likelihood times the prior, divided by a normalization constant Z_n. The Bayes free energy is F_n = −log Z_n, the negative logarithm of the integral of the likelihood of the data times the prior. This is discussed in the WBIC paper.
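The identity K_n(w) = L_n(w) − S_n can be checked directly on a toy example; the Gaussian truth and model below are hypothetical choices for illustration only:

```python
# Check the identity K_n(w) = L_n(w) - S_n on a toy example:
# truth q = N(0,1), model p(·|w) = N(w,1). Choices are illustrative.
import numpy as np

rng = np.random.default_rng(1)
xs = rng.normal(0.0, 1.0, size=500)

def log_gauss(x, mu):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

w = 0.7
S_n = -np.mean(log_gauss(xs, 0.0))                     # empirical entropy of q
L_n = -np.mean(log_gauss(xs, w))                       # log loss at w
K_n = np.mean(log_gauss(xs, 0.0) - log_gauss(xs, w))   # empirical KL divergence

print(K_n, L_n - S_n)   # equal up to floating-point rounding
```

This is the decomposition used in the transcript to explain why e^{−nS_n} can be pulled out of the partition function, leaving the normalized evidence Z_n^0 = ∫ e^{−nK_n(w)} φ(w) dw.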

e^{−F_n(V)} equals the integral of the posterior over a region V, so the free energy can be computed over a subset V of the parameter space W rather than over all of it. It has been shown (as cited in the WBIC paper) that the free energy equals n L_n(w_0) + λ log n + O_p(log log n), where O_p(1) denotes a sequence of random variables bounded in probability, i.e. of constant order. This means there is some remainder term of order log log n in the free energy expansion.
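Written out (a reconstruction in standard notation following the WBIC paper's conventions; the transcript describes these formulas verbally):

```latex
% Partition function, free energy, and the asymptotic expansion under discussion.
Z_n = \int_W \varphi(w) \prod_{i=1}^{n} p(X_i \mid w)\, dw,
\qquad
F_n = -\log Z_n,
\qquad
F_n = n L_n(w_0) + \lambda \log n + O_p(\log \log n).

% O_p(1) means bounded in probability:
X_n = O_p(1)
\iff
\forall \varepsilon > 0 \;\, \exists M > 0 : \;
\sup_n \Pr\bigl(\lVert X_n \rVert > M\bigr) < \varepsilon .
```

Here w_0 minimizes the KL divergence from the truth to the model and λ is the RLCT.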

Watanabe cites several papers for this formula, including a 2010 paper on the renormalizable condition, the gray book (2009), and two older papers from 1999 and 2001. The gray book proves the free energy formula in the realizable case, when the true distribution can be realized by the model. Watanabe states a more general fact: the formula is still true under some conditions even if the true distribution cannot be realized.

Posterior concentration in two phases is discussed, with the posterior concentrated near a highly singular point of W_0. For a region away from the true parameters, posterior concentration disappears as the sample size n increases, due to the dominance of the term n L_n(w) (or n K_n(w)) in the free energy. However, for finite n it is possible for there to be an area of posterior concentration which is not on W_0. The free energy of this area needs to be determined, and on W_0 itself the same restricted integral can be run near a point w′ just as near any other point of W_0.
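The claim that posterior mass away from W_0 vanishes as n grows can be illustrated numerically: the posterior mass of a region not containing the truth decays because F_n(V) picks up an n·K(w) penalty. The model, prior, and region below are illustrative assumptions:

```python
# Posterior mass of a region V away from the truth shrinks as n grows,
# because F_n(V) picks up an n·K(w) penalty. Toy Gaussian model; all
# choices (model, prior, region) are illustrative.
import numpy as np

rng = np.random.default_rng(3)

def log_Z(xs, lo, hi, m=2001):
    # log ∫_[lo,hi] exp(-n L_n(w)) φ(w) dw with uniform prior φ on [-3, 3]
    n = len(xs)
    grid = np.linspace(lo, hi, m)
    dx = grid[1] - grid[0]
    logf = np.array(
        [-n * (0.5 * np.log(2 * np.pi) + 0.5 * np.mean((xs - w) ** 2))
         for w in grid]
    ) - np.log(6.0)
    c = logf.max()                       # log-sum-exp shift for stability
    return c + np.log(np.sum(np.exp(logf - c)) * dx)

masses = []
for n in [10, 40, 160]:
    xs = rng.normal(0.0, 1.0, size=n)                        # truth w0 = 0
    log_mass_V = log_Z(xs, 1.0, 3.0) - log_Z(xs, -3.0, 3.0)  # V = [1, 3]
    masses.append(np.exp(log_mass_V))

print(masses)   # posterior mass of V heading toward 0 as n grows
```

For n large the mass of V = [1, 3] is astronomically small, which is the "this term is much more dominant" point made in the transcript.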

In order to compare the free energy or posterior behaviour near different points, one can restrict the integral, taking Z_n(V) = ∫_V p(D_n | w) φ(w) dw, and define F_n(V) = −log Z_n(V). This is equivalent to multiplying the prior by a cutoff function supported on V. The WBIC paper covers (part of) the unrealizable case, and it seems it can be applied to this modified prior. However, it is important to understand the hypotheses in the WBIC paper to ensure that the statement is true.
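The observation that restricting the integral to V is the same as multiplying the prior by a cutoff function can be checked directly; the setup below is a hypothetical toy example:

```python
# Restricting the free-energy integral to V is the same as multiplying the
# prior by a cutoff (indicator) function of V. Toy Gaussian example; the
# model, prior and region are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
xs = rng.normal(0.0, 1.0, size=100)
n = len(xs)

grid = np.linspace(-3.0, 3.0, 6001)
dx = grid[1] - grid[0]
phi = np.full_like(grid, 1.0 / 6.0)          # uniform prior density on [-3, 3]
nLn = n * (0.5 * np.log(2 * np.pi)
           + 0.5 * np.array([np.mean((xs - w) ** 2) for w in grid]))
c = (-nLn).max()                             # shared shift for numerical stability

V = (grid >= 0.5) & (grid <= 2.0)            # region V = [0.5, 2.0]

# (a) restrict the integration domain to V
Z_a = np.sum(np.exp(-nLn[V] - c) * phi[V]) * dx
# (b) integrate over all of W with the cutoff prior φ·1_V
Z_b = np.sum(np.exp(-nLn - c) * (phi * V)) * dx

F_a = -(c + np.log(Z_a))
F_b = -(c + np.log(Z_b))
print(F_a, F_b)   # F_n(V) computed two equivalent ways
```

Both routes compute the same Z_n(V), which is why one can hope to apply the WBIC paper's statement to the modified (cut off) prior, provided its hypotheses survive the modification.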

In Chapter 5 of the green book, W_0 is defined as the set of parameters which minimize the average loss function, i.e. which minimize the KL divergence. In Theorem 11 of Section 5.3, the free energy expansion is written down for this W_0. The chapter discusses the standard form and normal crossing form arising from a resolution of singularities, and provides a number of theorems about normal crossing forms.

In Chapter 3 of the green book, the relationship between the true distribution and the statistical model is discussed. Regular and singular models are defined, as well as the conditions of realizability and unrealizability. A new concept, the "essentially unique" case, is also introduced for unrealizable models. The book is more statistical than geometrical, and does not develop the geometric results, instead citing them.

okay so what's the point of this? Well, I guess I'll say a little bit first. We've only got half an hour, so I'll try and be brief. There's a question I think we should resolve — maybe it's not even difficult, maybe it's already in Watanabe's papers and it's just a matter of understanding how to use them properly. The question I want to pose is about the free energy approximation, and the most general form that I know of is stated in the WBIC paper; let me get that up and then I'll be sure to write it correctly. The main texts for singular learning theory are Watanabe's books: as you know, there's Algebraic Geometry and Statistical Learning Theory — the gray book — and there's Mathematical Theory of Bayesian Statistics — the green book — which is more recent. I think what I'm about to say probably is in the green book, you can tell me. And then there's also a very important paper, probably more widely cited than either of those, which is "A Widely Applicable Bayesian Information Criterion". Maybe I'll put those up on the board so we can refer to them. So there's Algebraic Geometry and Statistical Learning Theory; there's the green book, I won't say the title again; and this is the gray book. (Dan, sorry to interrupt, are you attached to the... — Oh, I'm not, yeah, thank you.) Okay, so there's the gray book, there's the green book, and then there's the WBIC paper, which is from — what year is that — 2013. What year is the green book, Edmund, do you know? 2018.
Yeah, okay, that must subsume the result. I'm more familiar with this paper than the green book. For completeness, I think it's 2009 for the gray book — thanks. Okay, so the formula: I'll write the dummies' version first. The formula I'm referring to is: free energy is approximately some error plus λ log n. There's actually more than that — there are lower-order terms — but this is the error, this is the entropy. Let's just call this the free energy formula. So let me explain where the proof of the free energy formula is. I don't know how much setup to give; I think it'll be useful to give a little bit if we're going to discuss this over several sessions, so just bear with me a second. If you look in the gray book, there's a formula for this, which is Main Theorem 6.2 and

corollary 6.1 — which has the empirical entropy in it. If you look in the WBIC paper, in that position in the formula there's n L_n(ŵ), and those are the same if K_n is 0, and so on; that's related to realizable versus unrealizable, which is kind of the point here. So to avoid confusion on those points I think it's worth just recapitulating some basics, if you'll bear with me. (Can you guys still hear me? I'm actually not in Roblox on my laptop. — Yep. — Okay, good.) So K_n, the empirical KL divergence, is defined for a sample to be this; and if I just expand that, this here is what Watanabe usually denotes S_n, the empirical entropy of the true distribution, and this is the log loss or negative log likelihood — so this is S_n, or negative S_n rather. (There's a w in the second term. — Oh, thank you, yeah, thanks.) Okay, so that's the first reminder. Then the posterior p(w | D_n) is, by Bayes' rule, equal to this, where φ is the prior and Z_n is just the normalization constant — the integral over everything of that term there; this is the partition function. But notice that I could replace this term here with K_n + S_n, and then I'd have e^{-nK_n} and e^{-nS_n}, and the latter is independent of w, so it could come out the front of this integral if I switch L_n for K_n. So there are some slight differences depending on whether you use K_n or S_n. Why don't we write Z_n^0 for that version, for comparison to the gray book, where both appear. If you were to do that, you'd get what Watanabe calls the normalized evidence or normalized partition function. Okay, that was just a reminder; now I'm going to switch to
making a statement about the free energy that's in the WBIC paper. The paper cites a bunch of Watanabe's papers and the gray book, but the proof, I guess, is in the renormalizable-condition paper that Edmund and I like to worry about these days — though I'm not even sure about that, so that's one of the things I want to discuss: where actually is the proof of this? With the notation I just introduced — so, L_n — the Bayes free energy is minus log of the integral of the prior times the product over i = 1 to n. Okay, so how does that relate to what I

just wrote on the board? Well, if you look at that formula for Z_n — I guess it should be visible on the screen — if I take e^{-n L_n(w)}: well, L_n(w) is equal to 1/n times a sum over i of log something, so I multiply that by n, get rid of the minus sign, and take e to the power of that, and I'll just get the product over the X_i's. So this is −log Z_n — I hope the minus signs worked out right there; somebody tell me if they didn't. Okay, so this is the integral over all of W. Now, what we're concerned with here is the fact that — well, in physics you often don't do this integral over all of W, you do it over some subset of W. So you could take V ⊆ W some compact subset, and you might say something like: the free energy of V is minus log of the integral of the posterior over V. The point being that if you were to integrate the posterior over V, to find out how much probability mass is contained in V, that would be e^{-F(V)}. So that acts as a kind of coarse-graining of the energy, something like that; that's the thermodynamic way of thinking about it. Okay, so that's F, the free energy. I'll just stick with integrating over all of W for the moment, but we want to work towards talking about this restricted version, for various reasons. So what Watanabe says in the WBIC paper is "recently it was proved that" — and afterwards he cites five papers spread out over 20 years — recently it was proved that the free energy is equal to n L_n(w_0) + λ log n + O_p(log log n). I never remember what O_p means, so maybe I'll recall it for you in case you're similarly disabled. A sequence of random variables X_n is O_p(1) if — I'll just write it, I think this is right — if you think of ‖X_n‖ (X_n is some vector, possibly), a random variable valued in the non-negative real numbers, whose density looks
something like this — that's X_1, that's X_2 — you can make sure there's not much mass in the tail, for any n, by going far enough along. So this is the probabilistic version of saying it's of constant order. What this means is that there's some term here of order log log n: in the asymptotics you have the n term, you have the log n term, and then this is a lower order. So to order log log n, the free energy is given by this expression, and that is what I

meant earlier when I wrote something like this — right, so there was some term plus λ log n. The hypothesis here is that w_0 minimizes the KL divergence from the true distribution to the model, and λ is the RLCT, of course. Any questions at this point? Comments? Okay. So immediately after that statement he cites a bunch of papers; let me just check them. 2010a, which is the renormalizable-condition paper — I didn't add that to the list, maybe I'll do that in a second; this is the WBIC paper. I'm not going to write the whole title: "Asymptotic learning curve and renormalizable condition" — I think it's the only paper by Watanabe that has "renormalizable" in the title. So that's the renormalization paper; that's 2010a in the WBIC paper, and he cites it for this free energy formula. He also cites 2009, which is of course the gray book. Let me put these up on the board with that claim over here: so he cites the renormalization paper, he cites the gray book. What else does he cite? 2001a — that's "Algebraic analysis for non-identifiable learning machines". I feel like that's actually not relevant; I think he's just listing a bunch of stuff. And there's a very old paper from 1999, "Algebraic analysis for singular statistical estimation". I feel like those two aren't relevant — someone correct me. (I think there's a sense that he's citing the genesis of the idea of using algebraic geometry, basically explaining where the language comes from.) Yeah, I think that's right. Okay, so let me say what I think is in the gray book. Well, there are two things to say: what problem do I think we should solve, and what is known. Maybe I'll say what I want first — well, I'll briefly comment that this formula is definitely proven in the gray book when this term here is just equal to
n S_n — well, how should I say this, that's not quite true. Okay: the gray book proves this in the realizable case, that is, when it's possible to realize the true distribution by the model — when K(w_0) is equal to zero (K_n(w_0), of course, may not be equal to zero). So that's the version that's in Watanabe's Main Theorem 6.2 and Corollary 6.1. But he's stating a more general fact here, which is that even if you can't realize the true distribution it's still true under some conditions

— maybe I'll come back to those. Actually, the WBIC paper itself is maybe another proof of this; I'm not sure. Okay, so that's another thing I'm confused about. All right, before I get any more precise about what's known and what's not known, let me say what I want and why. Consider a situation with two phases — what I mean by a phase is a distinct area of posterior concentration; we can be more precise than that, but that'll do. I'm drawing these kinds of pictures in DSLT 1, 2 and 3, but okay. Let's just stick with the realizable case. So maybe there's W_0, and if I were to draw the posterior, it's concentrated along W_0, and very concentrated near this highly singular point here. But nobody says there aren't other areas where the posterior concentrates — of course the posterior depends on D_n, right? You sample D_n and then you get some distribution over the weights. Well, as n goes to infinity, this other area disappears: nothing that isn't a true parameter can have the posterior concentrated near it as n goes to infinity. One way of thinking about that: remember, the posterior is roughly e to the minus the free energy. So for the posterior to be large, you need that to be large, which means you need the free energy to be small; and the free energy is, roughly speaking, loss (or energy) plus λ log n, where this first term has n times L_n, or n times K_n, depending. Okay, so as n goes to infinity this term is much more dominant, so to have low free energy you need it to be zero. So for n very large this region can't contribute, but for n finite it's certainly possible for there to be an area of posterior concentration which is not on W_0 — see Liam's thesis for a concrete example. So this really happens, and I want to know: what is the free energy, I mean, what does the posterior asymptotically look like
there? Or similarly, what does the free energy look like over there? I know that on W_0 — I mean, the asymptotic formula I just cited was for the integral over all of W, right, that's what Z_n was about — but you could do the same trick: if there were multiple interesting places on W_0, I can run the same integral but near w′, and it all works, I believe, and gives you a formula. So you could say near here you're going to use λ, and near here

you'll see the same formula, roughly, but with λ′. So that's a way of comparing how the free energy or posterior behaves near different points of W_0. But what about this other region? So my question is: what is the free energy formula for phases — areas of posterior concentration — not on W_0? I.e., if you were to define this region out here — let me call it something; not ρ, that's for a density, I want the set itself — call it V. Okay, I could take Z_n(V) to be the integral over V of p(D_n | w) φ(w) dw, and then define F_n(V) to be −log Z_n(V), and I'm asking: what is the asymptotic formula for that? You might think it could be something like the earlier formula. It could be that F_n(V) is something like n L_n(ŵ) — where ŵ, that would be over here maybe, is the point that minimizes the KL divergence on V — plus λ_V log n plus O_p(log log n), where λ_V is the RLCT of the KL divergence minus its smallest value on V, so that it would be non-negative on V and you could blow it up and compute the RLCT. That would be a reasonable guess, and it's a kind of heuristic that's stated in some of my notes and used in some ways to try and compare phases, etc. But I don't know that it's true. (I think that is true as long as we believe that formula holds for the unrealizable case, because this is equivalent to changing the prior by multiplying it with a cutoff function, kind of cutting off V.) That's right. So since the WBIC paper covers at least part of the unrealizable case — there's some condition — it does seem like we should in principle be able to just apply the WBIC paper to that modified prior, or, if you like, to W changed to be V. Yeah, I think that's hopefully true, but I don't understand the proof of the statement in the WBIC paper. So I think
Edmund's right that in some sense I hope we can just apply that statement, but then I want to be sure that, however it's proven, we really know the hypotheses are true. I don't want to just take for granted what's stated in the WBIC paper. So I either want to understand — maybe there's an alternative proof in the WBIC paper itself; there's certainly something in there that might be relevant — or it's in the renormalizable-condition paper, or maybe it's in the green book, I don't know. Does the green book reproduce

yeah, I think so — chapter five. The green book sort of starts off without the realizable assumption, and in many cases W_0 has a different meaning: it's frequently the set of parameters which minimize the average loss function, so basically W_0 is minimizing the KL divergence. And then in Theorem 11 in Section 5.3, that formula that you just wrote up is written down for W_0. (Where is that?) Chapter 5, Section 5.3, Theorem 11, page 153, if you are using the same version. Okay, so — green book, sorry, say that again? Chapter five, Theorem eleven. Okay. So in that case W_0 just means the set which minimizes the KL divergence, is that right — capital W_0? Yes. Okay. So you think that if we apply that, but we change the prior in some sneaky way, or we just take the model to be V as the space of parameters — yep. Okay. It's a little bit meta, though; it definitely needs us to go through it in a careful way to see whether anything breaks. Yeah, I think that's a good starting point. So maybe that's the question mark to start with, and that comes with understanding the proof. Sorry, I didn't notice how much time we'd used — I'm going to have to go start up the code session in Discord, but feel free to keep thinking about this. Are there any quick questions before I go? Okay. Sorry to leave unceremoniously, but you can ask Edmund for clarifications if you want to understand this further, and we'll continue next week. Thanks everyone. (Thank you. Thanks, Sam.) Man, so in the green book he really labels the theorems and propositions separately? What do you mean? Theorem 11 —
and I'm saying Definition 16, or maybe I'm wrong. What is your preferred — everything in this — anyway, I can find it as long as I don't have to keep track of the last theorem, right. Oh, I see, there are nine. All right, page 153, thanks. Got it. All right, cool. Also, a word of warning, though: that chapter comes before he even discusses any resolution of singularities. Basically he says, let's look at the standard form, let's look at the normal crossing form, and then there's a bunch of theorems about normal crossing forms; and in the next chapter he says something like: by the way, this normal crossing

form is the most general you can get, because you can always blow things up. So yeah — and also, we still have the problem of localizing, so there are no charts or anything in here; that's in the next chapter as well. I guess I'll read this sometime this week. Right, yeah, it's chapter five, more or less — I guess I'll have to go back and see where it makes sense to start reading. Okay. This book is a bit more statistical than geometrical. Yeah — it doesn't quite discuss any of the geometric results, it just cites them. Particularly galling is that at some point he comments that the gray book is "for non-mathematicians". (The gray book is what? It's "for non-mathematicians".) Yeah, that undermines my confidence a little bit. It's pretty funny. I haven't looked at the green book — I mean, I downloaded it at some point because you guys mentioned it. All right, I guess chapter five is probably going to be a good read; I'll probably start there and take a look. Yeah. I think chapter five — I haven't read through it very carefully — is rather independent from the previous chapters: the previous chapters discuss the things about Bayesian statistics that we should care about, and chapter five discusses that form specifically, the normal crossing form. Okay. But then, do be careful that the green book, from chapter one, does not have the realizability assumption straight away. Where is that — right, so chapter three, I think, is where he lays down the things that we care about, like the observables, the generalization error, et cetera. And in chapter three, in the first few pages, he deals with the
relationship between the true distribution and the statistical model. So basically that's where he defines regular models and singular models, and that's the same as the gray book. However, the thing that is drastically different from the gray book is that he defines the conditions of realizability and unrealizability, and he covers another huge part of the unrealizable case — I guess it's like the renormalizable case, but he calls it something like the "essentially unique" case. So basically, when things are not realizable