WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed. The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.

Synopsis


Free energy is the negative log of the partition function (the model evidence), the quantity that normalizes the posterior probability of a parameter given observations, and it plays a central role in singular learning theory. Bayesian methods predict the probability of a new observation, and the generalization error measures how far this prediction is from the truth. The delta distribution can be used to rewrite the partition function in terms of a density of states, whose Mellin transform gives the zeta function of the model and whose Laplace transform recovers the partition function. These tools yield the asymptotics of the free energy as the sample size grows, with the leading coefficient given by the real log canonical threshold (RLCT).

Short Summary


Free energy, a central quantity in learning theory, is the negative log of the partition function, which normalizes the posterior distribution of a parameter given observations. The partition function integrates the likelihood of the data times the prior over parameter space, and the likelihood factorizes because the observations are independent given the parameter. The average log loss is the negative log likelihood divided by the number of samples. In Bayesian statistics, two models are compared by their evidence (marginal likelihood), so the free energy is used for model selection.
Bayesian methods predict the probability of a new observation via the predictive distribution, and the Bayes generalization error is the KL divergence between this prediction and the true distribution. Quenching takes expectations of stochastic quantities over the data set and reduces them to deterministic ones, giving the KL divergence K(w) on the parameter manifold together with the corresponding quenched partition function and free energy. The normalized quenched free energy equals lambda log n minus (m minus 1) log log n plus a bounded term, and the stochastic and quenched versions of the free energy are asymptotically equivalent. From this one extracts the leading order term of the expected generalization error, lambda over n plus something that decreases faster than 1/n.
The delta distribution can be used to rewrite the integral in terms of a density of states. The Mellin transform of this density is the zeta function of the model, and its Laplace transform recovers the partition function. An example with a uniform prior on the unit square and K(x1, x2) = x1^2 x2^2 illustrates this. The asymptotics of the Laplace transform are found by changing variables from t to t/n and splitting the integral into two parts, with the integral from n to infinity shown to vanish faster than 1/n as n goes to infinity.

Long Summary


Free energy is the negative log of the partition function, the quantity that normalizes the posterior distribution of a parameter given observations. Evaluating it is an example of a singular integral, a central object in singular learning theory. The posterior is the likelihood of the data times the prior, divided by the partition function. The likelihood factorizes because each observation is assumed independent and drawn from the same true distribution.
The average log loss is the negative log likelihood divided by the number of samples. The partition function, also known as the model evidence or marginal likelihood, is a probability density over the data set, and when the data set is held fixed it can be viewed as a likelihood for the model-prior pair. The posterior can be rewritten to absorb the prior into a Hamiltonian-like exponent, with the temperature set to one.
In Bayesian statistics, two models p1 and p2 can be compared by assessing their evidence (marginal likelihood); the model with higher evidence is preferred. This criterion implicitly relies on integration over a space of model-prior pairs, and can be made more rigorous by introducing hyperparameters. Free energy, being the negative log evidence, is therefore central to model selection.
Bayesian methods predict the probability of a new observation via the predictive distribution. The Bayes generalization error is the KL divergence between this prediction and the truth, and its expectation equals the expected increase of the normalized free energy as the sample size grows by one. Practical quantities such as the average log loss, the partition function and the free energy can be normalized for theoretical analysis. The quantity w0 is a minimizer of the average log loss; in the realizable case the minimum value is the empirical entropy of the true distribution.
Quenching takes the expectation of stochastic quantities over the data set and reduces them to deterministic mathematical quantities. The quantity Ln is reduced to L, and similarly for the partition function and the free energy. These quantities can then be normalized, giving the KL divergence K(w) on the parameter manifold and the corresponding quenched partition function and free energy. The quenched partition function is the integral of e to the minus n K(w) against the prior.
In order to understand the practical aspects of statistical learning, it is necessary to study the normalized and quenched quantities. There is a theorem stating that the stochastic version of the free energy, when normalized, is bounded above by the quenched version. A stronger result is that the two versions are asymptotically equivalent. This follows from the theorem that the normalized quenched free energy equals lambda log n minus (m minus 1) log log n plus a bounded term, so the first two terms are the leading order terms, with lambda the RLCT and m its multiplicity. As a corollary, the leading order term of the expected generalization error is lambda over n plus something that decreases faster than 1/n.
The delta distribution can be used to rewrite the exponential term as an integral over a density of states. The Laplace transform of the density of states gives the partition function, while its Mellin transform gives the zeta function of the model. This is illustrated using an example with a uniform prior on the unit square and K(x1, x2) = x1^2 x2^2.
The zeta function is evaluated over the unit square, where it factorizes into two one-dimensional integrals. The inverse Mellin transform is used to recover the density of states function, which is then Laplace transformed to get the Zn integral. A lemma, proved by integration by parts, gives the Mellin transform of the relevant family of functions in full generality.
A change of variable from t to t/n is used to extract the asymptotics of the Laplace transform of the density of states. The leading term is log n over four root n multiplied by the integral from 0 to infinity of t to the negative half times e to the negative t dt. The integral from 0 to infinity is split into a part from 0 to n and a part from n to infinity, and the latter is shown to vanish faster than 1/n as n goes to infinity.
A singular integral was evaluated using the Laplace transform and the density of states function; in the two-variable example the RLCT came out to one half. The generalization to an arbitrary number of variables uses exponents h and k chosen so that the ratios (h+1)/(2k) are all equal to lambda; the zeta function then factorizes into individual integrals whose product has a pole of the same form at minus lambda. This technique can be used to evaluate singular integrals more generally.
In order to evaluate the leading term of the actual partition function, relevant modifications are added to the standard-form expression, including two smooth functions. The quantity Kn is then written as K plus the difference between Kn and K, rescaled by root n with a square root of K taken out, which allows the Central Limit Theorem to apply.

Raw Transcript


today we are going to talk about free energy: what it is, why we care, and i'll introduce techniques to calculate its asymptotics as the number of samples goes to infinity. what you see on the board here is the sequence of steps towards the general free energy, the partition function and the free energy, its negative log; the steps go from a simple form of the free energy to something much more complicated, which is the real thing we care about, and hopefully today we will get towards that bottom layer. okay, sorry, i'm jumping to the next board immediately. okay, so let me just write down the goals. i want to remind ourselves what free energy is, why it is called the free energy, why we care, and to introduce a core technique, or core observation, that is used in the grey book and in singular learning theory to evaluate singular integrals; the sequence of integrals that we saw on the last board are examples of singular integrals. and then hopefully we'll start climbing the ladder. okay. so what is free energy? definition: the free energy of a statistical model plus prior pair, given (so, fixing) a set of observations or data set D_n, is the negative log of the partition function. okay, now what is the partition function? let me define that on the next board. (oops, the camera position is not right; yeah, that's good.) okay, so to define the partition function, or the evidence, of a bayesian model, we recall that the posterior distribution, which is a density on the parameter space W (a subset of d-dimensional euclidean space), is given by this expression: the probability of some parameter given the observations is the likelihood of observing those observations at that parameter, times the prior, normalized by something that we call the partition function. so that thing is what we call the evidence or partition function, that's the prior, and that's the likelihood, which, by the iid assumption (each observation is independent and drawn from the same distribution, known as the true distribution), factorizes. actually, i think it wasn't iid from the truth q that gives this factorization; it's actually conditional
independence, conditioned on the parameter, that gives this factorization. anyway, with that we can rewrite the posterior into the following form, and this will be the notation we'll be using. i'll define the terms here: that's the average log loss, which is equal to negative one on n times the log likelihood, so it's the negative of that thing, the likelihood, and that is the standard physics notation for the partition function. it has many names. it's just a normalizing factor, because the left hand side is a probability density on W, so if you integrate the right hand side over W it should be one; and the fact that you can obtain the quantity by marginalizing out the parameter also makes it the marginal likelihood. these characterizations give us the expression: Z_n, the partition function, is the integral over the parameter space of the numerator, basically. it is also called the model evidence. so notice that Z_n, if we recall from above, is the probability of observing the data set when we marginalize out the parameter, and this is indeed a probability: if we integrate it over the observations it should be one, so it is a probability density over D_n, over the data set. and if we switch perspective, as we often do with likelihood functions, and think of the D_n as fixed and what's varying is the model and the prior, then this can also be thought of as a likelihood for the model-prior pair. i'm not clear what kind of mathematical difficulties arise when we talk about a likelihood over a function space, so take that with a grain of salt. i'm going to erase this part (whoops, erasing is not easy, is it) to give the final name for it, which is the partition function name. that's because it is very much like the boltzmann distribution in statistical physics, and in that case we actually call it the partition function. we can rewrite the posterior in the following form, just to absorb the prior into the above, and we can think of this as a hamiltonian; well, the temperature term is kind of missing, which is why in the book the posterior is generalized to include a temperature term, the generalized posterior, but for our purpose here we will stick with temperature equal to one, which is the strict bayesian posterior.
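As a concrete illustration of the quantities just defined, here is a minimal numerical sketch (not from the talk). The toy model p(x|w) = N(x; w, 1), the true distribution N(0.3, 1), the uniform prior on [-2, 2], and the grid-based integration are all illustrative assumptions.

```python
# Minimal numerical sketch of Z_n, F_n and the posterior for a 1d toy model.
# The model p(x|w) = N(x; w, 1), the truth N(0.3, 1), the uniform prior on
# [-2, 2] and the grid integration are all illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 50
data = rng.normal(0.3, 1.0, size=n)          # D_n drawn iid from the truth q

w = np.linspace(-2.0, 2.0, 4001)             # grid on the parameter space W
dw = w[1] - w[0]
prior = np.full_like(w, 0.25)                # uniform prior density on [-2, 2]

# n * L_n(w) = -sum_i log p(x_i | w), i.e. n times the average log loss
nLn = -norm.logpdf(data[:, None], loc=w[None, :]).sum(axis=0)

Zn = np.sum(np.exp(-nLn) * prior) * dw       # partition function / model evidence
Fn = -np.log(Zn)                             # free energy
posterior = np.exp(-nLn) * prior / Zn        # posterior density p(w | D_n) on the grid
print(f"Z_n = {Zn:.3e}, F_n = {Fn:.3f}, posterior mass near w=0.3: "
      f"{np.sum(posterior[np.abs(w - 0.3) < 0.3]) * dw:.3f}")
```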
right, so what good is it for? well, as you can see, it is rediscovered in many different forms and has many different names; important things have many different names, and many people care about it, so that's one reason for us to care about it. another reason is that it is used, for example, for model selection. say i have a model p1, on some parameter space with its own prior, compared to another model p2: if one of them has higher evidence, marginal likelihood, etc., then it is a better model. that's the criterion used in bayesian statistics to do model selection. can i ask you about this? this seems like it sort of depends in some way on the statement you erased, right, the integral over the function space being one; otherwise it's not like the different model-prior pairs really live in the same universe, except in that statement, right? i guess i don't really understand why i have to accept this as a basis of model selection. who forces me to do this? yeah, so i'm a bit tentative when i say those words as well. one way to discuss this, and watanabe seems to sometimes take this perspective as well, is to introduce hyperparameters: there is another distribution over the set of model-prior pairs, and then you can actually say that the evidence, which is now itself a conditional distribution, conditional on the hyperprior, is the probability of the model-prior pair given the dataset. so in that case you can say it that way. it's the same kind of difficulty as when we do maximum likelihood estimates: we say that the maximum likelihood estimate is the true answer, but really the usual argument is that it's the true answer because it is the most probable parameter, and that's a posterior distribution statement; if you put a uniform prior on the space then the two statements are equivalent. so there are ways to make this somewhat rigorous, but the mathematical difficulties are not mapped out yet. okay, so let's go to the next board.
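To illustrate the model selection criterion discussed above, here is a small sketch continuing the same toy setup (both models and all numbers are illustrative assumptions, not from the talk): model 1 has a free mean with a uniform prior, model 2 is the fixed density N(0, 1), and the comparison is via log evidence, equivalently via free energy.

```python
# Sketch of Bayesian model selection by evidence, with illustrative models:
# model 1 has a free mean w ~ Uniform[-2, 2]; model 2 is the fixed density N(0, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(0.3, 1.0, size=50)           # same illustrative data-generating process

w = np.linspace(-2.0, 2.0, 4001)
dw = w[1] - w[0]

loglik = norm.logpdf(data[:, None], loc=w[None, :]).sum(axis=0)   # log p(D_n | w)
logZ1 = np.log(np.sum(np.exp(loglik) * 0.25) * dw)                # evidence of model 1
logZ2 = norm.logpdf(data, loc=0.0).sum()                          # model 2 has no free parameter

print("log Bayes factor (1 vs 2):", logZ1 - logZ2)
# a positive value favours model 1; equivalently model 1 has lower free energy -log Z_n
```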
okay, just to mention one final reason for us today to care about the free energy. recall that we have just defined the partition function and the evidence, and the free energy is the negative log of that, so F_n is negative log of Z_n, just to write that down. then the bayesian generalization error, B_g: just to recall what that is, the bayesian method of doing statistical estimation is, we want to predict the probability of a new observation, and one way to do it is to ask what each parameter thinks of the observation and average that over all parameters with respect to the posterior. that's the bayesian predictive distribution, and the generalization error is just the KL divergence between this and the truth. okay, so notice that this prediction depends on D_n, so it is itself a random variable. then, theorem: its expectation over the data set (the thing giving it the randomness) is equal to the expected increase in free energy, of sorts. let me just write that down and explain.
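In symbols, and using the standard notation of the gray book in the realizable setting (a reconstruction, since the board itself is not in the transcript), the predictive distribution, the Bayes generalization error, and the theorem being stated read:

$$
p(x \mid D_n) = \int_W p(x \mid w)\, p(w \mid D_n)\, dw,
\qquad
B_g = \int q(x)\, \log \frac{q(x)}{p(x \mid D_n)}\, dx,
$$
$$
\mathbb{E}_{D_n}[\,B_g\,] = \mathbb{E}[\,F_{n+1}^0\,] - \mathbb{E}[\,F_n^0\,],
$$

where $F_n^0$ is the normalized free energy introduced in the digression that follows.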
well, there is this zero up there, which is a quantity related to the free energy but somehow normalized, so i'm going to have a necessary digression on normalization, and i'm going to use a fancy physics word whose full implications i don't really understand: quenching. okay, so recall that we now have various quantities we are interested in: the average log loss, the partition function, and the free energy. these are what i would call practical quantities: once you have the form of your model and the prior, you can actually write them down, and if you have a data set, these are actual numbers to you, and they are quantities to be minimized or maximized. but for theoretical purposes it might be more convenient to have normalized quantities, and that gives us things like K_n(w), which is the quantity on the left hand side but normalized by its minimum. so instead of being min-maxed like the left hand side, this thing is to be minimized; theoretically we want to find its zero. and the way i write it here, where w naught is in the set of minima, this perspective kind of generalizes to the unrealizable case; if it is realizable, then w naught is a true parameter and this just becomes the empirical entropy of the true distribution. please do ask me questions if anything is unclear. and then there is the Z_n, normalized to Z_n^0, which is Z_n but normalized. okay, i'm going to stick with the minima formulation, so again, if it is realizable this is equal to... so w0 is the vanishing of the average log loss, right? why is it really true that L_n(w0) has to be S_n? oh, no, sorry, i think i have a slightly different terminology: w0 is, sorry, the minimum of the L function, which is just, yeah, that thing; but if it is realizable at w0, that is, p of x given w0 is equal to q, then that becomes the entropy. yep. and then the quantity above, the one i used in the theorem, F_n^0, is just the negative log of Z_n^0, which is F_n, the unnormalized thing, normalized by that, and again becomes S_n, the empirical entropy, when it is realizable. so these are theoretical quantities, and in fact there is another operation we can do on them, called quenching, of which this L(w) quantity is one example. basically, the quantities in the whole first row here are stochastic, and the bottom row will be the things where the stochastic part has been averaged out, expectations taken, and what's left are the central mathematical quantities which are, as you can probably tell, useful for mathematical analysis. so L_n, which is a random variable depending on the data set, you take the expectation over the data set and you get this quantity L. the other two quantities are actually not named, but i will just write out what quenching means in their case: Z_n becomes that quantity, obtained by the same operation that takes L_n to L, and there is the negative log of the same for F. well, these quantities can themselves be normalized, or quenched from above, and you get the quantities we care about today: the K(w) quantity that we have been talking about for a while, which is the KL divergence on the parameter manifold, and two other things, the quenched versions of Z_n and F_n, normalized and quenched from the three quantities on the top left, and that's equal to e to the negative n K(w) integrated over the manifold against the prior.
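The table of quantities being described can be reconstructed roughly as follows; the notation follows the gray book, and the bar on the quenched quantities is my own label, since the speaker notes they are not given names in the talk:

$$
L_n(w) = -\frac{1}{n}\sum_{i=1}^n \log p(X_i \mid w),
\qquad
K_n(w) = L_n(w) - L_n(w_0),
$$
$$
Z_n = \int_W e^{-n L_n(w)}\, \varphi(w)\, dw,
\qquad
Z_n^0 = \int_W e^{-n K_n(w)}\, \varphi(w)\, dw,
\qquad
F_n^0 = -\log Z_n^0,
$$

and their quenched (expectation-taken) counterparts

$$
L(w) = \mathbb{E}[L_n(w)], \qquad K(w) = L(w) - L(w_0),
$$
$$
\bar Z_n = \int_W e^{-n K(w)}\, \varphi(w)\, dw, \qquad \bar F_n = -\log \bar Z_n,
$$

with $K(w)$ equal to the KL divergence $\mathrm{KL}(q \,\|\, p(\cdot \mid w))$ in the realizable case.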
so you can either think of the L_n on the left getting normalized, or you can think of the K_n above, the stochastic version, getting quenched, and the quenched F is just the negative log of that quenched Z. okay, so we are going to focus on that today: how to evaluate it, and how to relate this mathematical integral to its stochastic counterpart. let's go back to the first board. okay, let me clear that. the upshot is: we study the normalized and quenched quantities as a first step to understanding the practical quantities and how they behave when we actually do statistical learning with real data. well, let's look at some easy results on the relationship between these two kinds of quantities. we have a theorem, which is easy-ish, not too hard to prove, theorem 6.4 in the grey book: the practical stochastic version of the free energy, when normalized, is bounded above by the quenched version. this just relies on jensen's inequality (i believe that's how it's pronounced), so it's not too hard to prove; one can have a look at it in the book. the stronger result that we actually want is that the above quantities are asymptotically equivalent: not only is the one bounded above by the quenched quantity, asymptotically as n goes to infinity they are of the same order of magnitude. and that's summarized by main theorem two in the book, and i will just write out what it means in our context: the quantity above is equal to lambda log n, minus (m minus 1) log log n, plus something of order one. which means the first two terms are the leading order terms, the lambda is the RLCT and m is its order. a corollary of this theorem: recall above that the expected generalization error is equal to the increase of the normalized free energy, and using the formula above you can extract the leading order term, which is lambda on n plus something that decreases way faster than one on n.
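In symbols, the two asymptotic statements just quoted read (in the notation above, with $m$ the multiplicity of the relevant pole, as in the book):

$$
\bar F_n = \lambda \log n \;-\; (m - 1)\,\log\log n \;+\; O(1),
\qquad
\mathbb{E}[B_g] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).
$$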
okay, so we want to get to this result, and we are going to take literally just the first step today. let's start that on the next board. right, so we want to evaluate Z_n, let me write it out large and clear, equal to that expression. okay, now we are going to introduce a core observation that makes this work, which i think is kind of a genius innovation in SLT: i'm going to rewrite this term here using the delta distribution. that underlined term has been turned into this integral, where the delta is the dirac delta distribution. i'm going to put on the physicist hat here and not define this rigorously for this talk. the idea is that for any particular value of w, this K(w) is just a single number, and the defining property of the dirac delta is that it is kind of an evaluation: the integral of delta of x times f of x is f of zero, it evaluates the function you integrate against at zero, and a change of coordinates, shifting coordinates, makes these two expressions equal. and then, if you swap the integration order... maybe it's just worth saying that such a distribution exists: it's not trivial to construct, and depends for example on the fact that most fibers of K are regular and so on, so there's some work there, but that's not the subject of today. yeah, that's definitely not trivial to define, and the next operation is also not trivial to justify, which is the fact that you can swap the integrals. if you swap the integrals you get this expression, and that is what Dan introduced last time: the density of states, where K (which is positive) is treated like an energy function; we call it v of t, a function of t, and the operation there, integrating it against e to the negative n t, is the laplace transform. okay. the next observation is about the density of states itself: if we do a mellin transform on it (the mellin transform of a function f is defined to be the following integral transform), remembering that v of t is itself an integral, we substitute in that integral and again swap the order of integration (a lot of justification required), and we get the integral of K of w to the power z against the prior, and that's equal to the zeta function of our model. and the point is that this is easy, well, if K of w is normal crossing, and i will illustrate that with an example. let's go to the next board.
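The chain of identities described verbally can be written out as follows; the convention with $t^z$ rather than $t^{z-1}$ in the Mellin-type transform follows the talk, and, as noted, the existence of $v(t)$ and the interchanges of integrals need separate justification:

$$
\bar Z_n = \int_W e^{-n K(w)}\, \varphi(w)\, dw
        = \int_0^\infty e^{-n t}\, v(t)\, dt,
\qquad
v(t) = \int_W \delta\big(t - K(w)\big)\, \varphi(w)\, dw,
$$
$$
\zeta(z) = \int_0^\infty t^z\, v(t)\, dt = \int_W K(w)^z\, \varphi(w)\, dw.
$$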
so let's look at all this in action using an example; this is example 4.7 in watanabe's book. we have K of x1, x2, our KL divergence function, just given by that, and our prior is something extremely simple: the uniform density on the unit square, and zero elsewhere. well, in that case our goal, the Z_n function we would like to evaluate, reduces to this integral over the unit square, and we want to know its asymptotic behaviour as n goes to infinity. by the way, if you taylor expand this and then integrate term by term, that is not an asymptotic expansion as n goes to infinity, because it only works for n small, not large; that's the first thing i tried, and it didn't work, and now i know why. so let's use the technique we have been looking at just moments ago. we first evaluate the zeta function: the zeta function in this case just becomes K raised to the power z, integrated over x1 and x2, and notice that it factorizes into two integrals which you can evaluate independently; both are equal to one on two z plus one, so we get one on (two z plus one) squared. just to make the complex structure clear, let's write it this way, and clearly there's a pole at negative half. well, the next thing to do is to find the inverse mellin transform, because as we recall, that gives us the density of states function, which we will then laplace transform to get the Z_n integral that we want. to find the inverse mellin transform, one can look up a table of mellin transforms and their properties and massage the zeta function into a known form. someone has done that, and that's the lemma: the mellin transform of the following function, one on (m minus one) factorial, times t to the lambda minus one, times (log of a on t) to the m minus one (it's going to be a complicated expression, but bear with me), where m is a natural number, a is a number greater than zero, lambda is another number greater than zero, and the function is defined like that for t between 0 and a and equal to 0 for t greater than a, is given by f of z equal to a to the z over (z plus lambda) to the m. okay, the proof: i'm not going to subject you to the calculation, but look up a table of mellin transforms, or integrate by parts, which is how some of the properties of the mellin transform are derived anyway; you can just write down the mellin transform of the above function, integrate by parts, and you will get the expression. so let's go to the next board and see what that means for us.
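A quick symbolic and numerical spot-check of this example is easy to do; the sketch below uses sympy and scipy, verifying that the zeta function is 1/(2z+1)^2 and that the density of states suggested by the lemma reproduces it. The particular value z = 1 used for the check is an arbitrary choice.

```python
# Spot-check of example 4.7: zeta(z) for K(x1, x2) = x1^2 x2^2 with uniform prior
# on the unit square, and the density of states suggested by the lemma
# (a = 1, lambda = 1/2, m = 2).
import numpy as np
import sympy as sp
from scipy.integrate import quad

x, y, z = sp.symbols("x y z", positive=True)
zeta = sp.integrate(x**(2*z) * y**(2*z), (x, 0, 1), (y, 0, 1))
print(sp.simplify(zeta))                 # expect 1/(2*z + 1)**2, a double pole at z = -1/2

v = lambda t: 0.25 * t**(-0.5) * (-np.log(t))        # candidate density of states
val, _ = quad(lambda t: t**1 * v(t), 0.0, 1.0)       # integral of t^z v(t) dt at z = 1
print(val, 1.0 / (2*1 + 1)**2)                       # both should be approximately 1/9
```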
by the way, that lemma is given in that generality because it's the only mellin transform we are going to use for all our transform needs, at least when our KL divergence is in normal crossing form, which is guaranteed by resolution of singularities, and that's next time. for now, what this means for us is that our density of states function for our example... remember our zeta function was that, so that's our m and that's our lambda, which gives us one fourth, times t to the lambda minus one, times (log of one on t) to the m minus one; that's just one fourth, t to the negative half, times negative log t. and finally we get our Z_n integral equal to the laplace transform of v of t. this, if you recall from the lemma, is defined between 0 and a, and our a is just 1 in this case, and the function is 0 otherwise, so the integral below, from 0 to infinity, reduces to the integral from 0 to 1 of the expression above. right, so we want to extract the asymptotics as n goes to infinity. we do a change of variable from t to t on n, which changes the integration terminals to zero to n: t on n to the negative half, times negative log of t on n, times e to the negative t, dt, and we can start pulling out n's. there is a root n in here, and, oops, sorry, i forgot the jacobian, dt on n, so there's also a one on n; the root n cancels with the one on n to give us one on four root n out the front. and we split: this term here is negative log t plus log n, so let's write the log n integral first, zero to n of log n times t to the negative half, e to the negative t, dt, minus zero to n of t to the negative half times log t, e to the negative t, dt. so this first term will contribute a log n up the front, and the second term won't. our leading term now becomes log n multiplied by the integral, and i'm going to do a trick: the integration terminals go from 0 to n, but let's write 0 to infinity and then subtract the rest. so we are splitting the integral from zero to infinity into two parts, the first part zero to n, which is the original integral, and n to infinity, which is something that, let's write it up here, so the zero to infinity integral is the zero to n part plus the n to infinity part, and the contribution of that last part goes to zero as n goes to infinity, but we actually need something stronger than that; let's just write it out first. so the leading term is one on four root n, times log n, times the integral from zero to infinity of t to the negative half, e to the negative t, dt, minus (there is no log n for the second term) one on four root n times the integral from zero to infinity of t to the negative half, log t, e to the negative t, dt, and then we shove everything that still has n dependence into a remainder term, something that depends on n.
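As a sanity check on this asymptotic, one can integrate Z_n numerically for the example and compare it against the leading term (sqrt(pi)/4) log n / sqrt(n); the sketch below uses scipy, and the particular values of n are arbitrary. The ratio approaches 1 only slowly because of the constant-order correction.

```python
# Numerical check of the asymptotics of Z_n = double integral over [0,1]^2 of exp(-n x^2 y^2)
# against the leading term (sqrt(pi)/4) * log(n) / sqrt(n).
import numpy as np
from scipy.integrate import dblquad

def Zbar(n):
    val, _ = dblquad(lambda y, x: np.exp(-n * (x * y) ** 2), 0.0, 1.0, 0.0, 1.0)
    return val

for n in [1e2, 1e3, 1e4]:
    lead = np.sqrt(np.pi) * np.log(n) / (4.0 * np.sqrt(n))
    print(f"n={n:>8.0f}  Z_n={Zbar(n):.5f}  leading={lead:.5f}  -log Z_n={-np.log(Zbar(n)):.3f}")
# -log Z_n grows like (1/2) log n - log log n + O(1), i.e. lambda = 1/2 and m = 2
```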
and it remains to prove that this remainder is little o of one on n, that it vanishes faster than one on n. not only does the integration region shrink: the integrand itself has an e to the negative t in it, so as t is forced into the far right of the real line, the integral vanishes very quickly. by the way, if someone wants to check my working, i think there's a typo in the book for this term, so please do check that and see if there's anything wrong with what i've done here. so we have now, not recovered, extracted the leading term, and if you recall the definition of F, F is the negative log of this, so the leading behaviour, asymptotically, of the negative log of this is half log n. i don't have room, so let me write it horizontally here: F is equal to negative log of Z_n, which is equal to half log n plus dot dot dot. so our lambda, our RLCT, is a half here, which you can also compute easily from the normal crossing form. okay, so that's kind of the core technique for evaluating a singular integral like this. for the remaining few minutes, let's go back to the first board, and i will just preview what we should do next to get the full result. we are at kind of the bottom level; we have just done a simple example, this is our example, but the core observation we made regarding the laplace transform, the mellin transform and the density of states function is going to be a constant technique that we'll be using throughout. from this example we will start to generalize towards the form that we want. the next step would be to add variables: going from these two dimensions to an arbitrarily large number of variables, but with the restriction that these quantities (the h's and k's are vectors of natural numbers), sorry, that's completely wrong, the quantities h one plus one over two k one, and so on, are all still constant and equal to lambda. then the zeta function factorizes: the zeta function integral factorizes into individual integrals, and evaluating those gives you the same kind of result, just one on (z plus lambda) raised to the m.
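One way to write this next rung of the ladder in symbols, assuming the exponents are arranged as described (h's on the prior and Jacobian, k's on K, everything on the unit cube), is:

$$
K(x) = x_1^{2k_1} \cdots x_d^{2k_d},
\qquad
\varphi(x)\, dx = x_1^{h_1} \cdots x_d^{h_d}\, dx \ \text{ on } [0,1]^d,
$$
$$
\zeta(z) = \prod_{j=1}^d \int_0^1 x_j^{2 k_j z + h_j}\, dx_j
         = \prod_{j=1}^d \frac{1}{2 k_j z + h_j + 1}
         = \frac{C}{(z + \lambda)^d}
\quad\text{when } \frac{h_j + 1}{2 k_j} = \lambda \text{ for all } j,
$$

with $C = \prod_j (2 k_j)^{-1}$, so the pole at $-\lambda$ now has multiplicity $d$.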
the next level would be to generalize this even further still, to not only have these variables but also something we would probably call the irrelevant variables, the y's: we have the x's, where the exponents are the k's, and the y's, where the exponents are the k primes, and the prior and the jacobian are x to the h and y to the h prime; it's just that the lambda from before is strictly smaller than the same quantities formed from the primed exponents, so those are the coordinates that do not contribute to the leading term, so to speak. then the next step will be to add in some relevant modifications, so that it starts to look like the partition function we actually want to evaluate. this is invoking watanabe's main theorem one, which we haven't quite discussed yet, so to get to this step we probably need to go through main theorem one as well. the addition here is that we add in these two terms, which are smooth functions, and i will call this a sort of partition function in standard form, standard because everything relevant for evaluating the leading term is in normal crossing form. and the final level would be actually getting to the normalized partition function, and the innovation there would be... well, the expression we care about is actually K_n, if you recall, but we will express K_n as a difference from the quenched version; the quenched version is kind of the centralized, expected quantity of K_n, and we just express K_n as a difference from it. so we express K_n as K... wait, i shouldn't do this live, i'm bad at getting signs right, is it minus K plus K... anyway, we express it as a difference, sorry, okay, i'm just not going to do that: K_n in terms of K plus a root n fluctuation term, and there's also a dividing out by K as well, to make sure that things normalize so that central limit theorems and such can apply. i'm not going to go through that now; when we discuss main theorem one i'll go through it carefully. as long as the idea here is clear: you replace K_n by K plus the difference between K_n and K, which is a random variable, and then this xi is introduced to be that random variable, i mean, with the square root taken out there, yeah, yeah, exactly. so what we're doing here is that we have K, but it is not in normal crossing form, but
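The speaker deliberately avoids writing the decomposition live. For orientation only, the version I recall from the gray book (so the signs should be checked against Watanabe, and this should not be read as the board content) is:

$$
K_n(w) = K(w) - \frac{1}{\sqrt{n}}\, \sqrt{K(w)}\, \xi_n(w),
\qquad
\xi_n(w) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{K(w) - f(X_i, w)}{\sqrt{K(w)}},
\qquad
f(x, w) = \log \frac{q(x)}{p(x \mid w)},
$$

so that $\xi_n$ is the centred, $\sqrt{K}$-normalized empirical process to which a central limit theorem can be applied.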