WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed. The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.

Synopsis


The Central Limit Theorem states that when many independent random variables are aggregated, the result tends to a Gaussian distribution. Watanabe's 2010 paper builds on algebraic geometry and statistical learning theory to define Bayes observables and a generating function for learning theory, and to derive universal laws of learning. The talk also discusses the link between statistical learning and physics, and how renormalization relates to the idea of a learning machine. Statistical learning theory studies which models work well for which true distributions, while a Bayesian learning machine uses a model and a data set to construct an empirical log loss and averages the predictions of all distributions in the model. Expected values and variances can be extracted from a cumulant generating function using Taylor's theorem.

Short Summary


The Central Limit Theorem states that when aggregating a large number of independent random variables with finite variance, the result is a Gaussian distribution. The Gaussians form a low-dimensional subspace of the space of distributions: in the one-dimensional case only two real numbers (mean and variance) are needed to completely specify the distribution. Watanabe's 2010 paper uses the background of algebraic geometry and statistical learning theory to define Bayes observables and a generating function for learning theory, and to derive universal laws of learning theory. It discusses the link between statistical learning and physics, and how renormalization relates to the idea of a learning machine.
A generalization of the Central Limit Theorem says that when heavy-tailed variables are aggregated, the result flows to a stable distribution rather than a Gaussian. Statistical learning theory studies which models work well for which true distributions and learning machines. The Bayesian learning machine takes a model p(x|w) and a data set, constructs the empirical log loss, and averages the answers from all the distributions in the model, weighted by the posterior, to predict the probability of a given x. Bayes' theorem is written in a form that is convenient for the mathematics, and the posterior is tempered by a parameter beta, giving a family of learning machines. Training error is evaluated by averaging over w, and is used to judge how good the predictive distribution p-hat is.
Gibbs prediction involves randomly selecting a weight from the posterior distribution, making a prediction on a data point with that single weight, taking the negative log of the prediction, and averaging over the data set and over the posterior. The expected value of a function of w and its variance under the posterior are computable from data; the corresponding theoretical quantities are obtained by replacing the empirical average (one over n times a sum) with the expectation under the true distribution. The expected log loss, normalized by the entropy of the true distribution, equals the KL divergence between the truth and the model. The optimal parameters are defined as the set of parameters at which the expected log loss achieves its minimum, and essential uniqueness means that all optimal parameters define the same distribution p_0(x).
A generating function with parameter alpha is defined from the loss f(x,w), which measures the loss at a particular parameter value normalized by the optimal distribution. The moment generating function encodes the moments, with the first moment being the mean and the second moment related to the variance. The cumulant generating function of f(x,w) is (up to the normalizing evidence) the negative log of the posterior average of e to the power of negative alpha f(x,w), which connects to statistical physics. Differentiating the generating function with respect to alpha extracts the quantities of interest, just as differentiating the log partition function with respect to inverse temperature gives the energy in physics. This function is useful for computing with the posterior distribution. Essential uniqueness states that all optimal parameters define the same distribution.
The Gibbs training error and the data-averaged Bayes generalization error can both be expressed in terms of the generating function. The Gibbs training error is the negative of the posterior-averaged log likelihood of the data, averaged over the data set. The data-averaged Bayes generalization error is derived by multiplying and dividing by the normalized evidence, which turns the expression into a posterior average. F_n evaluated at zero equals the expected normalized free energy, and F_n(1) minus F_n(0) recovers the data-averaged Bayes generalization error up to the constant L_0.
Taylor's theorem states that a three times continuously differentiable function can be approximated by a second-order expansion with a controlled remainder. Applied to the generating function, this expansion is combined with identities such as E[B_g] = L_0 + F_n(1) - F_n(0), with the first and second derivatives at zero related to posterior means and variances. In the Gibbs training error calculation, the posterior average of the likelihood is simplified by dividing by the optimal distribution and then treating the test point as an extra training point, absorbing it into the numerator so that the overall (normalized) evidence for n data points appears.
In statistical physics, and analogously in learning models, the details of the system do not matter as long as they satisfy the same thermodynamic properties, that is, the same equation of states. This is analogous to the Central Limit Theorem: as larger and larger data sets are aggregated, the six observables come to depend on only three quantities. The derivatives of the generating function and L_0 are the quantities that control learning behaviour, and the renormalizability condition states that if the system is renormalizable then gamma equals one. The Taylor expansion is substituted back in, cancelling the F_n(0) terms, and the learnability assumption is used to express the difference between the (n-1)- and n-data versions in terms of an error term smaller than one over n to the gamma.

Long Summary


Watanabe's 2010 paper on singular learning theory builds on the background of the gray book, Algebraic Geometry and Statistical Learning Theory (2009), and the green book, Mathematical Theory of Bayesian Statistics (2018). The paper defines Bayes observables and a generating function for learning theory, and derives universal laws of learning theory. The lecture discusses the link between statistical learning and physics, and how renormalization relates to the idea of a learning machine. The goal of the lecture is to explain the concepts behind the terms used in the paper.
The Central Limit Theorem states that when aggregating a large number of independent random variables with finite variance, the result is a Gaussian distribution. The Gaussians form a low-dimensional subspace of the space of distributions: in the one-dimensional case only two real numbers (mean and variance) are needed to completely specify the distribution, and in higher dimensions only a mean vector and a covariance matrix. The theorem has been generalized, and the important point is that the details of the individual particles do not matter; only the aggregate does.
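As a loose numerical illustration of this aggregation picture (my own sketch, not part of the talk): averages of many independent, clearly non-Gaussian variables are summarized by just a mean and a variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Average n independent, non-Gaussian variables (exponential with mean 1), many times over.
n, trials = 200, 10_000
aggregate = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# Only two numbers survive the coarse-graining: the mean and the variance.
print(aggregate.mean())  # close to 1.0, the mean of the underlying exponential
print(aggregate.var())   # close to 1.0 / n, shrinking as the aggregate grows
```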
A generalization of the Central Limit Theorem says that averages of heavy-tailed variables flow to stable distributions rather than Gaussians. The SLT setup involves a true distribution q(x), data D_n generated from it (n IID samples), and a model, a parametrized family of distributions p(x|w) with parameter space W, a subset of R^d. A learning machine takes in the data and the model and outputs a predictive distribution; a good machine has p-hat close to q. Statistical learning theory studies which models work well for which true distributions and learning machines.
The Bayesian learning machine outputs the Bayesian predictive distribution: it takes the model p(x|w) and the data set, constructs the empirical log loss, and forms the posterior. To predict the probability of a given x it averages the answers from all the distributions in the model, weighted by the posterior (Bayesian model averaging). The prior is treated on an equal footing with the model; here "prior" refers to where the factor appears in the formula rather than to a genuine prior belief over the parameter space.
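To make the averaging concrete, here is a hedged toy sketch (my own illustration, not from the paper): a one-dimensional model p(x|w) = N(w, 1), a grid standing in for the parameter space, a flat prior, and the predictive distribution obtained by posterior-weighted averaging.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(0.5, 1.0, size=50)        # pretend truth: q(x) = N(0.5, 1)
w_grid = np.linspace(-3.0, 3.0, 601)        # discretized parameter space W

# Posterior over the grid (flat prior, beta = 1): proportional to the likelihood.
log_lik = norm.logpdf(data[:, None], loc=w_grid, scale=1.0).sum(axis=0)
posterior = np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()

def p_hat(x):
    # Bayesian model averaging: average p(x|w) over the posterior weights.
    return float(np.sum(posterior * norm.pdf(x, loc=w_grid, scale=1.0)))

print(p_hat(0.0))  # should be close to the true density q(0)
```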
Bayes' theorem is written here in a form convenient for the mathematics, with the likelihood expressed as e to the minus n times the empirical log loss. Introducing an inverse-temperature parameter beta gives a tempered posterior: beta equal to one recovers the strict Bayesian posterior, and other values give other learning machines; in practice a machine might instead draw samples from the posterior or run (stochastic) gradient descent. The Bayes observables are analogous to observables in quantum mechanics and state variables in statistical physics. The training error is evaluated by averaging over w under the posterior, and is used to determine how good p-hat is.
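Written out, the tempered posterior the speaker describes takes the following form (a reconstruction from the transcript; the paper's notation may differ slightly):

```latex
p(w \mid D_n) \;=\; \frac{1}{Z_n(\beta)}\,\varphi(w)\,e^{-n\beta L_n(w)},
\qquad
L_n(w) \;=\; -\frac{1}{n}\sum_{i=1}^{n}\log p(X_i \mid w),
\qquad
Z_n(\beta) \;=\; \int_W \varphi(w)\,e^{-n\beta L_n(w)}\,dw .
```

The Bayesian predictive distribution is then the posterior average of the model, and beta = 1 recovers the strict Bayesian posterior.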
Bayesian prediction averages the predictions of all parameters under the posterior; the Bayes training loss is the negative log of this averaged prediction at each data point, averaged over the data set. An alternative is Gibbs prediction: randomly choose a weight from the posterior, make the prediction on the data point using that single weight, take the negative log, average over the data set, and finally average over the posterior draw. This gives the Gibbs training loss.
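A minimal sketch of how the two training losses could be computed from posterior samples (a hypothetical helper, assuming draws w_1, ..., w_S from the posterior, e.g. via MCMC):

```python
import numpy as np

def bayes_and_gibbs_training_loss(log_p):
    """log_p: array of shape (S, n) with log p(x_i | w_s) for S posterior draws
    and n training points; each row uses one sampled parameter."""
    # Bayes training loss: average the predictions first, then take -log.
    bt = -np.mean(np.log(np.mean(np.exp(log_p), axis=0)))
    # Gibbs training loss: take -log of each prediction first, then average over draws.
    gt = -np.mean(log_p)
    return bt, gt
```

By Jensen's inequality the Bayes training loss is never larger than the Gibbs training loss, which gives a quick sanity check on such an implementation.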
Given a data set, the expected value of any function of w, and its variance, can be computed under the posterior; the theoretical counterparts are obtained by replacing the empirical average (one over n times a sum) with the expectation under the true distribution. A key example is the functional variance: the posterior variance of the log loss at each data point, averaged over the data set. Because the posterior is built from the data, these quantities are still random variables, but they are computable in practice by drawing samples from the Bayesian posterior, for example with Markov chain Monte Carlo.
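In the same spirit, the functional variance described here can be estimated from the same posterior draws (again a sketch under the assumption that log_p holds log p(x_i | w_s) for sampled parameters):

```python
import numpy as np

def functional_variance(log_p):
    """Posterior variance of log p(x_i | w) at each data point, averaged over the data set.
    log_p: array of shape (S, n) of log-likelihood values for S posterior draws."""
    return float(np.mean(np.var(log_p, axis=0)))
```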
The expected log loss, normalized by the entropy of the true distribution, equals the KL divergence between the truth and the model at the given parameter. This KL divergence can be made zero only if the true distribution is realizable by the model. The optimal parameters are defined as the set of parameters at which the expected log loss achieves its minimum, and p_0(x) is defined to be the distribution at a chosen optimal parameter. In the realizable case p_0(x) equals q(x); essential uniqueness is the assumption that every optimal parameter defines this same distribution.
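A hedged toy example of locating an optimal parameter (my own illustration): scan a parameter grid, compute the KL divergence from the truth to each model distribution by numerical integration, and take the minimizer.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8.0, 9.0, 4001)             # quadrature grid for the integral
q = norm.pdf(x, loc=0.5, scale=1.0)          # pretend truth q(x) = N(0.5, 1)
w_grid = np.linspace(-3.0, 3.0, 601)         # model p(x|w) = N(w, 1)

def kl_truth_to_model(w):
    p = norm.pdf(x, loc=w, scale=1.0)
    return np.trapz(q * (np.log(q) - np.log(p)), x)

K = np.array([kl_truth_to_model(w) for w in w_grid])
w0 = w_grid[np.argmin(K)]
print(w0, K.min())  # about 0.5 and about 0: the truth is realizable here, so the minimum KL is zero
```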
Essential uniqueness states that all optimal parameters define the same distribution. The loss at a particular parameter value is normalized by the optimal distribution, giving f(x,w) = log(p_0(x)/p(x|w)) and the corresponding normalized losses K_n(w) and K(w); L_0 denotes the loss at an optimal parameter, which is the same for every choice. The normalized evidence is the evidence with n K_n(w) in the exponent instead of n L_n(w). The true distribution is regular for the model if the optimal set is a singleton and the Hessian of L at w_0 is positive definite; otherwise it is singular.
A generating function with parameter alpha is defined as an expectation over the training data set (which enters through K_n) and over a new test point. The cumulant generating function of a random variable is the log of its moment generating function: the first cumulant is the mean, the second cumulant is the variance, and higher cumulants are not as easily interpretable. The moment generating function encodes the moments, with the first moment being the mean and the second moment related to the variance. The cumulant generating function is also related to statistical physics.
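To see concretely how the cumulant generating function encodes the mean and variance, a small hedged sketch (not from the paper) that recovers them from numerical derivatives at zero:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.gamma(shape=2.0, scale=1.5, size=200_000)  # any variable with a moment generating function near zero

def cgf(t):
    # cumulant generating function: log of the (empirical) moment generating function
    return np.log(np.mean(np.exp(t * y)))

eps = 1e-3
first_cumulant = (cgf(eps) - cgf(-eps)) / (2 * eps)               # = mean
second_cumulant = (cgf(eps) - 2 * cgf(0.0) + cgf(-eps)) / eps**2  # = variance
print(first_cumulant, y.mean())   # both about 3.0
print(second_cumulant, y.var())   # both about 4.5
```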
The cumulant generating function of f(x,w) under the posterior is essentially the negative log of the posterior average of e to the power of negative alpha f(x,w). This is related to statistical physics: the posterior plays the role of the Gibbs (Boltzmann) distribution, and the cumulant generating function of f(x,w) at alpha plays the role of the log partition function. Differentiating with respect to alpha extracts the state-variable-like quantities, just as differentiating the log partition function with respect to inverse temperature gives the energy in physics. This function is useful for computing with the posterior distribution.
The data-averaged Bayes generalization error can be calculated from the generating function by rewriting it as a posterior average: multiplying and dividing by the normalized evidence turns the expression inside the expectation into a posterior average plus a free-energy term. F_n evaluated at zero is the expected normalized free energy, and F_n evaluated at one contains the Bayes generalization error, giving E[B_g] = L_0 + F_n(1) - F_n(0). Continuing in this fashion yields formulas for the Bayes observables in terms of the generating function evaluated at various points.
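Schematically, as reconstructed from the derivation in the transcript (so the precise normalization may differ from the paper), the identity being derived is:

```latex
F_n(\alpha) \;=\; \mathbb{E}_n \mathbb{E}_x\!\left[-\log \left\langle e^{-\alpha f(x,w)}\right\rangle\right]
\;+\; \mathbb{E}_n\!\left[-\log Z_n^{(0)}\right],
\qquad
f(x,w) \;=\; \log\frac{p_0(x)}{p(x \mid w)},
```

so that F_n(0) is the expected normalized free energy, and

```latex
\mathbb{E}_n\!\left[B_g\right] \;=\; L_0 \;+\; F_n(1) \;-\; F_n(0).
```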
The training errors can also be expressed via the generating function; the calculation involves differentiating the log and exchanging the derivative and expectation operators. The first derivative of the generating function at zero is the expected posterior mean of f(x,w), and the second derivative at zero is minus the expected posterior variance of f(x,w), that is, minus the functional variance. The training versions are obtained from the generating function built on n-1 data points, evaluated at shifted arguments such as 1+beta and beta.
The Gibbs training error is the negative of the posterior-averaged log likelihood of the data, averaged over the data set; since the data points are IID and interchangeable under the outer expectation, it reduces to the term for a single data point. To relate it to the generating function, the first derivative is evaluated at beta using the generating function built from only n-1 data points, because the posterior average otherwise depends on the data point being predicted. Here f(x,w) is the log ratio of the optimal distribution to the model, and the optimal distribution factors out of the posterior average.
In the remaining term, the optimal distribution is pulled out of the posterior average, contributing the constant L_0, and the factors of p(x|w) to the power beta in numerator and denominator partly cancel. The key trick is to treat the test point as an extra training point: absorbing the factor p(x|w)^beta into the integral defining the (n-1)-point posterior produces the normalized evidence for n data points. The result expresses the expected Gibbs training error as L_0 plus the first derivative of the (n-1)-point generating function at beta.
Taylor's theorem states that if a function is three times continuously differentiable, then it can be approximated by a second-order expansion with a cubic remainder term evaluated at some intermediate point; the expansion does not need to converge as a power series, only the remainder needs to be controlled. At this stage the six observables have been expressed in terms of roughly nine evaluations of the generating function and its derivatives, and the expansion is used to reduce them further. The remainder term must be controlled over the region of interest, whose largest argument is one plus beta.
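The third-order Taylor expansion with Lagrange remainder that the speaker invokes has the standard form (stated here for the generating function, under the smoothness assumption in the transcript):

```latex
F_n(\alpha) \;=\; F_n(0) \;+\; F_n'(0)\,\alpha \;+\; \tfrac{1}{2}F_n''(0)\,\alpha^2 \;+\; \tfrac{1}{6}F_n'''(\alpha^*)\,\alpha^3
\quad \text{for some } \alpha^* \text{ between } 0 \text{ and } \alpha,
```

and the universal laws follow when the remainder is of order smaller than n^(-gamma) uniformly over 0 <= alpha <= 1 + beta, together with the assumption that the difference between the (n-1)- and n-point versions is of the same small order.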
In statistical physics, the details of a system do not matter as long as it satisfies the same thermodynamic properties, that is, the same equation of states. Similarly in learning, the details of the learning machine and of the truth do not matter as long as they give the same equation of states. This is analogous to the Central Limit Theorem: as larger and larger data sets are aggregated, the six observables come to depend on only three quantities, and the resulting linear relationships among them are called the equations of states. As n grows, the difference between F_(n-1) and F_n is of the same small order as the Taylor remainder, which is what allows the reduction in the number of variables.
The derivatives of the generating function, together with L_0, are the quantities that control learning behaviour. The renormalizability condition states that if the system is renormalizable then gamma equals one, and F'_n(0) is governed by lambda over beta plus nu, where lambda is the real log canonical threshold and nu is another birational invariant (a variance-like term). The Taylor expansion is substituted back in, cancelling the F_n(0) terms, and the learnability assumption is used to express the difference between the (n-1)- and n-data versions in terms of an error term smaller than one over n to the gamma.
The generating function lets the six observables be expressed in terms of only three quantities. The learnability assumption says that for large data sets only the first three terms of the Taylor series matter, so two learning problems whose cumulant generating functions agree up to third order lie in the same universality class. In the renormalizable case the fixed point of this coarse-graining flow is parametrized by two numbers, the real log canonical threshold lambda and the variance-like invariant nu, which characterize the fixed point. This supports the picture of six observables determined by three terms modulo some irrelevant error.

Raw Transcript


okay not sure I needed to write all that but um it's uh so Watanabe um 2010. so I guess a bit of a timeline is that say the um the Gradebook uh is 2002. so great book uh background book I think I should say that gray book means algebraic geometry and statistical learning theory the book um that's um 2009 and then there's the green book um which is mechanical theory of Asian statistics that's 2018. um so that is kind of an evolution of um of techniques here and um um today I'm going to stick to the notation and technique uh presented in the paper um but I think the green book um has updated sort of notation and sleeker presentation and the quantities are defined a little bit differently so if you're reading that you might want to be aware a little bit um and the Gradebook provides sort of some sort of the foundational results that sort of feed into um later works so sort of the most of the ideas and tedious calculation is done in the gray book um and the green book and this paper sort of uses those um you know in a very crucial way but with a lot more sort of conceptual stuff layer on top of that yeah all right the author is the obvious here Edmund if you want to attach sure let me do that foreign okay so that sort of the technical goal um today would be um I guess it's um based on the audience I guess it's it's worth to go through sort of um uh the usual definition route the setup of our singular learning theory as well as Define some of those um quantities important quantities cause uh that that is known as Bayes observables um in this paper and then Define something called um a generating function for um for the learning theory and derive Universal laws of learning theory from from that right um so and there's that console conceptual goal um is kind of um explaining some of the the word choices above right so why do we use syllables and um what generating functions and what's it going to do with learning and universal law and oh and what is renormalization minimizability conditions Etc so but overarching um an overarching theme will be um link um sort of stats learning so statistical learning to that physics and just based on the terms being in use there are some there's already some link there and we will discuss I I don't think I can make it tight but at least I can I think throughout throughout the tedious calculation I will motivate um the link here um and sort of discuss the idea oh we know normalization in vending machines right um I guess our front load this
the following um analogy so I should say let me use so I'm going to mention this just so that we have it at the back of our mind throughout the presentation which is a central limit theorem so um [Music] I'm not going to state it in its full generality but um I think in sort of conceptually I'm going to study in this way so Central limit theorem says that agreed aggregating I don't know how to spell advocating um large number of independent random variables um result in a gaussian provided dot dot dot right provided uh some some of things whole um I guess I don't need to do dot dot here I can say something like um provided all of them has finite variants right and and there's there's a lot of generalization of this result um let me draw a picture um so we have kind of a sort of space of distribution with finite variants and then there is kind of a sub space of gaussians an important thing to note is that gaussians are completely specified spare C5 by damine and bearings by by two well by uh to run two two real numbers in the one-dimensional case and I I don't know D plus d times D minus one two real numbers in the three-dimensional case um but the point is that this is saying um if you start off in this space in this very very large space infinite dimensional space of distribution with finite variants if you start doing this kind of summarizing or aggregating and you have lots of independent realization lots of particles um and and then you start you start to say that I don't care about the um I don't care about the details of each individual particles what they are doing and I only care about Aggregates of them and if you take increasing increasing larger and larger aggregate of of these independent random variables then you will flow this this process of aggregating if ever larger number of independent random variable will flow you towards this um much lower dimensional um Subspace where where only a few [Music] um by a few I mean just the MU and sigma um details uh a few sort of correct characteristic of the system survive everything else are quote unquote and important details um and uh the generalization of this this Central limit theorem so um but it's getting it's getting a bit small the writing down the bottom there okay um that's oh um yeah okay you don't have to rewrite it's just I just want to stop you from further converging to variance yeah of your characters around this Center yeah I can see that the off can you know I can do that's very small okay I'll write
larger all right um but but yeah so the generalization of Center limit theorem will be um sort of stable diffusion stable stable diffusion sorry about that slip but stable distributions um uh so where if you have a heavier tell so no Planet variants you you flow to some odd Subspace um for example right okay um okay so um just there to uh so that's an analogy that I'll sort of keep hope everyone keep at the back of that nine so um claiming that that's kind of a broad conceptual uh renormalization the the course graining of of data resulting in uh in a flow of model by model we mean the the sub SQL distribution describing the data um that model as you co-screen the model flow towards a fixed point all right uh okay let's start off with some non-conceptual and technical stuff right so um here's sort of revisiting um SLT setups and I guess it's good to go through this um just one more time so usually we have um uh a true distribution and we usually notate that of Q of X um that's sort of Cordon of from our world that's something that we never we can never interact with directly but we can um but we do know that it we can only access it through the data that it generates and we usually annotate that as DN and each individual data generated from it we have other access and little n it always means data set size and we assume that the X i's are IID so independent and academically distributed ID from Q of x right and um and then the other component is well you have you know humans um you so we design or come up with a model which is um which is a parametrized family of distributions that meant to um that are distribution in the same space with parameter space we call W and it's parametrized uh by a subset of R to the D and D will always be the parameter dimension right and what is a learning machine well a learning machine well it takes in um facing the data and texting the model that you design and spits out distribution uh well the prediction without a prediction so that's so a good learning machine will have P Hat close to Q close to the truth um and the theoretical properties of this this diagram this this um of this learning machine is the the setting of statistical learning theory right for for what kind of model what model works well um for different types of learning machine for what true distribution that's kind of the general area of study of physical learning and in singular learning theory um at least in its current form we are mostly concerned with the Bayesian
learning machine so what that means is that so I just want to specify what what is this P hat thing and that he had in the Bayesian learning machine um means the the Bayesian predictive distribution um so that's equal to um well it's saying that to produce um the to predict the probability of a given X well you can ask what's the probability of X for any of the uh distribution in the model but each of them will have different kinds of performance but we just average all of them average all their answers so this is in machine learning term this is called Bayesian model averaging um using the posterior and here I'm using sort of physics notation here like this is like ensemble operating so already I'm sort of treating the Bayesian posterior as um as analogous to the building distribution well in um in an explicit form that's average in this function with the patient posterior um posterior P of w given the n is equal to one over the partition function of the evidence e to the negative n l n of w ow is the empirical block loss um so that end is the evidence um I think you need an eye on your sum and maybe x i thank you thank you that's capital x i and that thing there is a prior okay so in my picture above I don't really know where to put my prior I guess it's I I'm conceptualizing it as part of the learning machine it's part of that that machine that it carries a little bit of Prior or human can specify right um I guess it's important to note that in multiple places mentioned that the prior although it used the same word it doesn't really mean the usual thing of the prior being specified in a prior belief over the parameter space it's it's really uh it's keeping that um the status of the prior us well where does it appear in the formula that's that's the status kind of thing so frequently it's treating both model and prior on sort of equal ground okay um please stop me um if you have questions that's new to the next board okay let's define some base observables so the idea is it might actually be better if you move back a little bit so that this is the second board and people can keep seeing yeah there's good yeah okay cool yeah thanks right so um so think of this as your question yeah yes please yes in the previous slides this this page and learning machine is p hat so yep the to get this P hat the inputs you have is this model PX given W as well as this data set the end the data set is used to construct this load the empirical log function right okay so if the data sets give you this
lnw yeah yeah okay and yeah yeah yeah go ahead yeah so I guess it's worth pointing out that um this is this is this is not usually how Bayes theorem is written um this is a convenient form for us to do mathematics uh later on like usually this is um usually we have substitutional code co uh call this this for e to the negative n l n term here the likelihood hmm so um so that's that's a function of of w and you like if you unwind this this is just the product over all um all the data like that right um yeah all right I forgot there's a beta in here and let me just put that in so so beta equal to one you you get back the strict Bayesian learning uh Bayesian um posterior and beta not equal to one it's just a temperate posterior so just another kind of learning machine so this blending machine itself is parametrized biodata should probably put the beta in the partition function then too thank you and beta that's very legible in here that's fine okay yes that is your picture of this learning machine okay yeah good thanks yeah I I I should probably also say that the reason I I sort of conceptually conceptualize it like this is that well that learning this this is not much of a practical learning machine like uh this integration patient motor average in complete based on aluminum averaging is almost surely not achievable for neural network in in practice um but that this machine could be something like um uh let's draw something from the posterior or uh uh sarcastic gradient descent or gradient descent or whatever right so you have a procedure that come out with P hat that's the your learning theory um goes from that goes from that right so in the next board uh we have uh we want to Define base of syllables and the term of the bubble is astral analogous to the observables in I guess that's mostly used in quantum mechanics but I guess the statistical physics are also used has has the term State variables and analogous to that like the number of particles um the volume uh internal energy entropy Etc um yeah um I actually think that the obstacles that we are defining is I'm writing all extremesake um set variables here I think most of the things I'm going to Define are intrinsic uh sorry extensive versus intensive but anyway so again so we have the base um so training error so so this is equal to well you can we can sort of evaluate um the how good our uh our P hat is um so remember P hat is the base average um of a w so whenever you see an angle bracket and it's just averaging over
anything that depends on W in it so for each of the data points we can that's that's the prediction so that's equal to that's equal to P hat of x i um so if that particular W is really good um then that number will be very high take the log take the negative which means that instead really high we want it to be really low and then average that over our data set um so that's our base training loss so it's a training loss for um it's a lot loss for average lot loss for our P hat and then there is another way of um doing um another way of doing prediction which is the git type um vending machine what is that average over which average sorry the the inner one the the brackets the break sorry throughout this talk uh bracket always means um posterior averaging yeah so uh we're thinking we're thinking of the posterior like the Gibbs distribution and angle bracket for um Ensemble averaging um right so there's a Gibbs way of doing um uh doing prediction which is let me randomly choose a w and I just use that for well randomly choose it uh choose choose a w how do how do we do the randomness well the randomness come from the patient posterior um so uh W with high posterior mass will be chosen more often and then make the prediction on X I using that that one and then that's the loss take the log type of negative that's the um that that's the log loss of that prediction average that over all your data set that's an average log loss well this is but this is one realization of w so since we are since W comes from um the Bayesian posterior let's just average over the posterior so that's the Gibbs training loss okay yeah okay can you show the explain a little bit about this activation which the angle bracket are averaging yeah so um so so you do have um how would you actually do it or what it is yeah what it is it how how to do it okay right so if you have um so if you if if your W is always drawn from the patient posterior so does this notation make sense to you um then um if you have any function of WF of w well now this is a this is a random variable that depends on W you can take the posterior average of that which is um which is equal to half of w and um positive integrate that overall so that's that's what it is okay yeah I see it so you you yeah so the the density function you take the one you defined in the previous slide for average you average this okay good yeah yeah correct yeah the idea is meant to be that BT and GT are somehow things you might actually compute and to do that you would have to
somehow draw samples from the Bayesian posterior and to do that you would use Markov chain Monte Carlo or some variant of that that may be difficult yeah but in principle you can do it so these are things that you can compute in practice um uh so using um data set um really hard though but you can you can actually compute it right it doesn't refer to the true true distribution that's that's kind of the the point yes so they each each data set will produce you this this function on W and then you arrive and average it using that density yeah okay yeah so once you have the patient learning machine um you know pin down your beta or something then everything in here can it is it's just um very difficult some say integrals of the data set um with the data set as input and integrals over the parameter space um right and then there's the theoretical quantities which is because you just replace all the um replace all one on n and sum with um expectation of the um the true um distribution and uh sorry replace that with that breath so um so those are those are theoretical quantities right refer to Q um um should uh it should be noted that um all of these so even though on the right hand side um we have average over the X we have an average of a data set right so these are still random variables this still depends on um this this bracket here depends on well it's a it's an average over the patient posterior and the patient posterior itself is constructed from the data set and the data set Is Random so so this is these are still random variables okay and we have some other quantities um I should not have so this is that red comment about still apply for the one below this is also computable using data so let me just write out the expression and um okay so the square goes inside and outside um so this should remind you of um um variants of sorry empirical of well of negative log um of the log loss but because well let me just put it in of the log loss right because negative does matter in variance it's got squared away um and then there's of course the um um the average version of that so um sorry I said something wrong so that's the variance and it's not even empirical um no empirical um so it it's under um the posterior right so um so this whole thing is equal to variance under the posterior of negative about key x w um it's just that we average the variance over the data set and here we just average we do the same thing except we take the data set average to the theory
coverage again the negative doesn't matter okay when you write that equality it's a equality VW I mean the sum over R is not still there or do you just mean that the sum end is equal to that yeah x i yeah okay so each term in the summary is equal to the variance of the um log loss at x i okay um I should also note that these quantities um defined in this paper are slightly different from the one in the gray book and those things in the Gradebook are normalized right so they are compared where where their values are compared to the to the truth or to the optimal kind of distribution so um that's just a warning for reading but it shouldn't affect us too much as long as we keep out um be careful about our population so there are some other direct sort of theoretical quantities um and sort of theoretical relationship to sort of classify uh models so the first one is um we have the first one is the um expected block loss right and well that that thing it so is expectation of um a quantity that we have seen before um the did that affect um version right and we can um we can normalize this with the entropy of the true distribution um I'm going to absorb the negative into that into one of the next um so that we have that becomes the KL Divergence between the truth and the model at the distribution of the model at I believe which is equal to LW minus s um that in itself is an expectation of um that and that in itself is normalized by the empirical version um of from the entropy and just replace integral with sums of the data right so this dkl is equal zero is equals zero only if so um definition truth Q of X is realizable by model p x w right so that means um there exists w0 in the parameter public parameter space such that Q of x equal to p x w naught um if not realizable we turn instead to to talking about not the truth itself but to the optimum so w naught will always denote will always be defined as the set of parameters such that the log loss achieved its minimum right so um parameter space w Optimum W naught um Q could be outside of w right sorry now fix a w naught in this set of optimum optimum parameters um and set P naught of X should be equal to the distribution at this Optimum parameter so just choose one so this is again so realizable another way of saying that is that P naught of X is equal to Q that's easy to see um but we also want the definition essential another property we could Define is essentially uniqueness which is saying for all other optimal parameter in
the entire the distribution is the same so you could have different distribution achieving the same care Divergence between the distribution and the truth uh but we don't want that for um at least in the set of optimum distribution we don't want that to happen so this is called essential uniqueness um and we will assume this throughout um this paper this campus is essential uniqueness it's saying that for for all these two printer uh-huh that those are not true parameters those are optimal parameters okay okay yes so it's p naught X will be the two distribution no Pinot X is one optimal distribution so I fixed a ww0 pick one pick any one of them um essential uniqueness it's saying that it doesn't matter which one I pick which which optimal parameter I pick it will all Define the same distribution that's essentially uniqueness I see it okay yeah okay so so this is It's not realizable it's defined yeah yeah okay correct yeah okay um then we can use um we can use that optimal distribution to Define some normalized quantity so previously we have um this uh this this function that measure the the loss at a particular um parameter value we normalize that and call it f o yeah I wrote something wrong in my notes that should be normalized by the optimal not normalized by the truth so if you are reading the gray book P naught is just Q if you're reading the green book or in the unreal not unrealizable setting um just Swap all quantities that reference to Q in the theoretical quantity um and replace that band Keynote so lmw by the way yeah yeah there is a there is a typo in the essentially the Nicholas that we want to put at P it's given w0 Prime thank you thank you thank you yeah that's that's very important I mean yeah okay cool thank you right um so lmb becomes um KN and it's defined to be so I'm using the same letters the K's um is the same except now the cues are replaced by um p-nots so LW sort of normalized to K of w which is um the main quality of Interest well L naught oh what happened there L naught is equal to l that the the average loss at the chosen parameter but it doesn't really matter which parameter we choose that is just all equal to the up mean so L naught is actually a constant on w0 okay and um the evidence normalized to that so instead of integrating with nln we the exponent becomes K and instead okay and um so we've owed this um we say that the true distribution is regular um for P of x w is the optimal set is a Singleton so nothing about realizable here so it's
regular for for W2 if the optimal set is a single turn and um well the sort of the Hessian of L at w0 um is positive definite um and singular for the model um otherwise okay um I this is a bit boring ad that I took took a lot of time um and I think this is uh the first possibly new thing um that's go ahead and Define the generating function for a learning question so the idea of generally function I think we are familiar with we encode quantities that we care about as coefficients in the power series wow um I'm going to write down the depth the definition so the general um function with parameter Alpha is equal to thank you okay uh sound remark um so that expectation uh means um so in the paper it says that it's an expectation of a that off that so um the data set features in in here right so that's what goes into constructing that KN um variable so that came is a random variable of the data set and then there is this new test variable here um which is a new data point um and we want to know the prediction on that new test ing point so I guess the data set is a training data set and this is the testing data set and this expectation is over o n plus 1 number of the uh of data so the original n plus a new test data set and I'm going to be explicit about that and say that it's actually taken in this order um so the inner like everything here is undermount assumption these are all commute they can commute and their iids and E sub n always means expectation over the entire data set um so but in the green book um another thing a different expression is defined is um in the green book uh sort of a g and of alpha is defined as the cumulon generating function um of of this random variable okay what's the cumulative generating function well um humanitarian function for any random variable y is equal to um log expectation over um the law of Y otherwise so that's the moment generating function and cumulation internally function is just a log of the normal generating function well it encodes um the so moment generic rating function encodes the moment and the first moment is the mean the second moment is related to the variance third moment Etc uh cumulon while the first moment a first cumulon is the mean um second moment is actually the variance um and higher moments a high higher cumulans actually uh different uh not easily interpretable and um there's a statistical physics connection if we are using the cumulent generator function I I should complete the thought before
so I think the green book way is more natural just to Define it as a chemical engineering function of FX that's because um so it's a human engineering function of f of x under the posterior and you can see the in the above expression here this is actually equal to um uh let me write it here this is equal to the excitation um of a negative log of the posterior average e to the negative Alpha f x w well Excel is not quite right because um the the normalizing the evidence the normalizing constant is is not there um and the the if you if you put it put it back in um then uh you you get the right thing um and in the green book it doesn't um it doesn't immediately take the expedition of the old data set to keep the exposition I think a bit slicker so almost equal anyway uh the statistical physics connection is that um we have W is like the microscopic um variables by the way this is my interpolation if someone wants to dispute that please do because um as I said this is not tight um we have that's defined h n to be n k n of w so um just removing the one on N averaging before that's like the hamiltonian HW in statistical physics the posterior um well if you use hn that looks like except there is about the the the the the the prior there but it's like the um so equally neighboring distribution um or both men um well and then the the cgf um of f of xw um at Alpha that if you use the green boot formulation that looks a lot like the partition function um which is z equal to in statistic physics is equal to um Quantum beta log e to a negative B H um foreign right I guess Alpha is beta um yeah so okay uh so in physics you you differentiate respect to Beta to get the usual uh to get the quantities that the state variables that we want the energy in ETC so um in moment generating function you differentiate respect to Alpha so I guess the connection here is like Alpha is beta and then the beta we use for inverse temperature but we use for temperature in um in our posterior distribution that's something else yeah so yeah there's a bit of inconsistency that right sorry useful dictionary thanks okay uh okay so I'm going to spend um some time just Computing with this and see why why we care about what why we Define this function let me walk to the next set of boats uh all right let me so let's just start Computing with this function um so that's equal to e n e x um negative log so I'm just reporting reproducing this here in a sort of rewriting in a slightly different form foreign just to make the fact that it is
um it is a posterior average very very clear so if you just multiply and divide by the normalized evidence you you can rewrite this as [Music] um so I'll think of multiplying and dividing by by that and taking that inside here you that's that's actual that becomes a posterior average um and then you acquire uh another term of the front so this becomes posterior average class um right and okay first of all this um the second term the ex doesn't matter because the random variable inside doesn't depend on it and if you recognize that phone that's just the three the free energy about the normalized free energy yeah so that second term is the expectation of the free energy you're a nice very much all right um some some value of Interest so FN evaluate at zero well you um so that term goes away and you are just evaluating the um the evidence and that's the free energy that's the second term right so so this term is actually equal to FN have a script FN at zero so and this is equal to oh great okay um what about fn1 at one well that's equal to sorry oh should I put this here sorry substitute Alpha equal to 1 in there you get uh X of negative log the average of e to the um okay I'm going to expand the F function and o plus n on zero because there is that annoying to concern that well this is equal to the E and logs cancel and P naught doesn't depend on uh doesn't depend on W I should just annotate this this little n everywhere in the bracket because that will be important later um so the keynote doesn't doesn't depend on W so it doesn't it is not affected by the posterior average so you can put it out and then it becomes um um so if you pull it out and then take the negative log of p p naught of x negative log of that term and you take the average over the X average over the truth using the true distribution well you get l0 you get negative l0 so that's that's from pulling out the P naught and then what's left in the bracket is just um uh Plus e n e x of negative log that's what's left in the bracket well and this this inside is BG so that's equal to so we have uh we have our equation so therefore data average of the Bayesian generalization error is equal to l0 Plus um that gets right minus oh yeah so fn1 minus and zero okay um all right so you can continue in this fashion and derive um derived formulas about the base of the verbose in terms of um [Music] this generating function evaluate at various points so let me just um not too too much more calculation you can see some of the details in the
notes um but let me sort of write out a few so that's equal to the training version while the training version um is FN minus one one plus beta minus minus one at Theta um so the logic behind that is that to get a training version you um you actually I'll actually do a calculation for the Gibbs training error and that will illustrate the idea of how how you how you do that um but let's go to the next board um so what's important uh what's important or generating function is that you can differentiate and um you can differentiate and extra quantities well the first derivative of the generic function is okay under the assumption that you can exchange derivative and the expectation operators and under our usual usual assumption where main theorems of singular learning theory goes through that that will be true so you bring the derivative into the expectation integrals and differentiate the logs and that just sort of interrupted when uh this is really valuable and I hope you finish but I'm going to have to go I'll leave the recording going it's just I've just gotten a message and I have to jump up so um just go as long as you like and finish what you have to say and the recording will catch it okay cool yeah thanks I'll I'll see you later all right I promise to stop by 5 30. um Okay so you get you get that expression and um as with um as we move communal engineering function or generating function you you want to evaluate the um So eventually you want to evaluate the its Taylor series at zero um so let's evaluate the first derivative at zero what you get is right and um quickly to do the second derivative um use the product rule or the quotient rule so if you do the question every you will see you know that you bring down the F by derivative by Alpha and the second one you get the average of that square um instead and that's a very reminiscent of a variance formula so if you evaluate at zero uh you get what do you get you get excitation um at negative f squared plus f acilitation squared well that's equal to negative of um e of Y G remember that's one of our ways of observable the functional variance is called the dimensional variance okay um okay um I'm going to do just one more to uh to illustrate a few more tricks that's involved in getting these quantities and then we are going to go to the sort of General theorems about this but so so next we want to show so so we want to show uh the next Formula which is um that are training the Gibbs training from it the training version is always
um is usually difficult it's usually more difficult uh just to remind ourselves what that is um that's equal to uh or what's the Gibbs training formula again it's equal to that well let's simplify that a little bit again oh by the way so this is uh this is this is e so that's that's e e n and e x in there but uh dot but G of T itself doesn't depend on that new testing data set so the the ex drops out um n uh this is a finite exam so just pull it out uh so you can pull out the finite sum um over um over the the posterior averaging and averaging over the data set but the posterior average and the data set average they definitely do not commute right because the posterior itself is a random variable of the data so you can't push it in any further what you can how you can do how you can still observe is that the x i so since X i's are ID so they all come from the same distribution and they are all independent it means that um uh let's look at each term in the summon well uh uh well it means that there's a each term after you take um excitation over the entire data set all that all those term is the same uh so that is actually equal to negative 1 on n uh times n times e n of log of p x i given w M for any I right so every time in that um is is a term for a different x i but every time the the posterior average depends on the entire data set but internally inside that if you write out all of them they are completely interchangeable so for any eye that's the same term so you just get the same number times n and the end cancel so the whole thing is equal to um negative e n log and let me just choose an I let me just choose it to be x n um there okay all right so we want to prove that we want to show that this we want to show that this GT is equal to it's related to the generating function um is equal to l0 plus the general the first derivative of General function at n minus 1 um wait and evaluated at beta uh so let's just compute um that's derivative at beta so we have our expression of the first derivative there so substituting um n minus one so that's n minus 1 and then you have um that but be careful that this is um average over the posterior is now constructed with only n minus 1. there's a there's an end dependency in the posterior itself so you only use n minus one you don't use the last um data point okay and uh and then you just start unwinding definition and well remember what f of x w is That's the Law quotient of the truth over the model and again we use the same trait the the
truth doesn't depend on uh on the model parameter therefore it doesn't it can be pulled out from the posterior average so you get a log P naught X here um but you have another term that's the numerator again those are o and minus one and let's look at what the the bottom the bottom term well the bottom time is equal to um so make the exponential and the log cancel you get um to power beta divided by the optimal distribution to power beta n minus one uh well again the optimal distribution doesn't depend on the model parameter so you can can be pulled out um and this term up here well we are going to take um okay so this term here so what is that equal to that's that's equal to well that term is the same as the numerator right so uh so that term in the numerator completely cancels so our first term here is just this term and what is that term after taking expectation by e x um you get uh you get the l0 right that's um that's negative l0 so that's the first term handled uh minus um so what's the second term um um the second term is equal to or the numerator is p x w to the power of beta that comes from this exponential here um log p x w n minus 1. um well it's divided by the numerator but let me uh um so divided by the numerator of p x type it into the Beta now the trick is to see that this term so what is this term this term is equal to um so I'm going to write the posterior in the usual way without using the exponential so that's I equal to 1 to n minus 1. um prior DW and well these are x i Okay so so uh so if you beautiful okay so that's equal to the idea is that you absorb this inside here and what you what you get is um you are missing the X and term but you add in another the test data term which is which is of the same uh with the same distribution and that recovers the the evidence for for that n right so that's equal to um so that's actually equal to that n I wrote something different in my notes so I think I think this is the correct logic though okay um and and you and then there's this thing to um care about which is well this is uh en but without the expectation of the entire data set but without the last one but you can replace that with um uh you can sort of treat the new test data as a training data as the um us playing the role of XM and if you write out this integral here what you get is negative l0 minus en of log p x n w n now this is posterior average over om okay I don't think that's very clear so the idea is that you you write out the integral and notice the um the Symmetry
between the test data and the last data point and you can um and also remember that you are taking averages over um you are later on taking averages over the ID data data run a variable anyway so you can treat them and that's that recover [Music] shift sorry that's not very clear but I think uh you should just write out the integral and I'll make a meeting notes available sorry I think um we might be a bit running running out a little bit of time but uh this I think I'll just make this last point which is um which is that in the formulas we have for sorry what do we have at the moment we have we have e of BG is equal to l0 plus um fn1 minus f n at zero and then we have similar things if you push through the calculation you have so the keep track of how many um variables are there in the right hand side so so some of this we didn't calculate but some the calculation isn't very hard um some of them are harder that's wrong that is beta the second derivative um up here as you can see that's the second things the Y are variances and that's related to um the second moment the second derivative therefore okay so we have we have all this kind of formula all this formula but it it looks like um it looks like we have just replaced um six formula wave one two three uh six variables and express it in terms of one two three four five six seven eight nine variables um but the trick is if you do Taylor expansion which is why we are defining um generating function in the first place if you do the Taylor expansion um sorry bye Taylor's theorem so we don't actually need it um to be to be convergent to the type of convergent power series we just need is this function is um three times continuously differentiable then for o n for o beta that exists an alpha star so that means telestrium with with a with a remainder term inside um so inside the region that I want to evaluate this Taylor series well the the looking at the quantities of both um I'm I care about uh starting from zero and there's one the largest is one plus beta so there's the upper star in that region such that f zero Alpha Plus um forgot about the half of a square plus 1 6 3 and Alpha star for the last remainder term of a cube so we have this equality and you can replace um you can replace some of these f1s um with the corresponding Taylor Taylor series approximations provided well so if this remainder term there the Supreme over all possible Alpha stars in that in that interval is of little o of um of asymptotic asymptotic
order smaller than n to the gamma well then you can start to replace um for example um you can start to replace for example f n of 1 by just the first three terms um with a penalty of uh one on into the gamma um uh inaccuracy and okay but there is also the f n minus one's terms so we also need the fact that the difference between those is also letter of one on Angelica um I think I'll refer you to the notes and the paper for the precise um conditions but the main idea is that the remainder term is of all the smaller than enter the gamma for some gamma and the difference between n minus 1 and FN is smaller than the same order as well um and then you can which means that the six variables the six observable on the right now reduce down to it depends only on three variables that variable that variable and that variable foreign so I think I'm just going to say sort of the point of all this verbally so now we actually have um now we have a sort of a similar situation to um the Central limit theorem where um there are some details if you um if you have larger and larger data set and you aggregate more and more of them these variables that the base of the vote that depends on the details of the trip triple the the truth the model and the the prior so day depends on um everything in the triple but as n goes through as n when n is large enough somehow this sixth variable only depends on three of them which means that you can there's actually a linear relationship between them which is uh which is therefore called um the equation of states which um in analogy to physics um there is obviously uh and you can move today right a little bit walk to the right a little bit oh yeah thank you okay yeah um so these are the equation of states and the idea is that in statistical physics the the sort of details of the system so whether or not we are talking about uh the ideal gas um or um uh or monoatomic ideal gas or um molecules that can be treated and we are in a low enough regime energy regime that the internal degrees of freedom of the molecule doesn't matter or we are talking about an easy model and things like that so all that details doesn't matter as long like they satisfy the same um thermodynamics properties they have the same thermodynamics properties if they satisfy the same equation of states so in in learning uh in learning model we have this kind of a similar situation where um the details of the learning machine and the detail of the truth doesn't matter as long as they have the same well as
long as they have the same L naught have not and F no right so these are the variables that sort of uh control um the the learning Behavior that's why I thought that's that's weird actually if you write if you write out the um if you do the tele series expansion let's substitute back in the the first um the FN node sort of cancels so so it depends only on these three things and the the result in the gray book and also the result in this uh in in the later section of this paper the randomizability condition says that if you're in a renormalizable then um gamma is equal to one meaning that you are you satisfy the um the older conditions just now with gamma equal to one and you are defined by and in that case you have um you have um F Prime at 0 equal to Lambda over beta plus Nu uh sorry of new and that is equal to two new over Lambda is the RCT and mu is another virational invariance um called the uh uh stochastic actually what is it called again um it's it's a variance term right so okay I think I'm going to stop here and I can we can talk about implication next time and the representative sorry for going way over time um uh any questions yeah how about this FM Prime beta yeah so so the the so if you go back to let's go back to the second to last board um the idea is that you let's say okay let's say we want to let's let's re-express um the second second term here so uh how do we do that so you have um so f n minus one um One Plus beta well what's that equal to according to the Taylor series that's equal to f n minus one at zero uh plus f n minus one at zero times one plus beta uh plus F sorry that's a prime that's a the derivative that's a second derivative and minus one zero at 1 plus beta squared on two and then plus you know because of the linear learnability assumption plus an error that is smaller than um than one on N to the gamma you do the same thing for f and minus 1 of beta [Music] um actually let's let's forget that first you you first say say that using the additional assumption um which is this assumption here that the difference between the N minus one and and N version is um uh is also out of other M1 one on enter the gamma you can just erase the all the a minus one in here with the same error term um and then you do the same for the uh for the second term here take that take the difference and then you you now express everything in terms of that term and that term right that's the idea okay okay so so only only these two terms matter the first time actually always
cancels out because that's the that's the expectation of free energy if you pick okay so just just for the the whole story so you want to know the sixth variable is observable that's the thing we want to know then so please go ahead yeah yes I've got gone you you can oh tell me right so this and this generating function makes this sixth variable into this three yes so in fact the um so if you have a different truth or different model meaning a different learning uh statistical learning uh problem then you will have a different um you will have a different generating function right the generic function depends on the truth depends on the model depends on the prior and if you do if you do a Taylor expansion up to a high order up to a 100 then you have this F F and um Prime uh the derivative of F and 100 of them but here the the learnability assumption is saying by the time you get to order three you are in the um you are in the one on N region meaning you are decaying right the the if if you are talking about large number of particles large and large number of um training data then that starts to not matter so which means that um which means that only the first three turns are matter and that means that any statistical learning problem with the same it doesn't need to be the same generating function you just needs to match um the like match the tree jet of the um of the cumulative engineering function so if your Taylor series match up to up to three terms then you are in the same Universal universality class like if you call screen in the sense of taking larger and larger data sets looking at larger and larger realization then you flow to the same fixed point and those fixed points uh so we didn't discuss this today but those six points are parameterized like the central limit theorem is parametrized by the mean and covariance of the gaussian well the fixed point for our learning theory is parameterized if you have gamma equal to one is parametrized by the ROCT and that new term that variance term by two by two by two by two numbers uh and those numbers are uh yeah so you you those two numbers characterize the fixed point of this um randomization flow I'm using randomization in a very um conceptual non-rigorous way which I guess the physics people does it as well so any other question oh so so I'm pointing out the fact that there is six observable and three um just depending on three terms modulus some irrelevant error I'm pointing that out as evidence that we have