WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed. The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.

Synopsis


This seminar explores how statistical physics and Occam's razor can be used to select models in Bayesian learning. It introduces the Jeffreys prior on the space of probability distributions, arriving at it through the Chernoff-Stein lemma and through information geometry. It also uses the Kullback-Leibler divergence and its Taylor expansion to define a metric on parameter space, from which the Jeffreys prior is built. The razor model selection criterion uses an asymptotic expansion in the sample size to compare models and determine which is most suitable.

Short Summary


This paper examines how to use statistical physics and Occam's razor to select models in Bayesian learning. It considers the space of probability distributions itself, rather than how the distributions are parametrized, and sets up a scenario with a set of N events drawn i.i.d. from a true distribution. Predictions are made by taking a posterior-weighted average over each set of distributions, and to determine which of two sets is better, the model evidence of each set is compared. The paper introduces the Jeffreys prior on the space of distributions by discretizing the continuum of distributions, and justifies its use.
The Jeffreys prior is a prior on parameter space that makes every distribution in the model equally likely. The paper arrives at it by asking whether a finite number of observations can tell one distribution from another, using the Chernoff-Stein lemma. It can also be derived from information geometry. However, this prior may not reflect a reasonable assumption about learning in practice, since the search for the truth (say, in a neural network) does not treat all parameters as equal. The speaker suggests that the prior corresponds to the density of states in a physical system, and notes that the paper's emphasis comes from a tradition where the discrete case, in which the uniform prior is easy to define, is analyzed first.
The Jeffreys prior is built from the Kullback-Leibler divergence between two nearby distributions: its Taylor expansion around a point has a leading quadratic term given by the Fisher information matrix, which defines a metric on parameter space. The Jeffreys prior is the canonical choice obtained by taking the square root of the determinant of this metric and normalizing it into a density. In Bayesian learning, the posterior distribution is proportional to the prior multiplied by an exponential weight whose exponent is the negative log likelihood. This is analogous to statistical physics, where the Boltzmann weight is the exponential of negative beta times the Hamiltonian, multiplied by the density of states, which plays the role of the prior. The partition function is the normalizing constant for the Boltzmann weight, and the model evidence is the partition function at beta equal to the sample size n.
The partition function is a generating function for physical quantities in statistical physics, and the Boltzmann distribution is the normalized Boltzmann weight, which corresponds to the posterior. The free energy is defined as the negative log of the partition function, and the average internal energy is obtained either by differentiating the free energy with respect to beta or by integrating the Hamiltonian against the Boltzmann distribution. The heat capacity is obtained from the free energy F as negative beta squared times the second derivative of F with respect to beta. The Hamiltonian is the average negative log likelihood, that is, negative one over n times the sum of the log likelihoods, which converges almost surely to the KL divergence between the true distribution and the model distribution at theta, plus the entropy of the true distribution.
The speaker draws analogies to statistical physics by discussing annealed and quenched averages, using SLT notation where relevant. The Hamiltonian is composed of two terms: a first term which does not depend on the sample, and a second, sample-dependent term that has the form of a random variable minus its mean and vanishes as the number of data points increases. The Laplace approximation gives the free energy as the ground state energy term plus a logarithmic term and a normalizing constant. This is interpreted with a geometrical lens, and asymptotic consistency is guaranteed for regular models if the sample is large enough: the best guess parameter will be the one closest to the truth.
Model selection involves comparing two models with different accuracies or energies. If the sample size is large enough, the energy term dominates, so the model with the lower ground state energy wins. The complexity of the model is measured by the (d/2) log n term, where d is the number of parameters, which is related to the logarithm of the determinant of the covariance matrix. The robustness of the model is determined by the Hessian at the optimum, which is a positive definite matrix. Its eigenvalues determine the axes of an ellipsoid, and the flatter the minimum, the larger the volume of the ellipsoid, indicating that many distributions in the model are close to the truth.
The razor is a model selection criterion used to determine the best model to use for a given phenomenon. It involves computing a quantity R_N, whose asymptotic expansion in the sample size involves the ground state energy at the optimum and the number of parameters. This acknowledges the trade-off between accuracy and complexity, and is more careful than other criteria in modern statistics. The SLT version also includes a term of order log log n, usually a constant, and the expansion further accounts for the degrees of freedom, the robustness and the size of the distribution space. Information criteria are used to compare the models and select the most suitable one.

Long Summary


This paper looks at model selection in Bayesian learning in terms of statistical physics and Occam's razor. It works on the space of probability distributions and does not focus on how that space is parametrized. It sets up a scenario with a set of N events drawn from a true distribution. The question of model selection is: given two sets of distributions, A and B, which one is better for learning the true distribution? The paper gives a geometric formulation of the principle of Occam's razor and discusses phase transitions.
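Bayesian model selection compares sets of hypotheses by their probability given the observed events. Bayes' theorem expresses the probability of a set of hypotheses given the evidence in terms of the prior probability of the set and the probability of the events under that set, which is known as the model evidence. To predict the probability of a new event, a posterior-weighted average is taken over the distributions in each set. Since the prior probabilities of the sets are subjective and the probability of the events is a shared normalizing constant, the comparison reduces to picking the model with the higher model evidence. This paper focuses on the set of distributions rather than the parameters of the distributions, and the choice of prior is a perennial debate.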
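As a minimal sketch of this comparison (not taken from the paper), consider a sequence of coin flips and two hypothetical models: A, the full Bernoulli family with the Jeffreys prior Beta(1/2, 1/2), and B, a single fair coin. The function names and the example counts are assumptions chosen for illustration; the evidence for A uses the standard Beta integral.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_evidence_bernoulli_family(k, n):
    """log P(E | A): a specific sequence with k heads in n flips,
    averaged over the Bernoulli family with prior Beta(1/2, 1/2)."""
    return log_beta(k + 0.5, n - k + 0.5) - log_beta(0.5, 0.5)


def log_evidence_fair_coin(n):
    """log P(E | B): the same sequence under a single fair coin."""
    return n * math.log(0.5)


def log_bayes_factor(k, n):
    """Positive values favour model A, negative values favour model B."""
    return log_evidence_bernoulli_family(k, n) - log_evidence_fair_coin(n)


for heads in (70, 50):
    print(heads, log_bayes_factor(heads, 100))
```

With 70 heads out of 100 the larger family has the higher evidence, while with 50 heads out of 100 the fair coin wins, because the larger family spreads its prior mass over distributions that fit the data poorly.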
The paper discusses how to apply an uninformative prior to the space of distributions, which leads to the Jeffreys prior. This is done by discretizing the continuum of distributions: two parameters are identified if their distributions cannot be told apart at a fixed resolution level, a notion known as statistical resolution. The paper's original contribution is this justification of the Jeffreys prior.
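The argument for the Jeffreys prior asks whether, given a finite number of observations, one can tell with a given confidence which of two distributions they came from. Distributions that cannot be distinguished are grouped into equivalence classes, each class is assigned equal probability, and the limit of infinite sample size and vanishing tolerance is taken. The paper uses the Chernoff-Stein lemma for this. Alternatively, information geometry together with the requirement of invariance under a change of coordinates leads to the desired prior. This prior is theoretically nice, but in practice it may not reflect a reasonable assumption, since the process of searching for the truth does not treat all parameters as equal.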
The speaker agrees, and sees this as a major roadblock to recognizing the possibility of singularities, since the Jeffreys prior is not defined at singularities. The paper comes from a classical probability theory background, where the discrete case is analyzed first and the uniform prior is easy to define. Another participant has the opposite orientation, not caring about the prior because it disappears in the asymptotics: for them the prior only shows up when doing integrals and is a nuisance if it does not have a good form. The speaker suggests that the prior corresponds to the density of states in a physical system, where degeneracy is important.
The Jeffreys prior is built from a measure of the difference between two distributions. The Kullback-Leibler divergence is Taylor expanded around a point: the constant and linear terms vanish, and the quadratic term is given by the Fisher information matrix, which defines a Riemannian metric on the parameter space. However, this does not work in neighborhoods where the Fisher information matrix is singular.
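The quadratic approximation can be checked numerically. The sketch below uses a Bernoulli model as an assumed example (it is not one worked in the talk) and compares the exact KL divergence with one half of the Fisher information times the squared parameter difference.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def fisher_bernoulli(theta):
    """Fisher information of the Bernoulli model at theta."""
    return 1.0 / (theta * (1.0 - theta))


def quadratic_ratio(theta1, delta):
    """Exact KL(theta1 || theta1 + delta) divided by (1/2) J(theta1) delta^2."""
    exact = kl_bernoulli(theta1, theta1 + delta)
    quadratic = 0.5 * fisher_bernoulli(theta1) * delta ** 2
    return exact / quadratic


for delta in (0.1, 0.01, 0.001):
    print(delta, quadratic_ratio(0.3, delta))
```

The ratio approaches one as the two parameters get closer, which is the sense in which the Fisher information matrix is the leading term of the expansion.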
The Jeffreys prior is the canonical reparametrization-invariant prior: it is the square root of the determinant of the Fisher information metric, normalized into a density. This prior is not supported wherever the Fisher information matrix is singular, and the normalization constant may not exist if the parameter space is not compact. The speaker then turns to the connections between Bayesian learning and statistical physics, following Lamont and Wiggins (2019).
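A small sketch of the reparametrization invariance, again assuming a Bernoulli model for illustration: the total volume, the integral of the square root of the Fisher information, comes out the same in the coordinate theta and in a hypothetical angle coordinate t with theta = sin(t)^2. The helper names are chosen here and are not from the paper.

```python
import math


def sqrt_fisher_theta(theta):
    """sqrt(J(theta)) for the Bernoulli model in the coordinate theta."""
    return 1.0 / math.sqrt(theta * (1.0 - theta))


def sqrt_fisher_angle(t):
    """sqrt(J) in the coordinate t, where theta = sin(t)^2.

    The metric transforms as J_t = J_theta * (d theta / d t)^2.
    """
    theta = math.sin(t) ** 2
    dtheta_dt = 2.0 * math.sin(t) * math.cos(t)
    return sqrt_fisher_theta(theta) * abs(dtheta_dt)


def midpoint_integral(f, a, b, steps):
    """Midpoint rule; it never evaluates f at the endpoints."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


volume_theta = midpoint_integral(sqrt_fisher_theta, 0.0, 1.0, 200_000)
volume_angle = midpoint_integral(sqrt_fisher_angle, 0.0, math.pi / 2, 1_000)
print(volume_theta, volume_angle, math.pi)
```

Both integrals approximate pi (the theta version converges slowly because of the integrable endpoint singularities), and in the angle coordinate the Jeffreys density is constant, so the same prior looks uniform in one chart and non-uniform in the other.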
In Bayesian learning, the posterior distribution is proportional to the prior multiplied by an exponential weight whose exponent is the negative log likelihood. This is analogous to statistical physics, where the energy or Hamiltonian is a function of the state variables, here the parameters, and the sample size n plays the role of the inverse temperature. The prior corresponds to the density of states. The Boltzmann weight is defined as the exponential of negative beta times the Hamiltonian, multiplied by the prior. The partition function is the normalizing constant for the Boltzmann weight, and the model evidence is the partition function at beta equal to n.
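The dictionary above can be made concrete with a short numerical sketch. It assumes a Bernoulli model with the Jeffreys prior and example counts chosen here for illustration, and checks that the partition function evaluated at beta = n reproduces the model evidence known in closed form.

```python
import math


def hamiltonian(theta, k, n):
    """Average negative log likelihood of n coin flips containing k heads."""
    return -(k * math.log(theta) + (n - k) * math.log(1.0 - theta)) / n


def partition_function(beta, k, n, steps=20_000):
    """Z(beta) = integral of prior(theta) * exp(-beta * H(theta)).

    Integrated with the midpoint rule in the angle coordinate t, where
    theta = sin(t)^2 and the Jeffreys prior has constant density 2 / pi
    on the interval [0, pi / 2].
    """
    h = (math.pi / 2) / steps
    total = 0.0
    for i in range(steps):
        theta = math.sin((i + 0.5) * h) ** 2
        total += (2.0 / math.pi) * math.exp(-beta * hamiltonian(theta, k, n)) * h
    return total


def exact_evidence(k, n):
    """Closed-form evidence B(k + 1/2, n - k + 1/2) / pi."""
    log_b = math.lgamma(k + 0.5) + math.lgamma(n - k + 0.5) - math.lgamma(n + 1.0)
    return math.exp(log_b) / math.pi


k, n = 14, 20
print(partition_function(n, k, n), exact_evidence(k, n))
```

At beta = 0 the Boltzmann weight reduces to the prior and Z equals one. Increasing beta towards n concentrates the weight around the parameters with low energy, which is the posterior.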
The partition function is a generating function for physical quantities in statistical physics. The Boltzmann distribution is the normalized Boltzmann weight, which corresponds to the posterior. The free energy is defined as the negative log of the partition function. The average internal energy is obtained by differentiating the free energy with respect to beta, or equivalently by integrating the Hamiltonian against the Boltzmann distribution. In the learning setting, the average energy is given by the derivative of the free energy with respect to n. Beta is a continuous variable in this case, whereas n is discrete.
Internal energy is related to generalization error, and heat capacity is obtained from the partition function. Heat capacity describes how much heat is needed to raise the temperature by one degree, and equals negative beta squared times the second derivative of the free energy F with respect to beta. This is related to the Bayesian generalization error, which arises from the discrete version of the derivative, since n only takes integer values.
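These definitions can be exercised on a toy system. The sketch below assumes a quadratic Hamiltonian H(theta) = |theta|^2 / 2 on R^d with a flat prior (an illustrative choice, not the model in the paper), for which Z(beta) = (2 pi / beta)^(d/2) in closed form, and recovers the internal energy and heat capacity from the free energy by finite differences.

```python
import math


def free_energy(beta, d):
    """F(beta) = -log Z(beta) with Z(beta) = (2 * pi / beta) ** (d / 2)."""
    return -0.5 * d * math.log(2.0 * math.pi / beta)


def internal_energy(beta, d, eps=1e-4):
    """U = dF / d(beta), estimated with a central difference."""
    return (free_energy(beta + eps, d) - free_energy(beta - eps, d)) / (2.0 * eps)


def heat_capacity(beta, d, eps=1e-3):
    """C = -beta^2 * d^2 F / d(beta)^2, estimated with a central difference."""
    second = (
        free_energy(beta + eps, d)
        - 2.0 * free_energy(beta, d)
        + free_energy(beta - eps, d)
    ) / eps ** 2
    return -beta ** 2 * second


print(internal_energy(2.0, 3), heat_capacity(2.0, 3))
```

For this toy system the internal energy is d / (2 beta) and the heat capacity is d / 2, independent of temperature: each quadratic direction contributes one half.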
The Hamiltonian is the average negative log likelihood, that is, negative one over n times the sum of the log likelihoods of the samples. By the law of large numbers, this converges almost surely to the KL divergence between the true distribution and the model distribution at theta, plus the entropy of the true distribution. The part that does not depend on the data set is isolated and defined in terms of the Kullback-Leibler divergence between the true distribution and the model distribution at theta. The remainder has the form of a random variable minus the mean of that same random variable.
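A quick simulation illustrates the convergence. It assumes a Bernoulli true distribution and model, with parameter values and a seed picked here for illustration only.

```python
import math
import random


def empirical_hamiltonian(samples, theta):
    """H_n(theta) = -(1/n) * sum_i log p(x_i | theta) for a Bernoulli model."""
    total = 0.0
    for x in samples:
        total += math.log(theta) if x == 1 else math.log(1.0 - theta)
    return -total / len(samples)


def kl_bernoulli(t, theta):
    """KL divergence between the true Bernoulli(t) and the model Bernoulli(theta)."""
    return t * math.log(t / theta) + (1 - t) * math.log((1 - t) / (1 - theta))


def entropy_bernoulli(t):
    """Entropy of Bernoulli(t) in nats."""
    return -(t * math.log(t) + (1 - t) * math.log(1 - t))


rng = random.Random(0)  # fixed seed so the run is reproducible
true_t, theta = 0.3, 0.6
samples = [1 if rng.random() < true_t else 0 for _ in range(50_000)]

limit = kl_bernoulli(true_t, theta) + entropy_bernoulli(true_t)
print(empirical_hamiltonian(samples, theta), limit)
```

The difference between the two printed numbers is the sample-dependent remainder, a random variable minus its mean, which shrinks as the sample grows.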
The Hamiltonian is thus composed of two terms: a first term which does not depend on the sample, and a second term which fluctuates with the sample and vanishes as the number of data points increases. The first term is called H_0 (H naught), and the second is called the quenched Hamiltonian. This second term is like a contribution from random defects, which are introduced to a lattice to make it a disordered system. An annealed average is calculated by taking the average of the Hamiltonian over all possible samples or defects.
The speaker discusses how to make analogies to statistical physics by talking about annealing or quenching, using SLT notation. The assumptions are a unique ground state, a positive definite Fisher information metric, and local minima whose energy is bounded away from the minimum. These ensure that in the large n limit the integral is dominated by the neighborhood around theta hat.
In the asymptotic regime, n is large enough that any state away from the ground state has exponentially small probability. The Laplace approximation then gives the free energy as the ground state energy term plus a logarithmic term and a normalizing constant. The free energy is interpreted with a geometrical lens, and asymptotic consistency is guaranteed for regular models if the sample is large enough: the best guess parameter will be the one closest to the truth.
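The Laplace approximation can be compared against an exact free energy in a case where both are available. The sketch below assumes a Bernoulli model with the Jeffreys prior, chosen here for illustration. In this example the Hessian of the Hamiltonian at the maximum likelihood estimate equals the Fisher information, so the determinant terms cancel and only the energy term, the (d/2) log(n / 2 pi) term and the log of the total Jeffreys volume (pi for this model) remain.

```python
import math


def exact_free_energy(k, n):
    """-log evidence of a sequence with k heads in n flips, Jeffreys prior."""
    log_b = math.lgamma(k + 0.5) + math.lgamma(n - k + 0.5) - math.lgamma(n + 1.0)
    return -(log_b - math.log(math.pi))


def laplace_free_energy(k, n):
    """n * E_0 + (d/2) * log(n / (2 pi)) + log V, valid for 0 < k < n."""
    theta_hat = k / n
    ground_state_energy = -(
        theta_hat * math.log(theta_hat)
        + (1.0 - theta_hat) * math.log(1.0 - theta_hat)
    )
    d = 1  # number of parameters
    log_volume = math.log(math.pi)  # integral of sqrt(J) over [0, 1]
    return n * ground_state_energy + 0.5 * d * math.log(n / (2.0 * math.pi)) + log_volume


for n in (100, 1000):
    k = int(0.3 * n)
    print(n, exact_free_energy(k, n), laplace_free_energy(k, n))
```

The gap between the two values shrinks as n grows, consistent with the approximation being asymptotic in the sample size.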
The (d/2) log n term is a measure of complexity in a statistical learning model, related to the number of dimensions in which random fluctuations in the data can cause the model to depart from an optimum. This is analogous to the equipartition theorem in statistical mechanics, which states that each quadratic degree of freedom contributes half the temperature to the average energy (in units where the Boltzmann constant is one), and hence one half to the heat capacity. The terms can be rewritten inside a single logarithm, resulting in a term related to the determinant of the covariance matrix.
The speaker discusses model selection, which involves comparing two models with different accuracies or energies. If the sample size is large enough, the energy term dominates, so the model with the lower ground state energy wins. The complexity term is also considered, and the smaller it is the better. The robustness term is determined by the Hessian, which is a positive definite matrix. Its eigenvalues determine the axes of an ellipsoid, and the flatter the minimum, the larger the volume of the ellipsoid. This volume indicates how many distributions in the model are close to the truth.
The Hessian defines an ellipsoid around the ground state. The determinant of the Fisher information matrix defines another ellipsoid, related to the Jeffreys prior, whose volume is the volume of distinguishability. The ratio between the two ellipsoids measures how many almost true distributions there are. A smaller ratio means the model is more robust and less sensitive to the precise choice of parameter. The last term, V total, is related to how constrained the model is. If two models have the same accuracy, degrees of freedom and robustness, the one with the smaller space of distributions will win. Information criteria are used for model selection.
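The terms described in this part of the summary can be collected into one expansion. The following is a sketch obtained by applying the Laplace approximation under the regularity assumptions above, assuming R_N(A) denotes the average of exp(-N D(t || theta)) over the Jeffreys prior. The symbols \theta^*, \tilde{J} and V are labels chosen here and may differ from the paper's notation.

```latex
-\ln R_N(A) \;\approx\;
  N\, D(t \,\|\, \theta^*)
  \;+\; \frac{d}{2} \ln \frac{N}{2\pi}
  \;+\; \frac{1}{2} \ln \frac{\det \tilde{J}(\theta^*)}{\det J(\theta^*)}
  \;+\; \ln V,
\qquad
V = \int \sqrt{\det J(\theta)}\; d\theta
```

Here \theta^* minimizes the KL divergence D to the true distribution t, \tilde{J} is the Hessian of D at \theta^*, J is the Fisher information matrix and d is the number of parameters. Read left to right, the four terms are the accuracy, the complexity, the robustness ratio of the two ellipsoids, and the total volume of the model.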
Balasubramanian's claim is that a model selection criterion called the razor can be used to determine which model to use for a given phenomenon. It involves computing a quantity R_N, whose asymptotic expansion involves the ground state energy at the optimum and the number of parameters. It acknowledges the trade-off between accuracy and complexity, and is more careful than other criteria in modern statistics. The SLT version of this includes a term of order log log n, which is usually a constant.
When multiple global minima exist, the Laplace approximation can be used to obtain an asymptotic expansion at each of them. If the set of minima forms a smooth submanifold of the parameter space, the Laplace approximation can be used on the normal coordinates, provided the Fisher information matrix is non-degenerate in those directions. The Hessian of the Hamiltonian must also be non-degenerate there. Integrating the tangential part is simple, since all points of the submanifold give the same distribution.

Raw Transcript


right so um today I'll be kind of reporting on um a very old paper by Balasubramanian um that's written in 1996 but I think the arXiv version is a bit um newer than that um so the title is statistical inference Occam's razor statistical mechanics on the space of probability distributions that's a lot so it's it's very interdisciplinary but I think a lot of that um a lot of things in here are things that we are familiar with and I'm here to sort of report the paper and make connections to things that we are already familiar with um and also to point out sort of the germ of um various ideas in SLT that is kind of already contained in this paper right so um the the the topics um covered is um included in this list so it's mainly about uh model selection in Bayesian learning um making connection of that to statistical um physics and it include so the Occam's razor part is kind of a geometric um formulation of the principle of the meaning of Occam's razor right and we'll discuss a bit about phase transition in this that is discussed in this paper right um there's a note though um that I need to state upfront which is um this is a regular model setting right that's a very big departure from our sort of the main focus of this seminar um the the the the title of the paper has is a bit of a giveaway already so it says that um it is doing the study um the inference and describing selection criteria on the space of probability distributions and as we go along we will observe that what he means is that we don't really care about how how it is parametrized we only care about the set of distributions we are using to learn a particular to to learn a data generating distribution right um another note is that I will be using the notation in the paper but when I discuss um if there is a correspondence to SLT I would use SLT notation and convention okay so a word on the model selection method so some set up so as usual um we have um a set of capital N 
events the paper's notation is a set of E's so e 1 to e capital N so drawn i.i.d. um from a true distribution that is denoted little t which can be confusing so the question of model selection is given two I'm going to call them sets for now two sets of distributions so I have a set A and I have a set B both of them uh the subset of the space of distributions I'm not going to spell that out um uh rigorously but it's um the set of probability distribution that is supported in the same sample space as the true
distribution so I'll call that just dist right um so we wish to the problem of model selection is that we wish to ask which one is better for learning t if we are given this set of events and we are using by learning we mean we are using um Bayesian learning so meaning to to predict the probability of a new event we are going to take um the weighted the posterior weighted average over distribution inside each of these A B sets and ask how close that is to the true distribution right well um the Bayes' theorem tells us that the probability of this and this this set A this set of hypothesis given these events or given these evidence um is equal to well how how likely do you think A this set of A is before you see anything the probability of those events occurring and then the probability of that event given that you are assuming A is the correct one okay um and this quantity here well now we introduce um the fact that we are not just talking about any sets of distribution we are talking about models so A and B are model and by that I mean parameterized models and which means that that um so so A is the set of distribution on E parameterized by some theta for theta in some parameter um that is R typically right which means that this factor here is equal to something that we know right and that is um that's known as model evidence okay so if we manage to calculate every quantity in here so we if we get model evidence and since we are um working things out in sort of the theoretical sense people might come in with different value of P of A and P of B so those are probably subjective so um theory-wise we don't care about those and if you are comparing P of A given E versus P of B given E um P probability of E probability of that event is just a normalizing constant so really we're just comparing model evidence right so this is um this is on a high level what Bayesian model selection tells us to do if you have two different model pick the one with higher model evidence right so the rest of 
the story as we are familiar would just be about compute um getting a handle on this model evidence which is usually a very daunting task um before that though um there is something to do which is a choice of prior need to be made first and that's just a perennial debate on the podcast what choice of prior need to be made um but in some sense this paper is as mentioned in the title is more concerned about the set of distribution than it is about how they are parameterized um so the parameter theta is really just
an index um so there is already an implicit assumption that the map from parameter um to model is one to one right so it's a unique index the parameter space is a unique uh index set for um the space of distribution right um so here we want to um if you want to have a prior on this parameter space we wanted to reflect the fact that we care more about things in its image than how it is parametrized so um we want our prior to be um reparametrization invariant another way to say this is that we are kind of embedding our parameter space into the space of distribution right and two okay we want um the prior to be reparametrization invariant but even in the image even the image in the space of distribution we still need to choose which means to choose which before we look at data which distribution is more likely than the other but um the most usual choice is to say we shouldn't assume anything we should be the most parsimonious um and choose something called follow a principle called uninformative prior so we don't assume any information before we look at data right um and popular choice is uniform prior or maximum entropy prior and here we are applying that principle not on the parameter space but in the space of distributions so basically we want um distribution in A to be equally likely quoting two okay but then this is a statement about the continuum there is a continuum amount of distributions that we're talking about so we need to deal with that right which um I um I'm going to say that I will there is a huge section which is I think according to other authors uh an original contribution of this paper which is to discuss if we want to apply the this this principle um we can arrive at something called the Jeffreys prior and um and it is characterized um so that the author gave an original contribution which is justifying the use of Jeffreys prior by um stating that um by doing by doing this counting by by doing um okay let me let me summarize that a little bit so okay 
so we have infinite number of distributions um but but we can sort of discretize them by asking um if we have two uh let's say two parameter theta 1 and theta 2 representing P of theta 1 and P of theta 2 so we can ask um are they equal are they the same um at a fixed resolution level so what this means so what is so this is a this is the concept of statistical resolution so um one way to think about that is if we have um if you are given a sequence of if we have a sequence of observation
um x i drawn from P of theta 1 so if is this so if I have n of them is this enough to distinguish the um that if we only I can only observe the x's can I tell that it definitely it definitively comes from P1 and not P2 right and and then I need not only do I need so as n goes to infinity then um if P1 and P2 are is is is actually different distribution then you can almost surely tell if you have a large enough sample size sample size but for finite n you can also distinguish them you can you can ask whether or not they are distinguishable up to a confidence epsilon um right so the idea is that you pick an n pick an epsilon um partition the set the the partition A the set of distribution into things that are equal to each other so um take the equivalence class under this identification and assert that this finite set of things have equal probability and then you take n and epsilon so n to infinity and epsilon to zero and handle all the technical difficulties and you can you can arrive at Jeffreys prior so the the the important idea in here is um so if you read the paper it uses something called Stein's lemma and just to be careful it is not the more famous Stein's lemma it is um another Stein's lemma um co with Chernoff so it's Chernoff-Stein if you want to look it up another way to get to to the desired prior distribution would just be uh going from I guess going from information geometry so um okay three steps so so we want a prior w of theta such that if we um reparametrize by a smooth change of chart we get so W2 is after the change of chart do you mean phi is the new coordinate so phi theta is that yeah oh yeah okay phi is the new coordinate so change of coordinate that's the determinant I really don't get the motivation for this um yeah so there's the whole Jeffreys prior thing I know like Thomas really emphasized that Balasubramanian cares about it like I get why it's theoretically nice and it's a kind of in a way 
it's a prior itself right it says that absent any other information it shouldn't be that somehow like this is a kind of reasonable prior about what a model should be like I guess but in practice like the process of searching for the truth does not treat all parameters as equal say in a neural network right it's like um so I wonder if if actually this reflects a prior that's too pure like it's too like too physical like too physics oriented too symmetry oriented and actually not really a reasonable assumption about the
process of searching for the truth in the systems that we actually care about I don't know if you have a response to that yeah I do have a response to that and that and the the response is that basically I agree and I um I do think this is kind of a major major roadblock to um understanding so major roadblock to to to even see that there is a possibility of singularity um so because like once we continue with defining Jeffreys prior you just wipe out all possibilities of singularity because Jeffreys prior is is not defined at singularities right but I think that I think the the giveaway is really the motivation of of this paper come from comes from a very classical um probability theory where um where things are usually analyzed first in a discrete case so like the the set A and B the set of distributions are sometimes uh is usually assumed to be discrete so it's like yeah it's like choose any prior and then um if it is discrete or finite um then the the uniform prior is easy to easy to define if it is infinite well you you need some regularization but then certainly if in the continuum you suddenly need to do this discretization and I think this is where where this come from the focus is on the the space of distribution and they kind of don't care about uh degeneracy in parametrization yeah I wonder if I don't have exactly the opposite kind of orientation ideologically which is uh I don't care about the prior because the prior disappears in the asymptotics as far as I'm concerned so that makes me not a statistician I guess right so I just don't care what the prior is basically um for me the prior only ever shows up when I try and do integrals and it's a nuisance if it doesn't have a good form like in Spencer's thesis or something but I suppose if you're coming from Balasubramanian's background it's rather the opposite that you don't care what happens in singularities because you don't think about that um and somehow the the need to choose a prior seems like this really 
objectionable thing and therefore if you can choose this natural one that's always there that seems like something you should do is that kind of where the ideology is yeah yeah I think so um except that's uh okay so uh when I get to the thermodynamics point of view there is a sense that the prior corresponds to um density of states okay not that CSA is in sorry density of states that people measure so it's like um um I I coordinatize my my physical system in some way and I go ahead and measure the
degeneracy like at a certain energy level there is a that like you can have multiple different more configuration than other energy level and things like that um uh so in in this in a sort of thermodynamics connection sense um the degeneracy is definitely very important um and which means in some physical system people choose the coordinate first so so the focus becomes what is the prior on on their parameter space um yeah I think we will touch on these points um again um okay so um so let me sort of define um the Jeffreys prior in a in a very quick and dirty way um and not very rigorous so we know um I think everyone here is familiar enough with the KL divergence which sort of measure um how different is it it's um how different your P theta 1 and P theta 2 um is to each other and that that kind of difference is what what I described uh earlier so how how many n's do you need to distinguish um two um two distributions at level epsilon as n goes to infinity and epsilon goes to zero um right so this is not a this is not a metric but um we can do a Taylor expansion centered around theta 1 so that this will be zero plus zero plus well the quadratic term right so the um the constant order term is zero because I'm doing doing um let me actually define delta theta is equal to theta 1 minus theta 2 sorry theta 2 minus theta 1 so I'm doing the uh Taylor expansion near theta 1 
of this expression um so we know that when theta 2 equals theta 1 this the it's it's zero that because the KL divergence is zero when the distributions are the same um and the second the the linear order term is zero because we know that um the KL divergence is never this is always non-negative which means that if you hit zero it is definitely a local minimum and hence a critical point hence the linear order term vanish and therefore the first non-zero term is the second order term and um and J is the Fisher information matrix evaluated at theta 1 right so this matrix do define um so it defines uh a Riemannian metric on the parameter space right so and actually this is kind of a general property of divergences and every time you have a divergence you you get the the quadratic form you can derive the quadratic form from it and get a Riemannian metric everywhere it's non-degenerate right yes that's the next um next very important point so it doesn't work if J of theta is singular so it doesn't work in neighborhoods where um theta is singular right so um okay um okay so we define so using this we
define our reparametrization using the properties of a Riemannian metric we define our reparametrization invariant um prior the Jeffreys prior which is defined by the square root of determinant of this metric but obviously normalized so that it is actually a density okay even just looking at the form of this prior we know that this prior is not supported whenever the Fisher information matrix is is singular meaning the determinant is zero so it immediately wipe out at any place in the parameter space um where things are singular right and second thing to notice is that the denominator um the the normalization constant might not exist is um if the integral does is not convergent um but if we assume that our parameter space is compact which is a very common assumption and we will assume that throughout then this is not an issue okay um any question before I move on like the canonical choice uh so the kind of it okay it's kind of canonical according to some definition of canonical it's it is it is the prior that satisfied um it is it is a uniform prior on the space of probability distribution so you have a model and you look at um the set of all distribution the model parametrizes and you say that I want every distribution in that set to be as likely as any other and um put in some smooth smoothness condition um you get Jeffreys prior yeah yeah so if if the if the Fisher information matrix indicates that changing the parameter a little will change the model a lot so somehow you're in a region where few parameters describe many models then it will put more mass there and if you're in a region where big changes in the parameter have relatively small changes in the model so that many model many parameters cover only a few models and then it will put relatively less mass in order to achieve what Edmund was saying if that helps yeah great thank you the extension of that is that if there are there's a continuum of parameters that all give the same model then that's going 
to be a singularity of the Fisher information matrix, so that's going to have zero mass, or undefined, or something. Yeah, I think undefined is the better one.

Okay, so the next order of business is to list a bunch of connections of Bayesian learning with statistical physics, and then sort of start doing mathematics as if this is physics. That can be a good or a bad thing, depending on who you ask. I'm going to take this from a different paper, by LaMont and Wiggins, 2019. That paper is sort of an
extension of Balasubramanian's ideas and fleshes out a few more things, so I think this is still kind of the same body of work. So we start with observing that the posterior distribution is proportional to the prior multiplied by some exponential weight. I'm going to write it in a funny way, which is probably not funny for everyone here because this is a familiar thing. So what we have here is the log likelihood, and this is the average negative log likelihood, which we know eventually converges to a function of theta. That's the prior. So the connection to statistical physics should sort of be gleaned from just this equation. Let me draw a table.

Okay, so there is a function, namely this function here, which, given a theta, computes a number that improves if the theta is closer to the true model. The analogy is that in statistical physics we have the energy, or Hamiltonian, H of theta, and going across to the Bayesian learning world we get the negative log likelihood, which is this function here. That also means that we are thinking of the parameters as state variables, also denoted theta. In statistical physics we have the temperature T and the inverse temperature, one on T, and the analogy, from looking at this equation, is that the sample size n is equal to one on T, the inverse temperature. The prior is like the density of states. Usually the exponential weight tells you the relative probability of being in two states, r1 and r2, according to the Hamiltonian, but you need to multiply that by the degeneracy at that state, and that's the density of states. So the Boltzmann weight, given a Hamiltonian at a temperature T, is defined to be, letting beta be the inverse temperature, e to the
negative beta times the Hamiltonian; that's the Boltzmann weight, and then multiplied by the prior. Okay, so we have the same thing in learning land. I don't know why I repeat that, but now these things are defined in terms of statistical quantities. And then there is the partition function, which is the normalizing constant for the Boltzmann weight: Z at beta is equal to the integral, d theta, running over the parameter space, of the Boltzmann weight. And here we have Z_n, which is equal to what we were talking about before, which is the model evidence.
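The part of the board table described up to this point can be written out as follows. This is a reconstruction from the spoken description, not a copy of the board: the symbols $\varphi$ for the prior and $x_i$ for the individual samples are notation assumed here, and the explicit form of $H_n$ is the one the speaker states a little later in the talk.

```latex
\begin{array}{lll}
\textbf{Statistical physics} & & \textbf{Bayesian learning} \\[2pt]
\text{state variable } \theta
  & \leftrightarrow & \text{parameter } \theta \\
\text{energy / Hamiltonian } H(\theta)
  & \leftrightarrow & H_n(\theta) = -\dfrac{1}{n}\sum_{i=1}^{n}\log p(x_i \mid \theta) \\
\text{inverse temperature } \beta = 1/T
  & \leftrightarrow & \text{sample size } n \\
\text{density of states}
  & \leftrightarrow & \text{prior } \varphi(\theta) \\
\text{Boltzmann weight } e^{-\beta H(\theta)} \times \text{(density of states)}
  & \leftrightarrow & e^{-n H_n(\theta)}\,\varphi(\theta) \\
Z(\beta) = \displaystyle\int e^{-\beta H(\theta)}\,\varphi(\theta)\,d\theta
  & \leftrightarrow & Z_n = \displaystyle\int e^{-n H_n(\theta)}\,\varphi(\theta)\,d\theta
  \quad \text{(model evidence)}
\end{array}
```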
Okay. Usually in statistical physics, once you get your Hamiltonian you write down your partition function, and every other quantity can and should be defined from the partition function, so it's kind of a generating function for the physical quantities that we care about. Before I talk about that, I should say that the Boltzmann distribution is just the normalized Boltzmann weight, and this corresponds to the posterior. Okay, so quantities that we might care about in statistical physics: for example the average energy, the ensemble average of the internal energy. Well, that is defined by... let me define the free energy first, sorry. The free energy is usually defined as the negative log of the partition function, and the same thing here: the free energy F sub n is equal to the negative log of the partition function, or the evidence in this case. So the average internal energy is given by the derivative of the free energy with respect to beta. You can kind of see why that is the case if I use the notation here: if you take the derivative d by d beta under the integral, you get just a Hamiltonian in here, and then you are basically integrating the Boltzmann distribution against the Hamiltonian, which gives you the average energy. That's kind of the theme of how the partition function gets used as a generating function. Okay, so what is the average energy in statistics world? The energy should look like dF by dn, well, I'm using capital N, dF by dN, right?

Okay, so there are questions? I'm just trying to parse the... it's just that your formula has a beta in it, right? I mean, the partition function in the right-hand column actually has a beta in it. Sorry, beta means n, sorry. Yeah, and in the line above as well. If that's changed then I understand. That's n, sorry, the beta is n. Yeah, and this is probably nitpicking at this point, but somehow T is one
on n, right? Like, on the left-hand side T is a primitive quantity and then beta is introduced, but on the right-hand side n is primitive. Yeah... primitive. Well, it has a meaning; I don't have to explain it on the right-hand side, on the statistics side. Yeah, I actually think that, well, I'm not settled on the interpretation, but I actually think that in this case, when we are doing this analogy, it is kind of a variable, because notice that... I thought someone would call me out
on this, because in the first line the Hamiltonian, the log likelihood, has an n in there, right? But really there isn't, because if I write this as negative one on n, sum over i equal to one to n, of log p of the individual things, it's really... so really I'm thinking about the large-sample average version of this thing. It's almost like I'm thinking about another quantity called n prime, and then I'm changing my posterior distribution to ask: what about n prime, what about another n prime?

I'm not following the n prime thing. So n prime is kind of an independent thing, it's not... Okay, okay, you're saying that [inaudible]. Yeah, yeah, okay. And this is to do with, like, the stochastic fluctuation or whatever. So the H somehow... I mean, you didn't actually write anywhere that H is equal to that formula, right? Yeah, right. Okay, so then in the formula for Z_n there's kind of a one on n in the H, and then it cancels with an n in front of it, but that's not stupid, because the one on n that's in the Hamiltonian you kind of think of, n being sufficiently large, as really an average, and the n is like... you could factor out the difference between that and the actual true average value and then you'd still have that in there, so that's like a real thing. And that's what you mean by the n prime, that it's the real thing, is that right? Yeah, the real independent thing, the one you can actually take derivatives with respect to and stuff like that. So even though it is discrete... we shouldn't really take derivatives, yeah. So it's an imperfect analogy, but it works in various different regimes.

Okay, right. So this internal energy actually corresponds to... just to recall, if I take the discrete version of this, if I have F_{n+1} minus F_n, that is related to the Bayesian
generalization error. So this thing is actually related to generalization error, so somehow internal energy is related to generalization error. And then in statistical physics we can also talk about heat capacity: how much heat do I need to inject to raise the temperature by one degree? That can be derived from the partition function as negative beta squared times the second derivative of F. And here you can also try to look at the analogous quantity
d squared F by d n squared, and this is actually equal to the learning coefficient. So when n is large enough, the leading-order term is equal to d on 2 for a regular model, as in this case, but it will be equal to the RLCT in general. Okay, any question about that? If not, let's get back to our original task. So the model evidence now is written as...

So there's a question. Also, I've got really a silly question, but is the learning coefficient here like the RLCT, or the one used in, like, SGD? Wasn't it the same thing? I don't think there is a learning coefficient in SGD; I think you are talking about the learning rate. Oh, sorry, yeah. Yeah, so learning coefficient means the RLCT: the rate of learning as you increase your number of samples by one, the decrease in generalization error as you increase your...

Sorry, I was away for a minute, so maybe you said, but is it actually in Balasubramanian's paper to talk about this derivative as the learning coefficient? Is that a question or is that a statement? That's a question. Oh, no, it is not. As I said, that table is based on LaMont's paper, which is also based on Balasubramanian's. Ah, okay, yeah, that's cool. I mean, obviously I agree with that, I just hadn't seen it explicitly written somewhere.

Right. So using everything that we have been talking about so far, the evidence now has the following expression. So there's this prior term, e to the negative n... Okay, so rewriting everything using the physics analogy, we have the Hamiltonian equal to negative one on n times the log likelihood, the average log likelihood, and using the i.i.d. assumption we get this. By the law of large numbers we know that this converges in distribution, in fact almost surely, to the KL divergence between the true distribution and the density at theta, plus the entropy of the true distribution. I hope that's okay. Yeah. Okay, so we
are going to do what was alluded to earlier, which is: let's isolate the part of this Hamiltonian that depends on n, and isolate the part that doesn't depend on n or even on the data set. So there's this part; let's define that part to be this function, which is the KL divergence from the truth, and define H_D to be equal to the original Hamiltonian minus... yeah. Okay, notice that this has the form of a random variable minus the mean of that same random variable
and which means that as you collect more and more data... so, sorry, the first term doesn't depend on E, capital E, and definitely doesn't depend on n; the second term is something that should go to zero, since the random variable converges to its mean by the law of large numbers as n goes to infinity. So, sort of morally speaking, the Hamiltonian, the energy function, should be the first term.

I have a question in here, which is: Balasubramanian calls the first term the H-naught term, so we have H of theta equal to H-naught of theta plus H_D of theta. This term is what is called the quenched Hamiltonian, and the next term is the contribution from random defects. So where does the randomness come from? It comes from the randomness in the samples, given by E. I have a very bad understanding of the statistical physics of disordered systems, but my current understanding is that this term is more like the annealed Hamiltonian. The analogy that people usually use is this: let's say we are studying a statistical physics system, say the Ising model. The Ising model has a Hamiltonian given by the lattice, and for each configuration of lattice sites we have an energy function. But we can make this a disordered system by introducing defects into the lattice, which effectively gives a random perturbation to the defect-free, or crystalline, model. If we compute quantities using the full Hamiltonian and fix a particular data set, then you can compute that quantity again and again for different sets of defects, and once you compute the quantity and average over all possible defects, I think that is called the disorder average. But if you sort of ignore the mathematical validity and
insert that expectation operator, which means, let me compute the average Hamiltonian, if I average over all possible samples, or all possible random defects, I thought that is more like the annealed average. Dan, do you know anything about this? This is not very important.

Yeah, not enough to give a good answer immediately. I think it's probably, given our current interests, a bit of a distraction. I think it doesn't matter too much; I don't think the physics is offering us much, except if you want to learn or use
the replica trick or things like that in here, which I don't think we are anywhere close to doing. So, yeah, that's a good point, someone should give that talk soon, maybe me.

Okay, but the upshot is that this is the expression, but we can make the analogies to statistical physics better by talking about the annealed or quenched version, by averaging out over all the defects. Just to show that this thing here, in the SLT notation we are used to, is little n times K_n of w, and this is the other version. So the analogy is that K is our Hamiltonian and K_n is the disordered version of that.

Okay, we will go ahead and do an asymptotic expansion of the expression. Balasubramanian is using this analogy to say that if n is the inverse temperature, the large-n limit corresponds to very low temperature. And there is an entire literature in statistical field theory about low-temperature expansions. I admit that I looked at the reference given and it's referencing an entire textbook, so I didn't actually go ahead and read what that means. But I think we have an alternative route to getting that, so I'll be presenting that instead.

First we have a bunch of assumptions to even start doing this. We have a ground state: let theta hat be the ground state of our Hamiltonian, which just translates to it being the MLE, meaning the argmin over all the parameter space of our Hamiltonian. And we require that the Fisher information matrix is positive definite there. Otherwise, what happens if it is not positive definite? Then it is definitely positive semi-definite, which means that it is singular, which means that the Jeffreys prior is zero there, which means that parameter is ruled out before we even look at any data. So we
also require that the MLE, the ground state, is in the interior of the compact parameter space; the fact that we can define the Jeffreys prior already implies compactness. And other local minima have energy bounded away from H of the minimizer by a positive constant b. So that implies this is a unique ground state, right? Any other minimum has higher energy by at least b. And finally, n is large enough that the neighborhood around theta hat dominates. Well, this is a very poorly
worded sentence that I wrote, but it essentially means we are talking about the asymptotic regime, and n is large enough that, recall, any state that is away from the ground state has probability equal to e to the negative n times the difference in energy, which means that anything far away from it is going to be exponentially suppressed. Okay, and from here we will just apply the Laplace approximation, or the saddle point method, or really the Morse lemma, but probably a sort of spelled-out version of the Morse lemma, or statistical field theory, which I don't know anything about, which is the actual thing that he used.

That gives us... so he introduces a new quantity, equal to negative log P of E given A, which means that this quantity is just a free energy, right? And it is equal to n times the ground state energy plus d on 2 log n. Okay, I think the first two terms are familiar to everyone. It's the order-one term, in this very specific case, that could be of interest to us. So the order-one term is equal to half log of the determinant of J, the Fisher information matrix, divided by the determinant of I; I'll explain what that is later. Okay, so I is the Hessian of the Hamiltonian, which is quite distinct from the Fisher information matrix for finite n. And V total is the normalizing constant for the Jeffreys prior, which means that it is the total volume of the parameter space as measured by the Jeffreys prior, which is also kind of how many total distributions there are in this model A.

Okay, so the next thing is to interpret this through some geometrical lens. The order-n term, well, that's n H, that is accuracy. Oh, I should mention that, if you're thinking of the free energy expansion in SLT, this is not quite the first term; the first term in SLT is n K of w-naught, right? Actually, let me just write the next... and w-naught is actually
the argmin of, not the perturbed Hamiltonian, not the Hamiltonian with defects, but the annealed or quenched Hamiltonian. Those two things are close to each other as n goes to infinity if our model has something called asymptotic consistency, and that just means that if you have a large enough sample set, your best-guess parameter will be the parameter that is closest to the truth. For regular models that is guaranteed; not for singular models.
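As a rough numerical check of the free energy expansion described above, here is a minimal Python sketch for a one-parameter Bernoulli model. The model is a toy example chosen here, not one used in the talk, and the function names are hypothetical. It assumes the standard Laplace form F_n ≈ n H(theta hat) + (d/2) log(n / 2 pi) + log V_total + (1/2) log(det I / det J), with J the Fisher information and I the Hessian of the Hamiltonian at theta hat; the sign convention for the last term is an assumption, and for this particular model the term vanishes because I and J coincide at the MLE.

```python
import math


def exact_free_energy(k: int, n: int) -> float:
    """Exact -log evidence of one binary sequence with k ones out of n,
    under the Jeffreys prior for a Bernoulli model, i.e. Beta(1/2, 1/2)."""
    # Evidence = B(k + 1/2, n - k + 1/2) / B(1/2, 1/2), and B(1/2, 1/2) = pi.
    log_beta = (math.lgamma(k + 0.5) + math.lgamma(n - k + 0.5)
                - math.lgamma(n + 1.0))
    return -(log_beta - math.log(math.pi))


def laplace_free_energy(k: int, n: int) -> float:
    """Asymptotic expansion of the free energy around the MLE (needs 0 < k < n)."""
    d = 1                              # number of parameters
    theta_hat = k / n                  # ground state of the Hamiltonian (the MLE)
    # Ground state energy: average negative log likelihood at theta_hat.
    h_hat = -(theta_hat * math.log(theta_hat)
              + (1.0 - theta_hat) * math.log(1.0 - theta_hat))
    fisher = 1.0 / (theta_hat * (1.0 - theta_hat))   # J(theta_hat)
    hessian = fisher   # I(theta_hat): equals J at the MLE for this model
    v_total = math.pi  # integral of sqrt(J) over (0, 1)
    accuracy = n * h_hat
    complexity = 0.5 * d * math.log(n / (2.0 * math.pi))
    robustness = 0.5 * math.log(hessian / fisher)    # zero here
    constraint = math.log(v_total)
    return accuracy + complexity + robustness + constraint


if __name__ == "__main__":
    for n in (10, 100, 1000):
        k = 3 * n // 10
        print(n, exact_free_energy(k, n), laplace_free_energy(k, n))
```

Running the script prints the exact and approximate free energies side by side; the gap between them should shrink as n grows, which is the sense in which the neighborhood of theta hat dominates.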
So the first term represents accuracy: that is just how close the closest distribution in A is to the truth. The description that I just gave is closer to the SLT version, but here there is a difference between theta hat and theta star, I guess.

Okay, the order log n term: it's d on 2 log n. Putting on the statistical physics hat this is degrees of freedom, or with the statistical learning hat that's the model complexity, well, for a regular model. This d on 2 comes from... in the analogy to statistical physics, this comes from the equipartition theorem, and that's an elementary statistical mechanics theorem that says that every degree of freedom contributes around half times the temperature, well, there's a Boltzmann constant somewhere in there, so I'm using units where that is one, half T, to the thermal energy. So remember, d on 2 is, in the regular case, our learning coefficient, which based on our correspondence is equal to the leading term of the heat capacity. And the same kind of thing in statistical mechanics tells us that every degree of freedom... so for example, a gas in three dimensions has... huh, did I forget something? I was about to say six degrees of freedom, three for the spatial dimensions and three for velocity. That's right, times the number of particles. Times the number of particles. So the heat capacity becomes six halves N k T. I thought it's three halves N k T. What am I missing here? I remember three halves N k T. Where did the... Yeah, I don't know off the top of my head. Okay, but I'm just saying this here to say that the analogy still holds in this case.

And from the statistical learning side, this is related to the number of ways random fluctuation... so the number of independent dimensions in which fluctuation in the samples can cause the model to depart
from the optimum. So if your model is very complex, then noise in the data can use those extra degrees of freedom to overfit. And then the, I guess, less familiar term is the order-one term. So I'm going to rewrite those two terms by using some properties of logs to rewrite them this way. Okay, so that term is... if we look at the top term, the determinant of the Hessian...

Sorry to interrupt you, Edmund, I guess I don't understand what the log n is doing. I
mean, the story you were just telling was about d on 2, basically, right? Hmm. But in the first line, I mean, that whole thing with the n, is it like a meaningful thing? No, no, the accuracy should be about H, just H, and so in the right-hand column I'm only talking about the coefficients. Yeah, okay. Right, because even in the order-n term, if you have two states with different accuracies, or different energies, then if n is large enough the lower-energy term always dominates, right? Okay, so you're thinking of this in terms of model selection again or something? Yeah, you're comparing. Yep.

So you're comparing two models, one model that is less accurate, and if my n is large enough then I will always choose the more accurate model. But if n is large but not large enough for the first term to completely dominate, we look at the next term in the expansion, and that's the complexity term: the smaller the complexity, the better. And then if n is smaller yet, we look at the order-one term. Balasubramanian calls this particular term the robustness. I will explain why he called it that. So, the determinant of the Hessian: if the Hessian is not degenerate, and we know that we are expanding the log likelihood near its unique global minimum, then the Hessian is positive definite and therefore has all positive eigenvalues, and those eigenvalues correspond to the sizes of the independent axes of an ellipsoid.

Yeah, what was that? Oh, I just said flatness. Right, yeah, that is right. So flatness is: the larger the volume, the higher the flatness. Yeah, exactly. So I was about to say that the determinant of that is the volume of the ellipsoid centered at theta hat, and... Well, it's just
there's a more lowbrow way of saying it, which is that locally that means something is expanded as lambda 1 x 1 squared plus dot dot dot plus lambda n x n squared, and the lambdas are the eigenvalues, so this is one over the square root of that product. So the more curved it is, the higher the lambda, the smaller this number; the flatter it is, the smaller the lambda, the larger this number. Yeah. So, the volume of the ellipsoid containing good distributions. By good distributions, we are
in a neighborhood of the MLE here, we are in the neighborhood of the ground state, so I'm looking at an ellipsoid around that, and everything contained within that ellipsoid is close to the ground state, so those are good hypotheses to model the data. But what is the other term here, the determinant of the Fisher information matrix? Well, this is another ellipsoid, but this time... by the way, this order-one term comes from the fact that we are using a specific prior, the Jeffreys prior. So this term is related to the volume of indistinguishability. This is coming from how the Jeffreys prior is defined, which is: what is the typical volume within which I can't distinguish distributions? So the ratio between these two has the meaning: how many distinguishable, almost-true distributions there are.

Okay, let me draw a picture. That's the MLE, and that's the ellipse around the MLE, and the distinguishability volume is very small, meaning that there are a lot of essentially different good guesses. What that means is that if your MLE changes a little bit and goes here, that is actually okay, because you are still in the neighborhood of good guesses. So this is saying the more robust you are, the less sensitive you are to the precise choice of parameter.

And we have kind of already discussed the last term; let me write it in with the two pi to the d on 2. Again, everything here is the smaller the better; I'm writing it in a way where smaller is better. So V total is just the total... this is related to how constrained our model is. So if two models have the same accuracy, the same degrees
of freedom, and the same robustness, then whichever model has the smaller space of distributions, that one wins.

Okay, I'll quickly go through... oh, let me say a word about where Occam's razor is in here. So the idea is that we should be using, now on the correct board, the razor for model selection; call that R_n of a model A. I think a sort of less philosophically inclined language for this would be information criterion, in modern parlance. So there is the R_n
of A, which is defined to be the quenched version of the evidence. Okay, let's explain that a little bit. So this is H-naught there. This is saying: if you know what the true distribution is, okay, that's a big if, but suppose you want to study a range of phenomena and maybe you know what class of true distributions it is, and you have various models to learn distributions in that class, and you have two models and you want to know which one you should use, then this razor is saying: compute this R_n quantity on model A and compute it on model B, and whichever is larger, so this is related to the evidence, whichever has higher evidence, use that one.

I want to emphasize the fact that this criterion depends on n. First things first, this is a tall ask; this is really hard to calculate. So what you would do is write out the asymptotic expansion of this, which is just what we have been given before: H-naught of theta star plus d on 2 log n, plus, actually minus, robustness and minus constraint. So you compute these quantities for two models and compare their leading-order terms, and you can compute each of these things: for the second term you don't need to compute, just count parameters, and for the first term you do optimization to find the ground state energy, and so on and so forth. Obviously this is a regular-model story.

I want to emphasize that this is already a more careful definition of model selection criteria than a lot of modern statistics: it acknowledges the fact that it depends on the truth. Colloquially people sometimes say things like "complex models are bad", but here it at least knows that there is a trade-off between accuracy and complexity, and then
there are even higher-order trade-offs. Here I should also say that the SLT version of this is the very familiar... in SLT there is actually another term, another order, which is the log log n order, which for practical values of n is probably just constant. And in SLT we haven't... we can probably express the constant-order term.

Okay, for the next couple of minutes I'm going to mention some generalizations that are already mentioned in Balasubramanian. I'm just going to
do it colloquially, so verbally, sorry. So if you have multiple global minima you can still fix this process. That's one assumption violated, but what it amounts to is that you do the Laplace approximation at each global minimum, which means that you get this asymptotic expansion at each of them, and then take the sum. So the first term will be a constant multiple of the first term, and the second term will be another constant multiple, but the order-one term will be different. Recall what the order-one term involves: now you have different global minima, and the order-one term will be a proper sum of different quantities.

Okay, a second step of generalization: what if the set of minima forms a smooth submanifold of your parameter space? In this case, your parameter space is Theta and the set of minima looks like, say, that. In that case you would choose coordinates that can be partitioned into two parts, the tangential part and the normal part, and then do the Laplace approximation on the normal coordinates and...

Wait, but it's not enough for that set to be a smooth submanifold; you'd need the Fisher information matrix to be non-degenerate as well. That's a prevailing assumption, I guess, right? No, actually no. So this is the set of local minima of the energy function, which doesn't involve the prior, so this is still a well-defined concept. It is correct that the prior will be zero there, but here we are basically saying... Oh no, I'm not talking about the prior. Okay. So it would be the same to say, what I'm saying is that the Hessian of the Hamiltonian would have to be non-degenerate. Yes, yes. But that's... Oh sorry, but that's the hypothesis at the beginning, that it's regular. I mean, I guess maybe it's... Yes,
yeah, yeah, okay. Right, the thing is, basically everywhere where you are taking a determinant, you are taking the determinant of a matrix where there is a bunch of zero eigenvalues, so basically don't take the determinant of the full matrix; take the determinant where the eigenvalues are positive, and integrate out all the other dimensions separately. And integrating the tangential part out should be easy, because there's only one distribution there, right? So there's