WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed.
The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except
for errors at the level of individual words during transcription.
Statistical learning is a process used to approximate a truth generating the data by minimizing a log loss function. The law of large numbers implies that the normalized log loss converges to its mean, and the central limit theorem controls its fluctuations through the variance of the terms making up the log loss. A tanh regression example and the Fisher information matrix are used to illustrate how singular models behave and how the posterior is affected even when the true parameter is not itself singular. Main Formulas One and Two together give the asymptotic free energy of the system, and adapted coordinates obtained from a resolution of singularities divide a neighborhood of a singularity into two-to-the-n local charts. The square root of K is introduced to make the variance of the fluctuation process constant on the pullback of the set of true parameters.
Statistical learning is a process that produces an estimate approximating a truth generating the data. It factors into two steps: a model with a prior, and a log loss function used to reach an estimate (p hat), for example via minimization. The law of large numbers implies that the normalized log loss converges to its mean, and the central limit theorem shows that its suitably scaled fluctuations are Gaussian with a variance that is a function of w, equal to the variance of the terms making up the log loss. When the true parameters are substituted in, this variance becomes zero.
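As a point of reference, here is a minimal reconstruction of the quantities being summarized, in the notation of Watanabe's grey book and assuming the realizable case q(x) = p(x | w0):

$$
L_n(w) = -\frac{1}{n}\sum_{i=1}^n \log p(X_i \mid w), \qquad
K_n(w) = L_n(w) - L_n(w_0) = \frac{1}{n}\sum_{i=1}^n f(X_i, w), \qquad
f(x, w) = \log\frac{q(x)}{p(x \mid w)}.
$$

By the law of large numbers K_n(w) converges to its mean K(w) = E_X[f(X, w)], the KL divergence, and by the central limit theorem

$$
\sqrt{n}\,\bigl(K_n(w) - K(w)\bigr) \;\xrightarrow{d}\; \mathcal{N}\bigl(0, \sigma^2(w)\bigr),
\qquad
\sigma^2(w) = \mathbb{E}_X\bigl[f(X, w)^2\bigr] - K(w)^2,
$$

which is zero at any true parameter, since f(x, w0) = 0 there.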
The likelihood function is a random variable which fluctuates with each training sample, illustrated using a normal-distribution model. As the sample size increases, the distribution of parameters around the true parameter becomes tighter and more Gaussian, properties known as asymptotic consistency, normality and efficiency. In a singular model, such as a tanh regression in which the measurement noise follows a normal distribution with variance 1, the true parameter still becomes the most likely parameter as the sample size increases.
Because tanh is an odd function, the likelihood in this model has two peaks and is not Gaussian even asymptotically. Increasing the training sample size from 500 to 5,000 makes the distributions tighter, while from top to bottom of the plots the true parameter is moved closer to the axes. The Fisher information matrix measures the local identifiability of a model, and when the true parameter is placed on the axes the matrix is degenerate and the KL divergence has no quadratic approximation. The plots illustrate that the posterior is strongly affected by this degeneracy even when the true parameter does not lie on the set where the matrix is degenerate.
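A small numerical sketch (mine, not code from the seminar) of this degeneracy for the tanh regression model y = a·tanh(bx) + noise with unit-variance Gaussian noise, assuming inputs x ~ N(0, 1); the determinant of the Fisher information matrix vanishes exactly on the union of the axes ab = 0:

```python
import numpy as np

def fisher_info(a, b, n_mc=100_000, seed=0):
    """Monte Carlo estimate of the Fisher information matrix of the model
    y = a*tanh(b*x) + N(0, 1) with inputs x ~ N(0, 1).
    For unit-variance Gaussian noise, I(a, b) = E_x[grad f grad f^T],
    where f(x; a, b) = a*tanh(b*x) is the regression function."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_mc)
    df_da = np.tanh(b * x)                  # derivative of f with respect to a
    df_db = a * x / np.cosh(b * x) ** 2     # derivative of f with respect to b
    g = np.stack([df_da, df_db])            # 2 x n_mc array of gradients
    return g @ g.T / n_mc

# det I(a, b) is zero whenever a = 0 or b = 0 (the Fisher degeneracy locus)
# and strictly positive away from it.
for a, b in [(0.5, 0.8), (0.0, 0.8), (0.5, 0.0), (0.0, 0.0)]:
    I = fisher_info(a, b)
    print(f"a={a:+.1f}, b={b:+.1f}:  det I = {np.linalg.det(I):.3e}")
```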
The Fisher degeneracy locus is the set of parameters where the Fisher information matrix is degenerate; at finite n the log loss function or posterior can place non-trivial density on it. The true parameter may not lie on this set, but with increasing n the model will eventually look regular. The origin is a singular point of the locus, and a true parameter placed there would have the lowest RLCT. A stochastic process called psi is introduced to study the behavior of Kn, a normalized version of the log loss. This process is centered around the mean of Kn, and its numerator converges to a normal distribution N(0, sigma^2(w)) by the central limit theorem.
Main Formulas One and Two together give the free energy of the system. Main Formula Two expresses the free energy as an asymptotic expansion of an integral containing a deterministic and a stochastic term. Main Formula One uses the resolution of singularities theorem from algebraic geometry to factor the zeros out of the log likelihood ratio, whose zero set is a complicated multi-variable, non-isolated set, putting the K function into normal crossing form. This defines a random process everywhere on the blown-up manifold, with second moment equal to a constant on the pullback of the set of true parameters.
K and L are the two population quantities behind the log loss: working with K amounts to shifting the function so that it is zero at the true parameter, and near a regular true parameter K is just the second order part of the Taylor series expansion, while L is the measurable quantity obtained from experiments and is never zero. Adapted coordinates divide a neighborhood of a singularity into two-to-the-n local charts by factoring out the zeros, illustrated by a picture of a circle divided into charts; these charts are where the model is most singular and they supply the leading contributions to the free energy formula in multidimensional models. The main theorem involves extending a function to the complex plane and showing that the fluctuation process converges to a Gaussian process. The square root of K is introduced to make the variance constant on the pullback of the set of true parameters.
A statistical learning process is used to produce an estimate that approximates the truth generating the data. This is done by factoring the process into two steps: a model with a prior, and a log loss function. The log loss function is used to reach an estimate, p hat, for example via minimization. The talk aims to introduce and motivate Main Formula One as a tool for studying the effect of random sampling of the training data, and outlines a proof of it.
Inference methods such as maximum likelihood estimation and the Bayesian predictive distribution both involve the log loss, so the role of statistical learning theory is to clarify the behavior of this function. For the normalized version of the log loss, the law of large numbers implies that it converges to its mean, and the central limit theorem states that if it is centered and scaled appropriately it converges in distribution to a normal distribution. The variance of this distribution is a function of w, equal to the variance of the terms making up the log loss. When the true parameters are substituted in, this expression becomes zero.
The likelihood function is a measurable function of the training sample and is thus a random variable. It fluctuates with each training sample and is illustrated using a normal-distribution model. The top series of images shows the case where the training sample size is 20, and each image is a different draw of the set of 20 training samples. The darker and lighter parts of the images are the contours of the likelihood function, and the diamond marks the true parameter. The maximum likelihood estimate (MLE) can be highlighted using crosses.
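A rough sketch of how figures like these could be reproduced (my reconstruction, not the speaker's code; the true parameter, grid and sample size are illustrative choices): draw a training set from N(mu0, sigma0^2), contour the log likelihood over (mu, sigma), and repeat for several draws.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu0, sigma0 = 0.0, 1.0                 # true parameter (the diamond in the plots)
n = 20                                 # training sample size
rng = np.random.default_rng(1)

mus = np.linspace(-1.0, 1.0, 200)
sigmas = np.linspace(0.3, 2.0, 200)
MU, SIGMA = np.meshgrid(mus, sigmas)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax in axes:
    x = rng.normal(mu0, sigma0, size=n)                     # a fresh draw of the training set
    loglik = sum(norm.logpdf(xi, MU, SIGMA) for xi in x)    # log likelihood on the grid
    ax.contourf(MU, SIGMA, loglik, levels=30)
    ax.plot(x.mean(), x.std(), "rx")                        # MLE for the normal model (a cross)
    ax.plot(mu0, sigma0, "wd")                              # true parameter (a diamond)
plt.show()
```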
A regular model illustrates that as the sample size increases, the distribution of parameters around the true parameter becomes tighter and more Gaussian. These properties are known as asymptotic consistency, normality and efficiency. In a singular model, such as a tanh regression, there is a mirror symmetry across a line in parameter space. The measurement noise follows a normal distribution with variance 1. As the sample size increases, the true parameter becomes the most likely parameter and the distribution around it becomes tighter and more Gaussian.
Because tanh is an odd function, the model is symmetric under the transformation (a, b) → (-a, -b). This means it is not identifiable and hence not a regular model, and the likelihood has two peaks even asymptotically, so it is not Gaussian. Going from left to right, the training sample size increases from 500 to 5,000, making the right hand distributions tighter, and going from top to bottom the true parameter moves closer to the axes. On the bottom-most plot the true parameter is on the axes, where the determinant of the Fisher information matrix is zero, or equivalently the matrix fails to be positive definite. Above the bottom row the model is not identifiable and therefore not regular, but the Fisher information matrix is still positive definite near the true parameter; the bottom row is the truly singular model, where in addition there is no local quadratic approximation near the true parameter.
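A tiny numerical check (mine, not from the seminar) of the two symptoms described above, using y = a·tanh(bx) + N(0, 1): the log likelihood is invariant under (a, b) → (-a, -b), and on the axis a = 0 the parameter b is not identifiable at all.

```python
import numpy as np

rng = np.random.default_rng(0)
a0, b0 = 0.8, 2.0                                        # true parameter
x = rng.standard_normal(500)
y = a0 * np.tanh(b0 * x) + rng.standard_normal(500)      # unit-variance measurement noise

def neg_log_lik(a, b):
    # negative log likelihood of y given x under y = a*tanh(b*x) + N(0, 1), up to a constant
    r = y - a * np.tanh(b * x)
    return 0.5 * np.sum(r ** 2)

print(neg_log_lik(a0, b0), neg_log_lik(-a0, -b0))    # equal: mirror symmetry
print(neg_log_lik(0.0, 1.0), neg_log_lik(0.0, -5.0)) # equal: a*tanh(b*x) = 0 whenever a = 0
```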
The Fisher information matrix is related to the identifiability and regularity of a model, and it depends only on the model and not on the truth. Near a true parameter the KL divergence looks, to second order, like the Fisher information matrix, so by showing where the matrix is degenerate Edmund is pointing out that if the true parameter is placed there, the KL divergence will not have a local quadratic approximation. Even if the true parameter does not lie on the set where the Fisher information matrix is degenerate, it is still massively impacted by that set, which is what the plots are meant to illustrate.
The Fisher degeneracy locus is the set of parameters where the Fisher information matrix is degenerate; at a finite n, the log loss function or posterior can place non-trivial density on it. The true parameter may not be on the set, but if n is large enough the model will look like a regular model. The origin is a singular point of this locus, and a true parameter placed there would have the lowest RLCT. This is similar to the phase transition story, in which a level set with non-zero L but lower RLCT can still interfere with the true parameter.
The true parameter is w0, and the level set through the origin is a different level set of L with a lower RLCT. Even though L favors w0 over 0, the stochastic version Ln depends on the data set that is drawn and can sometimes favor 0. This influences the free energy, which involves the training sample, so the fluctuations are important. The two contributing factors are that Ln is stochastic, and that even for large n there can be a regime in which 0 is favored because of its lower RLCT.
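One heuristic way to make this competition concrete (my gloss, not a formula stated in the talk): for a region W_alpha of parameter space around a point w_alpha with RLCT lambda_alpha, the local free energy behaves roughly like

$$
F_n(W_\alpha) \;\approx\; n\,L_n(w_\alpha) \;+\; \lambda_\alpha \log n,
\qquad
n\,L_n(w_\alpha) = n\,L(w_\alpha) + O_p(\sqrt{n}),
$$

so a region around the origin with larger L but smaller lambda can win for moderate n, or for an unlucky draw of the training set in which the O_p(√n) fluctuation swings its way; this is the balance between the n Ln term and the lambda log n term mentioned at the end of the discussion.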
A stochastic process called psi is introduced to study the behavior of Kn, which is a normalized version of the log loss. This process is centered around the mean of Kn and normalized by a term chosen to make the variance at w0 constant. The numerator of this process converges to a normal distribution N(0, sigma^2(w)) by the central limit theorem.
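A plausible reconstruction of the process, following the grey book's conventions in the realizable case:

$$
\psi_n(w) \;=\; \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{K(w) - f(X_i, w)}{\sqrt{K(w)}},
\qquad\text{so that}\qquad
n K_n(w) \;=\; n K(w) \;-\; \sqrt{n}\,\sqrt{K(w)}\,\psi_n(w).
$$

The numerator alone satisfies the central limit theorem with variance sigma^2(w); dividing by the square root of K(w) standardizes the variance near the set of true parameters, at the cost of leaving psi_n undefined where K(w) = 0.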
Main Formulas One and Two are used to calculate the free energy of the system. Main Formula One rewrites the normalized log loss as a deterministic term plus a stochastic term; the stochastic term converges to a Gaussian process, but only away from the set of true parameters. Main Formula Two uses the resolution of singularities to give an asymptotic expansion of the relevant integral, and substituting the standard form from Main Formula One into it yields the free energy.
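Schematically (a reconstruction rather than a verbatim formula from the talk), the integral in question is the normalized evidence

$$
Z_n^{0} \;=\; \int_{W} e^{-n K_n(w)}\,\varphi(w)\,dw,
\qquad
F_n \;=\; -\log Z_n \;=\; n L_n(w_0) \;-\; \log Z_n^{0},
$$

and substituting the standard form of n Kn from Main Formula One into this integral is what ultimately yields an asymptotic free energy of the form F_n = n L_n(w_0) + λ log n − (m − 1) log log n + O_p(1), with λ the RLCT and m its multiplicity.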
Using the resolution of singularities theorem from algebraic geometry, it is possible to factor zeros out of a function of several variables whose zero set is non-isolated, in the same way that zeros are factored out of a holomorphic function. Given a model p(x|w), with w in a compact space of parameters, a resolution map resolves an epsilon neighbourhood of the set of true parameters so that K is normal crossing in every local chart. The same map is used to factor the zeros out of the log likelihood ratio, leaving a real analytic function; the mean of the log likelihood ratio over the truth is exactly the K function.
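The one-variable analogy used in the talk, together with its multi-variable counterpart, written out (a sketch in the grey book's notation):

$$
\sin z = z\,h(z),\quad h \text{ holomorphic},\ h(0) = 1,
\qquad\longleftrightarrow\qquad
K(g(u)) = u^{2k} = u_1^{2k_1} \cdots u_d^{2k_d}
$$

in every local chart of the resolved space, and correspondingly for the log likelihood ratio

$$
f(x, g(u)) = a(x, u)\,u^{k},
\qquad
a(x, u) = \sum_{\alpha} a_\alpha(x)\,u^{\alpha}\ \text{ real analytic, with coefficients } a_\alpha \in L^s(q),
$$

so that taking the mean over the truth gives E_X[a(X, u)]·u^k = K(g(u)) = u^{2k}, i.e. E_X[a(X, u)] = u^k.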
To prove Main Formula One, a resolution of singularities is used to obtain a map g such that K(g(u)) is normal crossing, equal to u to the 2k, in every local chart. The log likelihood ratio then factors as f(x, g(u)) = a(x, u) u^k, and summing over the training sample expresses n Kn(g(u)) in terms of u^{2k}, u^k and a stochastic part. This random process is defined everywhere on the blown-up manifold, and its second moment is a constant at points u0 in the pullback of the set of true parameters.
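Putting the pieces together, the standard form asserted by Main Formula One on each local chart is, in a plausible reconstruction of the notation:

$$
n K_n(g(u)) \;=\; n\,u^{2k} \;-\; \sqrt{n}\;u^{k}\,\xi_n(u),
\qquad
\xi_n(u) \;=\; \frac{1}{\sqrt{n}} \sum_{i=1}^n \bigl(u^{k} - a(X_i, u)\bigr),
$$

where xi_n has mean zero, involves no division by K, and is therefore defined everywhere on the blown-up manifold, with constant second moment at points u0 in the pullback of the set of true parameters.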
The main theorem to be proved next time involves extending a function to the complex plane. The remaining steps invoke results from empirical process theory to show that the fluctuation process converges to a Gaussian process, characterizing the deterministic as well as the stochastic behavior of Kn in a neighborhood of the true parameter. The square root of K is introduced to make the variance constant on the pullback of the set of true parameters; this standardization really only matters at u0 and w0.
K and L come up when trying to explain these quantities to engineering and deep learning people, who in practice work with L. Focusing on K amounts to shifting the function so that it is zero at the true parameter, discarding the zeroth order part of the Taylor series expansion; near a regular true parameter K is just the second order part. L is a measurable quantity that can be obtained from experiments, and L and Ln are never zero, which raises the question of why the normalization in the denominator is not done with L instead. The focus on K is acceptable to engineering and deep learning people, but a good intuition for the process obtained by dividing by the square root of K is less clear.
Adapted coordinates are a way of dividing a neighborhood of a singularity into two-to-the-n parts. This is done by factoring out the zeros and choosing a preferred local chart after blowing up, illustrated by a picture of a circle which is further divided up into the charts. These charts are where the model is most singular, and in multidimensional models they appear as a sum over charts in the free energy formula.
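As a concrete toy example (mine, not the speaker's): for the tanh regression with true parameter at the origin, a·tanh(bx) ≈ abx near the origin, so to leading order K(a, b) is proportional to (ab)^2, which is not normal crossing at the origin. Blowing up the origin gives the two charts (a, b) = (u1, u1·u2) and (a, b) = (u1·u2, u2), in which

$$
K \;\propto\; (ab)^2 \;=\; u_1^{4} u_2^{2}
\qquad\text{or}\qquad
K \;\propto\; (ab)^2 \;=\; u_1^{2} u_2^{4},
$$

respectively, which is the normal crossing form u^{2k}. Splitting each chart further by the signs of the coordinates is one way to read the two-to-the-n pieces mentioned above.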
so welcome everyone um today this talk will i'm going to try to introduce and motivate uh what otanabe calls in his gradebook uh main formula one uh which is this formula written in orange written in brown here so this is going to be a kind of a softer talk to introduce and with more illustration and motivation than actual technical details and it's it's it might be uh i might show some questions to to the audience about um exposition and how how these things are presented um because there are some points in this story that i don't really have a good idea of why it is introduced this way the ultimate result is um undoubtedly useful but um the path towards it is perhaps um not as clear to me as um as i should i would like to like it to be all right um so let's start so um we'll go through some illustrate illustration of the effects of um what random sampling of the training data due to the log loss function and to the posterior and therefore to the posterior we will motivate the need for main formula one because of that kind because of the intention our intention to study the random sampling and then we'll outline a proof domain formula one um so just um some brief uh recap uh so in statistical learning theory we are learning we are studying systems um systems like like a machine or our brain that perceives um that perceives in some sense by a perceived data generating process so so these are the data first things with which we denote q um these are um generating process with stochasticity in there so it is a distribution so these are some examples of images with labels and the handwritten digits of labels and so on the perception would be a bunch of datasets generated by a bunch of data points generated by um the data generating process um and let's go to the next part oh i was an artificial speaker let me do that [Music] okay so um so we have the truth generating data um and our goal is to have a statistical learning process learning that produce an estimate that approximates ah that approximates the truth right so our way of doing this is um again via a parametrized model so this process here um this arrow here is um kind of factored into two steps which is uh we have a model with a prior so each point in this parameter space is model and this defines um a quite an important quantity called a log loss function let's recall its form again oops connective and some of the ways that we can get from the log loss function to p hat our goal is say via minimization
or taking averages so that's um the minimization is using the maximum likelihood estimate and taking averages is the bayesian predictive model that's that's saying that let p hat x equal to the um the bayesian average division average over basic average of a likelihood function over the posterior okay um right so the the the role of physical learning here is the role theory from this picture is to clarify the behavior of dysfunction because um as we see here the any inference method would be minimizing it or taking the patient posterior involves it just to remind ourselves that the posterior here is written in our preferred form is negative n so the likelihood that the log loss enters in that in that way okay so we first instead of studying ln we will study a kind of a centered version or a normalized version of it which is by subtracting its uh minimum where w naught is in the set of minima of um ln of the log loss or in the realizable case um this is um just um uh this is equal to uh the entropy of the truth okay um all right so immediately to to to study um this log loss function um or in our case to study the two study kn the normalized version we note that um just law of large number implies that um can will converge to if the mean exists converge to its mean right which is uh our familiar k function yeah um so there is um already a kind of expositional uh question here which is what happens in the unrealizable case i um so this this quantity here um this term here will just become p of x of w naught and that q of x remains uh in the exposition i'll go via that um [Music] okay right let me let's go to the next boat um the other thing that we can immediately see is using the central limit theorem central limit theorem says that um if you scale kn appropriately and centralize it separate kn centralize it with its mean and scalable root and this converge in distribution to a normal distribution um let me say that way first and and see what the what the variance of this distribution is well if you go through the central limit theorem calculation the variance is just the variance of um the the terms making out k uh kn which is this expression here so that's the term making up um k n and subtract x so that's that's the expression for the variance so and it's a it's a function of w um so what happens let's see what happens when we substitute in w naught or anything that makes uh p x of w not q of x so um the set of true parameters well we we get uh this thing becomes zero in that case
and that thing becomes zero in that case and um well the whole uh so that's zero and well that means that it's actually uh undefined um as a normal distribution at the uh in the set of true parameters um so this is really just um this convergence is a pointwise convergence so this is for each this only holds for each um w in the set away from the two parameters we can actually do better using um stochastic process theory i think there's a theorem called glyvanko cantileverum that makes this convergence uniform in w but still away from w naught away from the set of two parameters um okay let's before we go too far with um building this theory let's go and look at some illustrations of how this uh works in some specific example let's go to some of the plots ah let's go to the plots with a bunch of them the order is not coming okay cool right um i think uh yep i think everybody's here all right so um uh that's so so this is an illustration of a particular um kind of um model in this case it is uh just data generated from a normal distribution so um q of x oops a black so q of x um in this case is just um well it's just a normal with a particular mean uh with the true mean and true sigma and the the the class of model oops the set of model is also i'm committing the same kind of terminology thing as people usually did with like using model in multiple different ways but anyway what i wrote is what i meant that okay so um and the top part here the top series of images are the case where n is equal to 20. so the training symbol size is 20 um and each different image uh images each four four of these images are um different draw of the set of 20 training samples and what you see product here is the control of the likelihood function so and if you want if you mentally flip the meaning of the darker and lighter part this is um somewhat like the control of the log loss function but this is the this is the likelihood right and um the diamond here you see is the true parameter it's uh well it's w naught which is equal to mu not and sigma not in our case um okay so some observations right so throughout the four um different plots um it is clear that um the likelihood function fluctuates it changes with training sample this is what we meant by the likelihood function being a random variable well it is the measurable function of the of the training sample hence a random variable [Music] and so you can see that if i can use crosses to highlight the mle estimate if in each of these cases
and you can see that it is not exactly at the truth um but we do have if we look at the next plot oh sorry uh um sorry the top plot is n equal to 50. that's the that's the one with higher and you can see that because um [Music] because it has tighter distribution around the true parameter right and that's kind of the point i want to illustrate which is as n gets larger and larger the the fluctuation around the truth in this case in this particular case which is a regular model um it concentrates around the true parameter so these this picture illustrated the most likely parameter is not the true parameter but as n goes to infinity the true parameter becomes the most likely parameter and this property is called asymptotic consistency and furthermore as n becomes larger and larger the distribution becomes more and more like gaussian so um and this this property is called asymptotic normality and furthermore because of regularity we can we can prove that the fluctuation um around so uh the distribution of uh of parameters around the likelihood as around the around the true parameter is as tight as possible so so this this uh width is as tight as possible among all other possible um consistent estimators a consistent prediction of the true parameter and that's called that's a property called asymptotic efficiency and and these three these properties are very much related to so the asymptotic consistency normality and efficiency these are um very much related to the fact that in regular model we have um local quadratic approximation so k of w is locally approximated by a quadratic form in fact it's sorry w w transpose uh once you centralize um around the two parameters okay so um this is a regular model and now let's go to uh a singular model next example which is let's walk to the left okay so um this model i will write it somewhere on uh on the picture this model is a regression model of a 10 a 10 hedge regression right so so um the independent the dependent variable is y and the independent variable is x and a and b are the parameters and uh we're assuming that the residue is um this is a normal um is so the the noise of measurements of this process which is which uh which follow the law of y equal to a10 h bx um the measurement noise is distributed as a normal distribution almost with variance one one all right um so some observation right this is um if you look at uh the fact that there is a symmetry across across this line um there's a reflect uh mirror symmetry
across that line that's because um this equation here well the 10 is an odd function right so therefore it is symmetric under the transformation a b goes to negative a negative b um and this immediately says that this is not a regular model anymore it is not identifiable right um so so which means that these these distribution um is definitely not gaussian even asymptotically because it has um it has two mode right um so it has two peak even even asymptotically so it is definitely not a gaussian but um sorry again the square is the true parameter but locally around the true parameter um asymptotically you can if you just focus on that part the a gaussian approximation is uh is still um is still good right because locally um the the facial information matrix there or the k w function that is not um the hessian is not degenerate and therefore quadratic approximation near the true parameter in this case is still uh good so i should probably say that going from uh let's divide this picture up into two parts um going from the left to um picture to the right to picture the thing that's changed is that the the the training sample size um goes from 500 to 5 000 so you will see that the right hand uh distributions are tighter and going from top to bottom the true parameter moves closer and closer to the axes and on the bottom most plot the true parameter is on the axis well it's on the set a b equal to zero and what's special about the axis is that in this um uh in this model if you calculate the uh facial information matrix a b equals zero is precisely the set um where the determinant of the um fischer information matrix becomes zero or when it is where the fusion of information matrix fails to be positive definite those two are equivalent statements um uh here's a here's a expositional question for uh for people uh which is that we don't really have a name uh for this set um we we have a name w naught for the set of two parameters um and that coincides with um this set when the model is actually singular so be careful in this picture that everything above the bottom row everything above the the bottom row is actually not um [Music] it's actually not actually not regular model so it's actually have a positive definition information matrix near w0 except except that they are not identifiable hence they are not regular but the bottom row is the truly singular model and what distinguished those those two cases is that in in in the bottom loop bottom row uh not only do we have an identifiability
well that non-identification becomes the whole set right the whole a b equal to zero becomes so when a b equal to zero um the whole 10h term disappear aob because your the whole time hashtag disappear and this whole axes represent the same distribution that's not an identifiability but even worse is that near the true parameter uh near any of this parameter there there are no local quadratic approximation so even locally uh qualitative approximation or gaussian approximation to the posterior valves maybe there's one thing i think is helpful to remind people which is that uh i depends only on p on the model and not on the truth as opposed to k which depends on the truth so maybe you said that already but no no i haven't but that's that's exactly the fine yeah yeah so by by focusing on where the fischer information matrix is degenerate uh what edmund is doing is saying so two facts so remember that near a true parameter the kl divergence locally looks to second order like the fischer information matrix so what edmund is doing by showing you where the fischer information matrix is degenerate is saying if you place the true parameter there then k won't have locally a quadratic approximation exactly because the second order will look like this matrix and you've put a point where that matrix is degenerate so there's a um which is i guess sometimes we're often starting with k and thinking about k where you're sort of starting with the the model and the fischer information matrix and then thinking about different points for the true parameter maybe that's something that may not be immediately clear to people with that focus yeah that's uh thanks for clarifying that yeah in fact um i'm so this is kind of a either a theoretical question or uh it's just a exposition ease question which is um uh so so why is that the so so if we if we start with um the zero set of k and um uh and and uh and the exposition go via uh studying singular model which is uh where k is uh uh degenerate has this degenerate hashing um i think uh just uh uh people who are not familiar with um single learning theory might object that that's a set of measure zero and um why do we care right um uh but these these plots are meant to illustrate that even if your true parameter does not lie on that set where um the facial information matrix is degenerate um you're you are still massively impacted by the fact that there is a non-tribute set where the official information matrix is degenerate right so let's look at for example the behavior
out here is that even though um the true parameter is not on the set on the so i think we need a name for that set i'm going to call it the singular set or something uh for now it's it's not on a b equal to zero um but um at finite end even at um 5000 n uh a 5000 training sample um so we know that as if n is big enough it it's going to look like um uh uh looks like it looks like i think it is far far away yep sorry interrupt you i think we should call it something like fissure degeneracy locus or something like that if we just refer to it as the singular set since singular set of critical set usually refers to a function it'll be very tempting to think about it as being associated to k so i think it should be something like fischer degeneracy locus let's say bishop degeneres degeneracy locus okay degeneracy um right or be generously focused if i'm in a hurry right got it um okay so even if uh so um so this is this this may or may not be what what meant by um uh meant by uh so what is it called again a discovery process um which is um n becomes um really large um if your parameter is not on the degeneracy set um it's it's going to look like um it's going to look like uh uh locally it's going to look like a regular model but if if n is small then um uh then the the fluctuation of the log loss function or the posterior is going to big enough that it's going to have a very non-trivial mass non-trivial density on the degeneracy set in which case um these uh these parts so the uh the the point uh a b equal to zero zero the origin is um is a singular point of um the uh is a singular point of the fischer degeneracy locus and that's where the rrct is um is if that's that's why things are more singular and the rct is the largest weight uh i think that was a bit confusing the way you said it that point is on the fischer degeneracy locus and if the true parameter were there then the rlct would be lowest is that what you mean um when the when the true parameter is over where it is uh then i guess that's actually a regular model right so w0 is two points maybe uh yeah i think i'm trying to connect this story uh so this is um this might be not a not not a fully fleshed out story yet but um this this is very much like the case where um where we're talking about uh you know phase transition story which is uh which we still have uh uh non-zero l uh l at the the likelihood of um [Music] it's non-zero oh i see you're saying yeah that the degeneracy the official degeneracy locus
that is the union of the axes is not so that's some other level set but that level set has uh a lower rlct at that point and therefore exactly that phase can still interfere with the sort of true parameter which yeah okay yeah it can still interfere with the level set um of the transfer and if and and and clearly in so in these plots clearly some some sampling there is there is a non-trivial amount of probability that we sample a particular training set that favors um the the singularity more than it favors the actual true parameter uh that's interesting i hadn't thought to make this point using what we've been discussing lately yeah that's that's interesting do you mind if i put something on my personal board here and just illustrate this in case it's not clear to everybody so um so what's uh let's make this point actually the origin so this is w so that's the origin and then that's actually the true parameter um you want to label that on your diagram let's uh is that w zero uh well the true answer is a not b naught but yes w zero is fine yeah okay so i'll just put that on the plot so there's w0 and there's there's zero so that's w0 and edmund's pointing out that uh even though l favors the true parameter of course so we have something that looks like this maybe uh let's correct me if you think this is wrong but uh that's correct so it's been saying that ln should and does favor the true parameter uh but ln isn't in charge completely of where the posterior concentrates because there's also the question of the rlct so maybe i'll just completely i i think a bit more precisely i should say that it's l um l that favors um the favors w0 more than it does favors zero but ln which is the stochastic version um depends on what kind of um what what what data set is chosen right so there are times where we happen to choose a data set that for which ln actually favors um zero um and and that's that's influencing um well that's influencing the um the uh what is it the free energy right because the free energy is taking average over over the the training sample hence hence yeah but hence the fluctuation is very important yeah yeah but favoring the origin is kind of begging the question a little bit right so why does it favor the origin well uh okay there's uh this i suppose there's two contributing factors isn't it one is simply that it's stochastic so who knows maybe it ends up favoring it and the other is that it may be that even even for quite large n there's a regime
where it favors sampling near zero simply because the rlct here is much less and yep i don't know if it's clear in my head exactly but um suppose the rlct of that level set is much flatter as i've drawn it and you can imagine that even for mid-range n i don't know what that means you could still have samples that prefer to concentrate around a zero rather than around the true parameter um sort of consistently i mean i don't know how to disentangle that from like what other reason could there be for the samples to make they could just happen to be concentrated around zero whatever that means but there's that's one reason why they might do it yeah and and i guess the free energy formula is uh saying uh what happened on sorry the free energy formula itself is um it's it's not taking average over all possible um or possible data set but then there is a fluctuation term and i think um uh the the story here that it still remains a story without proof is that the fluctuation term is accounting for the fact that sometimes we happen to [Music] sample things that favors um the the singular case um and it and that that fluctuation the size of the fluctuation whether or not that takes over depends on the balance between um the yeah between the n ellington and the lambda logan term all right yeah yeah cool thanks this was a very illustrative discussion i think okay um cool cool right so uh okay let's go back to the board um and so i'm hoping that people are uh satisfactorily motivated to study uh the behavior of of kn uh to study behavior and now we are going to get into uh name formula one which is uh which gives us a handle um to the behavior of kn which is a normalized version of the lot loss so to study the behavior of that we introduce a stochastic process that is centered and normalized we call it psi so it is given by this formula so it's kn centered around its mean um so don't um the expression is not completed yet but this part remember um this part converges to the normal zero uh sigma of omega that so that that that's my central limit theorem but then we normalize it by this term and um this um i the only reason i i think this this is um not not the only reason but the main reason this is introduced i think is to make um the the variance um um at uh w not a constant uh so so that's um i need to go ahead and um and actually do the calculation and prove this but uh so recall that the the the numerator term um to a normal distribution with with some um variance that depends on w but um
if so in the proof of main formula one we will see that introducing this term will just divide that variance out and make it a constant at least on on at least on w0 on the set of two parameters actually it's in in the pullback of w0 by the resolution map um right but um also these terms also makes it um makes this process undefined where so so this is undefined at klw equal to zero which is um which is the set of two parameters okay so there is a exposition um question here which is what if we just don't work with the um does the normalized version what is what if we just work with ln and and then we don't have this divided by zero problem but i suspect the problem will arise in the um in the variance uh as well um so this this makes me uh this is makes it clear that we are traversing a well-lit path and all the different branches has been like unproductive branches have been culled um i'm not sure yeah yeah i don't know yeah so um this yeah anyway um let's traverse this path for now and let's uh let's let's look at what this gives us um if we uh rewrite this um and see what how this massage kn into a convenient form we get this expression right so so these are all um these terms are all deterministic terms that and this term handles all the person stochasticity and um if we recall the um the proof uh the attendant proof of main formula two from last week this form is to be used um in the main formula two in formula two and just to remind people what that looks like um so the main formula two is that we get we we want an asymptotic expression for this integral okay dx device uh sorry for the pull about presentation but um and if you compare so so um the resolution of singularity is going to give us um this term here and the square root of that is going to give us this term so um and then that resolved into the c of x y um and then if we can prove that um if you can prove uh the main formula one and we can substitute it back into the um the current asymptotic we have for male familiar two and get uh uh and and get uh uh the the asymptotics for the free energy and get a free energy formula right so so this is one motivation for proving main formula one um right so we have that psi n converges to psi um so this uh normal this this central limit theorem convergence point wise convergence to a normal distribution can be strengthened into a convergence of psi to a gaussian process but only on only away from the set of true parameters however because physical learning theory is
mainly about discovering the truth we do need to know about the behavior of this process of this fluctuation near the set of true parameters okay okay let's go to the next spot to illustrate the main idea of doing that so the main idea to put it a bit colloquially is to use resolution of singularity the theorem from algebra geometry that is our favorite to colloquially factor out zeros of kfw this is in the same way we factor out zeros from holomorphic function for example sine we know that sine that on that um conv the limit of sunset on z as that goes to zero is one and that's saying that signs that is equal to z times g of z where g of that is holomorphic and non-zero non-zero locally near the origin um and still holomorphic we are going to do a similar operation to to this except we have we have multiple variables and we have a very complicated um non-isolated set of the zeros right so let's write down the let's write a statement for main formula 1 and see how this is related to the idea of vectoring out zeros so here oops that'll make kenneth happy he often does that [Laughter] that's not just me yep right uh so given a model p of x w um with w in a space of parameter which is subset of outer d and um uh and sorry i think in the proof it is needed that this is contained in an open set which is contained in r2d well it itself is compact and has a open neighborhood truth q of x which is realizable x given w naught for some w naught in the set of two parameters um in the zero set of k of w um satisfying [Music] some uh i'm going to be a cop out and say mouth condition um so this is a massive asterisk here um then there exists epsilon positive real number such that the epsilon neighborhood um of the set of true parameters and i think this is defined inside the open set um um then this set this excellent neighborhood can be resolved by a resolution map so that in every local chat um so what is that local chat k uh composed g um is uh this normal crossing uh for sum k uh so for some uh multi index okay um n so this is where the zero get vector out and using the same um resolution chart f of x w sorry i haven't definitely accepted haven't i so f of x w is just the log likelihood ratio it's that term and notice that f of x w q of x the mean over the truth is equal is the k function that we want so using the same resolution map this term becomes we can factor out um the zeros and that is uh uh that is a ls cubed uh valued uh real analytic function concretely that means that
it has we have a power series expansion um with coefficient sorry this x u to the alpha so power series expansion in u and with coefficient in um in lsq uh lsq is the space of um non-s integrable function um right um and then finally using this property we can define uh we can get um an expression for k n in the block in the resolved space result parameter space as u to the 2 k then u to the k c you um where where this stochastic part is defined everywhere on uh on this blow up space and is defined as is defined using just the factored part of f is equal to that expression okay so let's go back to the first part and we will just finish up with an outline of the proof with the steps that's needed for the proof of main formula one um so unknown proof is to first um just get the resolution of singularity resolution of singularity for k and this is um uh wait at uh uh at the absolute neighborhood um of w node um which we call w epsilon uh just now this is guaranteed by veronica resolution of singularity and its extent and the diversion for analytic function right and and this gives us the uh the map the resolution map such that k g of u in every local chart is is normal crossing right step two um and that's kind of the core step well beyond the resolution and its theorem 6.1 in the gradebook is that we show that this expression let's define ax u to be equal to the expression and show that this expression is um analytic right um and a crucial crucial step in here is by realizing that this expression is is equal to k g of u which is equal to u to the 2k right and once we have that um we will just define the random process we have before sorry these are some of us and so this is this can be taken as a definition once we have a and observe that if i multiply this um multiply it by using the k root then so the root n disappears and i get u to the 2k u to the k uh well that's just f of x i of u and that's that's k g of u so so this is equal to um n lots of kg of you because of the summation uh minus um and lots of um uh the kn function right because um that's the sum of um 10 is the average of the fxiu function all right um and then we need to show that this random process um is defined everywhere and everywhere on the blog manifold and the crucial step in that is to show that um a x i'm going to write u naught um the second moment is uh is if 2 is a constant right and u naught is in the pullback of the set of view our true parameter um okay so that's that's the outline of the
proof that we will go through uh next time and there are some a few to do which is uh the main thing is to prove main theorem two um next time and also to illustrate where the mild conditions of of this proof comes in it's called fundamental condition one in the grade book it involves extending um this f uh function to uh to co to the complex plane um i am still not very clear why that is strictly necessary um and yeah and then finally we do need to invoke some um uh results from empire empirical process theory or just stochastic process theory to to show that these things do in fact converge to a gaussian process and that's um that completely characterized um like the behavior of the determinant the deterministic behavior as well as the uh stochastic behavior of kn um in a neighborhood around the trooper and right that's it for today um uh and any question um yeah i guess i also don't know the best way to communicate what's going on with uh with c here and what's your intuition about dividing by the square root of k i mean i guess i understand where it comes from and say the central limit theorem for example um yeah well the um i actually cannot so uh in the central limit theorem we can get an expression for for for the for the variance right in terms of f and and uh and the parameter but i i don't think that the square root of um i don't think the standard deviation is exactly square root of um k um maybe that is true in the set of two parameters and that's that's the only reason he introduced it that way because um we we really just care about um what happened on this uh we just want one c to be defined on that side um everywhere else we are we are fine um so that might be the reason and also the the place where it show up is is basically um in this in this integral here right that's um that's where that's that's where we uh divide it out and exactly make uh the variance constant on the pullback of the set of drum triangles um can you say that again what in what sense have we made the variance constant oh because that yeah yeah we don't we don't want it to have um to have dependence on u or w anymore um for uh one it may makes it very clear that um this stochastic process is defined uh on u naught um and and hence on w naught um uh yeah i think it's kind of a standardization um term but only for uh w naught and u naught that's that's my intuition at the moment that's my guess at the moment that's the only reason yeah i think i don't really even know how to
so if i set myself the exercise of explaining to a deep learning person what c means in terms of trajectories or something i think i wouldn't be able to do a good job of it i can technically understand the role it plays but um you know like k n minus k is something you could explain to somebody right in terms of you know it's uh yeah that's kind of the first exposition question i ask like why like even that is a bit suspect which is like why like i think a deep learning person might ask why don't just work with l right let's work with the log laws that is that is the measurable thing and um that is the thing that we can get from experiments and uh they only like in their experiments and and most of the quantity they compute they only use l um and as i say like introducing k gives us the problem of dividing by a zero when we introduce the root k of k of w and and why why did we do that [Music] um well uh let's see hmm yeah fair enough i guess again if you're working near a true parameter k is the thing that sort of just has the second order part from the taylor series expansion right so [Music] i think it makes sense to talk about k in local coordinates rather than l um [Music] right just sort of getting rid of the zeroth part of the taylor series expansion right i mean that's that's a normal thing you do when you're studying functions right when you talk about singularities of a function usually you just shift the function so that it's zero at the origin that's so that's all i thought about the focus on k um definitely uh i uh i think i think even divinity people who um sort of engineering people would accept um that focus on k um the the thing that i'm not um sure is that it it it's actually the definition of c oh sorry offside um [Music] because like uh we um the the the denominator there um so why not just take uh you know ln minus l um and then and then normalize that somehow because l is never zero um uh is the neighbor zero actually we don't know that actually l can be zero yeah in fact l can be uh cannot be negative wait no no no no i'll come here yeah right yeah even l and can't be negative right because those are all positive terms because those are all probabilities [Music] wait that that does mean that ln is never zero and l is never zero it's a thumbnail it doesn't strike me as the main thing to worry about um i guess i'm thinking about in the regular case when you when you look at when you look at the what psi is in the regular case you somehow it makes a lot
of sense to divide through by u or square root k because the what you're looking at is a kind of dot product and then you're just getting the actual interesting part by dividing through by you i i couldn't i couldn't figure out a so in the regular case it's very clear that's sensible and what that means in a singular case i think i still don't really have a good intuition for well anyway um i i should i should do an exercise with um with a regular is there a singular case for that uh yeah right there can be a singular case for a one dimensional uh model right um so uh yeah and yeah and then see what happens like we don't have this multivariable um things to be concerned about and just contrast the regular and singular case in that case and see yeah and and see where the square root um comes in yeah even just with a cubic or something um cubic well have a have an exponential model where you're just looking at the graph of x goes to x cubed rather than x goes to x squared which would be right which would be regular well thanks a lot evan that was that was really useful uh no worries um thanks for uploading the pictures uh yeah oh you didn't you didn't show us that other picture it's meant to be uh where where i show uh uh it's meant to be where i show the uh the idea of factoring out the zeros um but i i i meant to like write things on it but yeah that's okay yeah uh uh wait where's the blow up oh blah is on the other side um i meant to show that like uh let's walk to the outside um i meant to show that this part like so that's the the circle part is meant to be kind of a chart um a local chart sort of uh the preferred local chart um that we are we will be using um after blowing up and and it will it will look like um uh normal crossing like that and in fact um it's further further divided up into these charts yeah so in n dimension is divided out into two to the n each each each one each neighborhood of a singularity is divided up into two to the n um of those um uh yeah and and that's uh ultimately what uh what will happen in the sort of the center part of the free energy formula which is sum over the other star charts and and these are uh where when things are more singular yeah that's that's why i meant to show that cool yeah i think this picture is a really good one for illustrating this um choice of adapted coordinates or whatever it is calls them yeah wait it's not what is it adapt it couldn't it oh that's just that's just a standard term that gets used for choosing coordinates that are