WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed.
The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except
for errors at the level of individual words during transcription.
Bayesian neural networks are ensembles of networks that produce a distribution of predictions, giving robustness and error estimates. Variational inference is a technique for approximating a posterior distribution, and normalizing flows are generative deep learning architectures that transform simple distributions into complex ones. Experiments comparing a Gaussian base distribution with a generalized gamma base distribution in a push-forward neural network showed improved performance and a lower λ_vfe for the generalized gamma. In the singular case, the local approximation of the posterior around the true parameter is never Gaussian.
Bayesian neural networks (BNNs) place a posterior distribution over the weights of a fixed architecture rather than learning a single point estimate. This produces a distribution of predictions rather than a single output, which gives robustness and error estimates. Exact Bayesian inference in BNNs is not scalable because the required integrals are intractable; deep ensembling and MCMC methods help but still struggle in high dimensions. Deep ensembling approximates posterior averaging by averaging across high-posterior-likelihood guesses of the model parameters obtained from multiple training runs, while variational inference instead approximates the posterior density directly.
Variational inference approximates a distribution, usually the posterior, by finding the member of a variational family that minimises the KL divergence to it. This involves designing a good family of models and optimising the variational parameters to minimise the KL divergence between the family member and the posterior. The evidence lower bound (ELBO) is a related quantity: it is a lower bound on the log evidence, and the corresponding optimisation objective takes expectations over the approximating distribution rather than the intractable posterior.
The variational free energy, an objective derived from variational calculus, is minimised by gradient descent in order to approximate intractable posterior distributions; manipulating the free energy formula gives an expression whose minimisation is equivalent to minimising the KL divergence to the posterior (and to maximising the ELBO). Regularity of a model means there is a unique parameter in the parameter space minimising the difference between the true distribution and the parameterised distribution, with a non-degenerate quadratic approximation around it; singular learning theory is concerned with models where this fails.
A contour plot of the (unnormalised) posterior is shown for a tanh regression model with parameters a and b. When the true parameter W0 lies on the singular set, the model is non-identifiable there and the posterior is never Gaussian, even for large data set sizes. Variational problems for singular posteriors are therefore harder than in the regular case, where quadratic (Taylor) expansions around the true parameter justify a Gaussian approximation. Increasing the data set size concentrates the posterior around the true parameter, but in the singular case it never becomes Gaussian.
Singular learning theory states that the (Bayesian) free energy has an asymptotic expansion whose leading coefficient is the RLCT λ, together with a multiplicity term, and that the expected generalization error equals the increase in free energy, but only when the full posterior is used. When variational methods are used to approximate the posterior, the leading coefficient of the variational free energy (λ_vfe) and the leading coefficient of the variational generalization error (λ_BGE) need not be equal. λ_BGE can be smaller than λ_vfe, greater than λ_vfe, or lie in between λ and λ_vfe; in that middle case the approximation of the posterior is not as good, but the generalization error is not hurt.
Variational inference approximates the Bayesian posterior and can sometimes even lead to better prediction. The quality of the results depends on the variational family, and the gap between the accuracy of the variational approximation and the downstream generalization error can vary. In practice, the only reliable way to assess the accuracy of an approximate posterior is to draw MCMC samples and compare against them as ground truth. Test log likelihood is often used to measure generalization error in deep learning, but there is no clear relationship between the optimisation criterion (the ELBO) and this generalization error. Divergences other than KL can also be used to measure the gap between the true and approximate posterior.
Normalizing flows are generative deep learning architectures that use an invertible neural network to transform a simple base distribution into a complex one. A resolution map is an isomorphism away from measure-zero sets, so it is invertible almost everywhere; the idea is to learn a function G that pushes forward a base family of variational distributions (with parameters such as β and k) to a new family of variational distributions. SLT states that the K_n function has a standard (normal crossing) form, obtained by applying resolution of singularities and pulling back by the resolution map, and this is what motivates using learned maps to approximate that standard form.
Experiments compared a Gaussian base distribution with a generalized gamma base distribution in a push-forward neural network. The generalized gamma gave a performance boost and a lower λ_vfe. The generalization picture was trickier: one network had a better approximation to the posterior, but this did not necessarily mean it was better at prediction. In situations where the true distribution is singular, the local approximation of the posterior around the true parameter is never Gaussian.
Edmund is presenting joint work with Susan on Bayesian neural networks. Rather than learning a single set of weights for a fixed architecture, a Bayesian neural network works with an ensemble of networks, each with its own weights, so that instead of a single point estimate for an output given an input, an entire distribution of predictions is produced. The aim of the work is to construct a variational Bayesian neural network via resolution of singularities.
Bayesian neural networks are more informative than point estimates because each network's output is weighted by its posterior weight. The posterior is determined by the data, so networks that fit the data badly receive low posterior weight and contribute little to the overall prediction. Robustness is improved because a form of model selection is built in and the model is less likely to overfit. In addition, error estimates on the output value let the model express its confidence, so it is less likely to be confidently wrong.
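As a concrete illustration of this posterior-weighted ensemble idea, here is a minimal sketch (not from the talk) of Bayesian model averaging over Monte Carlo posterior samples; `weight_samples` and `network_fn` are hypothetical placeholders for whatever sampler and architecture are in use.

```python
import numpy as np

def bayes_ensemble_predict(x, weight_samples, network_fn):
    """Approximate the posterior predictive p(y | x, D) by averaging the
    predictions of many networks, one per posterior weight sample.

    weight_samples : iterable of weight vectors drawn from (an approximation to)
                     the posterior p(w | D), e.g. by MCMC or variational inference.
    network_fn     : function (x, w) -> predictive probabilities for input x
                     under the network with weights w.
    """
    preds = np.stack([network_fn(x, w) for w in weight_samples])  # shape (S, n_outputs)
    return preds.mean(axis=0), preds.std(axis=0)  # ensemble prediction and its spread

# A point estimate uses a single w_hat; the Bayesian ensemble instead reports the
# spread across samples, giving an uncertainty estimate alongside the prediction.
```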
Bayesian neural networks have been around for a long time but are not scalable because their integrals are intractable. Deep ensembling and MCMC methods do better but still struggle in high dimensions; deep ensembling approximates posterior averaging by averaging across high-posterior-likelihood guesses of the model parameters, obtained by training the network from multiple different initialisations. Variational inference is the alternative method considered here.
Variational inference approximates a distribution, usually the posterior, by finding the member of a variational family that minimises the KL divergence between the family member and the true distribution. The family must have easy-to-compute likelihoods, be easy to sample from, and be optimisable with first-order methods such as SGD. Predictions, uncertainty estimates and averaging can then be done with the variational approximation instead of the true distribution.
Optimising the variational parameters to minimise the KL divergence between the approximating family and the posterior is equivalent to maximising the evidence lower bound. This involves designing a good family of models and striking a balance between being easy to compute and being able to get close to the posterior. The setup has theoretical consequences that need to be considered when it is applied to neural networks.
The evidence lower bound (ELBO) is related to the posterior distribution, which can be written in the form (1/Z_n) e^{-n K_n(w)} φ(w). The negative ELBO is the variational free energy, and the ELBO is a lower bound on the log evidence because the variational free energy is always at least the true free energy, with equality when the approximating distribution equals the posterior. The optimisation objective involves taking expectations over the approximating distribution.
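Written out in the notation used in the talk (reconstructed here from the standard SLT definitions), the identity behind the "lower bound" statement is:

```latex
p(w \mid D_n) = \frac{1}{Z_n}\, e^{-n K_n(w)}\, \varphi(w),
\qquad
K_n(w) = \frac{1}{n}\sum_{i=1}^{n} \log\frac{p_{\mathrm{true}}(x_i)}{p(x_i \mid w)},
\qquad
F_n = -\log Z_n,
```

```latex
F_n^{\mathrm{vb}}(\theta)
 := \mathbb{E}_{q_\theta}\!\bigl[\, n K_n(w) \,\bigr] + \mathrm{KL}\bigl(q_\theta \,\|\, \varphi\bigr)
 \;=\; F_n + \mathrm{KL}\bigl(q_\theta \,\|\, p(w \mid D_n)\bigr) \;\ge\; F_n,
\qquad
\mathrm{ELBO}(\theta) = -F_n^{\mathrm{vb}}(\theta).
```

So maximising the ELBO over θ is the same as minimising KL(q_θ ‖ posterior), and the difference F_n^vb − F_n is the variational gap.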
Intractable posteriors are approximated by minimising the variational free energy (equivalently, maximising the ELBO) with gradient descent: at each step the gradient is estimated by approximating the required expectation with samples from the current approximating distribution. Manipulating the free energy formula gives an expression whose minimisation is equivalent to minimising the KL divergence to the posterior. If a member of the family equals the posterior exactly, the KL divergence is zero and the variational free energy equals the true free energy (the negative log evidence). The difference between the two is called the variational gap, and it is a property of the family of distributions used for the approximation.
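Here is a minimal sketch, not from the paper, of one reparameterised gradient step on the negative ELBO with a mean-field Gaussian family; `neg_log_joint` (returning −log p(D, w) for a weight vector w) is a hypothetical placeholder.

```python
import torch

def elbo_step(mu, log_sigma, neg_log_joint, optimizer, n_samples=8):
    """One gradient step maximizing the ELBO for q(w) = N(mu, diag(sigma^2)).

    neg_log_joint : differentiable function w -> -log p(D | w) - log prior(w).
    """
    optimizer.zero_grad()
    sigma = log_sigma.exp()
    eps = torch.randn(n_samples, mu.shape[0])          # reparameterisation trick
    w = mu + sigma * eps                               # samples from the current q
    q = torch.distributions.Normal(mu, sigma)
    # Monte Carlo estimate of -ELBO = E_q[-log p(D, w)] + E_q[log q(w)]
    neg_elbo = torch.stack([neg_log_joint(wi) for wi in w]).mean() \
               + q.log_prob(w).sum(dim=1).mean()
    neg_elbo.backward()
    optimizer.step()
    return neg_elbo.item()

# Usage (hypothetical):
# mu = torch.zeros(d, requires_grad=True); log_sigma = torch.zeros(d, requires_grad=True)
# opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)
# for _ in range(1000): elbo_step(mu, log_sigma, neg_log_joint, opt)
```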
Regularity of a model means there exists a unique parameter in the parameter space minimising the KL divergence between the true distribution and the parameterised distribution, with a non-degenerate quadratic form around it. Variational calculus involves differentiating a functional (an integral depending on a function) with respect to that function; the KL divergence is constructed so that for any distribution other than the posterior, the resulting quantity is strictly greater than the free energy. Singular learning theory is concerned with models where the minimising set is not a single well-behaved point.
A contour plot of the posterior for a regression model with parameters a and b is shown. The model is y = a tanh(bx) + Gaussian noise. For a regular model the maximum likelihood estimator converges to the true parameter, so a Gaussian variational family centred at the MLE, with covariance matched as far as possible (itself not easy in high dimension), is appropriate. The singular model is quite different: its set of singular points is the coordinate axes.
In the second case the true parameter W0 that generates the data lies on the singular set, so the model is not identifiable there. Even with a data set of size 5000 the posterior never looks Gaussian. The maximum likelihood estimate can be far from the true parameter, and the posterior is very degenerate. At low n the coordinate axes are a reasonable fit to the data, because a singular hypothesis has larger functional fluctuations and can cover more possibilities than a non-singular one.
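The non-Gaussian shape is easy to reproduce numerically. Below is a minimal sketch assuming the model y = a·tanh(bx) + N(0, 1) noise, a true parameter (a, b) = (0, 0) on the singular set, and a flat prior on the plotted region; the talk's exact setup may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-3, 3, size=n)
y = 0.0 * np.tanh(0.0 * x) + rng.normal(0, 1, size=n)   # data from the singular true parameter (0, 0)

a = np.linspace(-2, 2, 200)
b = np.linspace(-2, 2, 200)
A, B = np.meshgrid(a, b)

# unnormalised log posterior under a flat prior: -(1/2) * sum_i (y_i - a*tanh(b*x_i))^2
resid = y[None, None, :] - A[..., None] * np.tanh(B[..., None] * x[None, None, :])
log_post = -0.5 * np.sum(resid**2, axis=-1)

post = np.exp(log_post - log_post.max())
# Contour-plotting `post` over (A, B) shows mass spread along both coordinate axes
# rather than an ellipse, so no single Gaussian fits it well, even at larger n.
```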
Variational problems for singular posteriors are more complex than Gaussian approximations. In the regular case, as n increases the posterior concentrates around the true parameter and a Gaussian approximation, justified by a quadratic (Taylor) expansion, matches it increasingly well; the sharper the concentration, the better the approximation. In the singular case, however, the posterior is never Gaussian, no matter how large n becomes.
Singular learning theory states that the free energy has an asymptotic expansion with leading term λ log n and a multiplicity term. The generalization error is the KL divergence between the true distribution and the predictive distribution obtained by full Bayesian model averaging. The expectation of the generalization error equals the increase in free energy from n to n+1 samples, and this holds only when the full posterior is used, not any approximation to it.
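In symbols, the expansion and the free-energy/generalization-error relationship summarised here are (with F_n the normalised free energy, λ the RLCT and m its multiplicity):

```latex
F_n = \lambda \log n - (m - 1)\log\log n + O_p(1),
\qquad
\mathbb{E}[G_n] = \mathbb{E}[F_{n+1}] - \mathbb{E}[F_n] \approx \frac{\lambda}{n}.
```

The second identity is what fails once the full posterior is replaced by a variational approximation.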
Variational methods approximate the posterior in order to make predictions, but the relationship between the variational free energy and the generalization error is not the same as in the full Bayesian case. The leading coefficient of the variational free energy, λ_vfe, is not in general equal to the leading coefficient of the variational generalization error, λ_BGE. In a cartoon of the situation, λ_BGE can sit in one of three places: below λ (the variational approximation actually predicts better than the full posterior), between λ and λ_vfe, or above λ_vfe. If it sits in the middle, the posterior approximation is not as good as one might want, but the generalization error is not hurt; if it sits above λ_vfe, a small variational free energy gives no guarantee about generalization.
Variational inference approximates the Bayesian posterior, and depending on the variational family it can even lead to better prediction. Different families can have very different generalization behaviour even when their variational gaps are similar. People generally do not distinguish between the two kinds of effect: λ_vfe − λ measures the accuracy of the variational approximation (it is the leading coefficient of the variational gap), which is distinct from the coefficient of the downstream generalization error incurred when the approximate posterior is used for prediction.
Variational inference approximates a posterior distribution using a variational family. However, there is no clear relationship between the generalization error and the variational objective, and the only reliable way to assess the accuracy of the approximate posterior is to draw MCMC samples and compare against them as ground truth. This is not practical, since it requires the very time and resources variational inference is meant to avoid.
In practice, variational inference optimises one criterion and is assessed with another, and there is a large gap between the two, so understanding their relationship matters. Test log likelihood is often used to measure generalization error in deep learning, but it is not clear that the optimisation criterion controls it. Controlling the gap between the true and approximate posterior also does not, in general, control the error in the mean or variance of the approximation; there are counterexamples where the gap is small but the mean and variance estimates are poor. Everything presented here is for the Kullback-Leibler divergence, but other divergences, such as the Wasserstein distance, can be used instead.
SLT states that the K_n function has a standard form, obtained by applying resolution of singularities: pulling K back by the resolution map g puts K∘g into normal crossing form. This theorem is what allows singular learning theory to work with a standard form of the K_n function, and it is the starting point for designing variational families around that form.
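In each local chart of the resolution, the standard form referred to here reads:

```latex
K(g(u)) = u_1^{2k_1}\cdots u_d^{2k_d} = u^{2k},
\qquad
\varphi(g(u))\,\lvert g'(u)\rvert = b(u)\,\lvert u_1^{h_1}\cdots u_d^{h_d}\rvert,
\qquad
n K_n(g(u)) = n\,u^{2k} - \sqrt{n}\,u^{k}\,\xi_n(u),
```

where g is the resolution map, ξ_n converges to a Gaussian process (the bounded term swept under the rug in the talk), the RLCT is λ = min_j (h_j + 1)/(2k_j), and the multiplicity m is the number of indices attaining that minimum.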
This family of variational distributions is designed to approximate a posterior in normal crossing form. It uses a mean-field factorisation in which each component is a generalized gamma distribution; computing the normalising factor involves a truncated gamma integral. The family is easy to optimise and sample from, and its variational free energy is governed by the RLCT.
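As an illustration of how such a factor can be handled computationally, here is a minimal sketch of inverse-CDF sampling from a single truncated generalized-gamma factor, assuming a factor density proportional to u^h · exp(−nβ·u^{2k}) on [0, 1]; the paper's exact parameterisation may differ.

```python
import numpy as np
from scipy.special import gammainc, gammaincinv

def sample_truncated_gen_gamma(h, k, n_beta, size, rng=None):
    """Sample u in [0, 1] with density proportional to u**h * exp(-n_beta * u**(2*k)).

    The substitution t = u**(2*k) turns this into a Gamma(shape=(h+1)/(2k), rate=n_beta)
    distribution truncated to [0, 1], sampled here by inverting the regularized
    incomplete gamma function.
    """
    rng = np.random.default_rng() if rng is None else rng
    shape = (h + 1.0) / (2.0 * k)
    cdf_at_one = gammainc(shape, n_beta)            # P(T <= 1) for the untruncated Gamma
    p = rng.uniform(0.0, cdf_at_one, size=size)     # uniform over the truncated CDF range
    t = gammaincinv(shape, p) / n_beta              # inverse CDF of Gamma(shape, rate=n_beta)
    return t ** (1.0 / (2.0 * k))

# e.g. u = sample_truncated_gen_gamma(h=3, k=2, n_beta=500.0, size=10_000)
```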
Normalizing flows are generative deep learning architectures that transform a simple distribution into a complex one using a neural network. The resolution map is an isomorphism away from measure-zero sets, so it is invertible almost everywhere; the proposal is to learn a function G that pushes the base family of variational distributions (with parameters such as β and k) forward to a new, more expressive family of variational distributions.
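For concreteness, here is a minimal sketch (PyTorch, not the paper's architecture) of one invertible coupling layer of the kind a normalizing flow stacks to build such a learned map G; the log-determinant it returns is what keeps the pushed-forward density tractable.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style coupling layer: invertible, with a cheap log-det Jacobian."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(nn.Linear(self.d, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 2 * (dim - self.d)))

    def forward(self, u):
        u1, u2 = u[:, :self.d], u[:, self.d:]
        s, t = self.net(u1).chunk(2, dim=1)
        w2 = u2 * torch.exp(s) + t          # transform half the coordinates, conditioned on the rest
        log_det = s.sum(dim=1)              # log |det dG/du| for this layer
        return torch.cat([u1, w2], dim=1), log_det

# Pushing base samples u ~ q_base through G gives w = G(u) with
# log q(w) = log q_base(u) - log|det dG/du|, which is exactly the term the ELBO needs.
```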
The paper proposes using a generalized gamma distribution as the base distribution for learning a complex distribution, instead of the usual Gaussian base. Experiments compared the performance of different base distributions and different levels of expressivity in the push-forward neural network, and showed that this architecture transforms and optimises the distribution in a tractable way.
A comparison was made between a Gaussian base distribution and a generalized gamma base distribution to see whether the latter gives a performance boost. The experiment was set up so that the push-forward network was not allowed to be too expressive, in order to isolate the effect of the base distribution. The results showed that the generalized gamma did give a boost, with a lower λ_vfe. The generalisation picture, however, had some tricky parts, as Edmund explained.
The speaker discussed experiments comparing networks intended to mimic a resolution map. The network shown in green had a better approximation to the posterior, but this did not necessarily mean it was better at prediction. For regular models the posterior converges to a Gaussian as n increases; in situations where the true parameter is singular, the local approximation around that point is never Gaussian.
For a regular model, a Gaussian approximation can be used near a critical point: as n increases the posterior concentrates, its variance shrinks, and the approximation becomes sharper. This is analogous to the central limit theorem, where a normal approximation to a binomial distribution improves as the number of trials increases: the binomial concentrates around its mean and the normal approximation becomes very sharp.
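The analogy can be checked numerically; a small sketch comparing the Binomial(n, p) probability mass function with its normal approximation as n grows:

```python
import numpy as np
from scipy import stats

p = 0.3
for n in (10, 100, 10_000):
    k = np.arange(n + 1)
    binom_pmf = stats.binom.pmf(k, n, p)
    normal_pdf = stats.norm.pdf(k, loc=n * p, scale=np.sqrt(n * p * (1 - p)))
    # maximum absolute gap between the binomial pmf and the matching normal density
    print(n, np.abs(binom_pmf - normal_pdf).max())

# The gap shrinks as n grows, mirroring how a regular posterior concentrates and is
# increasingly well matched by a Gaussian; a singular posterior has no such Gaussian limit.
```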
Neural networks can be used to approximate resolution maps, which are otherwise difficult to access and compute. The network is designed to be expressive enough to learn the resolution map, so that the guarantees attached to the resolution map are inherited, at least approximately, by the learned map. Resolution maps are not unique, however, so the network architecture must be designed carefully: restrictive enough to stay close to the structure of a resolution map while still expressive enough to represent one.
So today we're going to have Edmond tell us there it is um about some recent work that he did in collaboration with Susan way who is my colleague here at the University of Melbourne and uh perhaps we'll be joining us a little bit later and we'll see uh this one yeah thank you cool okay yeah uh please do yeah okay cool um thanks for coming everyone so um as Dan said I'm going to present um a joint work with Susan um she's the main investigator I'm just I'm helping and um here's me presenting the work and please do ask me questions if there is anything that is not clear I only present sort of the main ideas and I'll go into details and ask request them okay um so the the title of the work um is our co-operational Beijing neural network via resolution of similarities okay so um there's a bit a bit to unpack here um and I will just run down some quick um background before we start so um a quick word about um version neural network or bnns so like usual neural network as we are probably very familiar now and I'm just going to presented by a cartoon um usually what happened is we we have an uh we have a fixed a neural network architecture and um and there is that there are weights and biases um um on on these at work I'll put on that Network to be learned and the usual learning method is to find a weight that minimize some loss function over evaluated over the data set LW is the loss function and you have x i y i the data set um oh sorry sometimes plus a regularization term so this is argument over w um for Bayesian neural network what we do instead is so we look at a huge ensemble dot dot of networks um each of them having sort of their own weights hey and then so what happened in the usual network is that we have a point estimation coming from optimized um some optimization procedure and the prediction is done via so if you have an input input X and then you want to know what's the corresponding output um why that is done by just um evaluating X over that Network but in Bayesian neural network what happened is we have um ensembled prediction so what that means is that instead so given given some input X we want to know what y equal responds to but we actually um output sort of uh an entire distribution of prediction over y that comes from doing some kind of on some ensembling ensembling over what over the posterior so it is um it is saying so if every every one of these weights represent a neural network and a function from X to Y and those ones according to the data have some kind of
loss um have some kind of loss and then we are oops this is w so the distribution of a y is um some sort is is average or um is average over all possible neural network with that fixed architecture um but each the the answer of each neural network is weighted by the posterior weight and this posterior of course depends on the data so what happened is hopefully um the the network that that is just very very wrong uh would have a very very low posterior week and therefore contribute very little to um to to the outside answer um so let's discuss a little bit what um why would probably would want to do that what are the differences well um the point is kind of that this thing has a lot more information well it has a lot more information about what the possible uh why is correspond to um corresponder input X um compared to a point estimate um so in some sense we uh the the patient neural network will um it's good here it's not great there for the usual neural network um it's what about robustness um quote unquote certain the usual narrative is that um patient neural network is a bus whereas um the usual SGD train network is not and what does robust robustness means it kind of means um that it's kind of a colloquial term that sort of range over a lot of different concepts so it might it might mean that it is not um it it doesn't overfit um as easily so um so in huge Network in very of a parameterized network if you seek a point estimate by minimizing some loss function there is a chance that you would just um overfit through the training data instead of finding a past and bonus one whereas their their work that implies that um at least in with some conditions that if you do this kind of Bayesian learning then there is some sort of built-in model selection or a built-in bias towards a simpler model where simpler is measured in some specific ways related to the concept of entropy or model complexity and the other kind of robustness that people sometimes talk about is that um that I mean even just having error estimates on your why on on this um on this output value y sometimes it means that the model won't be sort of confidently wrong so in in the usual neural network you only get one estimate why and you have no idea whether or not the model is saying that is saying that as a tentative answer or it's it's just very confident whereas here you could you could know that uh what are the other possible wise and is the neural network particularly set on giving this answer
well there is a massive downside to the Bayesian um oops the spell that uh that's the massive downside so the reason based on your network well Bayesian method has um has been around four forever um but Bayesian neural network is uh still uh uh not a mainstream method because this is not it is not sex scalable um the problem is that this integral of both um or this integral this integral um is not tractable if you want to produce this estimate you it's it's not trackable for high dimension so by high dimension is so if you want to do numerical integration something like five or six uh Dimensions or neural network of just one or two uh no it's already hard while the alternative is um uh Michael chain Monte Carlo to uh to do this mcmc integration but that is also not tractable for the the the modern Network size um which can go from hundreds of thousands to Millions to billions to trillions or uh ncmc sort of well it's better than doing numerical integral but something like 10 to 20 Dimensions it's already starting to struggle right so okay so what are the remedies then so we have a problem we have a problem so P of w this this distribution while I'm running density but really it's a distribution that we care about um integrating against this distribution finding uh finding its mean or finding um its spread Etc is intractable um and difficult to sample well what possible remedies do we have so one method um for doing similar things to neural networks sort of um method that that is Bayesian in spirit um would be something like deep ensembling debunk something method so the idea is that we um we we acknowledge that we are not doing full posterior averaging but we're doing some sort of averaging where um where we are averaging across um guesses of the model parameter that that are High posterior likelihood well usually what um what that means is that people will train the neural network for with multiple different initialization with different with different a lot of things right different learning rate different a lot of things so that it is um some sort of sampling from from something maybe not the posterior maybe the posterior and then when we want to generate an inference y for given input X you just pushed X through all those networks and then take the average um what we want to focus on today is another family of method called um variational inference method and this is our Focus today so what um so to set up the problem in general we want to estimate some probably probability density P of w
so in our case this pow will just be in posterior what we do interactional inference is we propose a variational family dinar cue whoops which so this family is a family of distribution Q is a distribution over the the model weight the neural network weights um indexed by some um index by some um its own parameters so now we have we have two kinds of parameters now we have the neural network parameters and we have also have Theta which is the which is known as the variational parameters um I also apologize that in this manner we usually use Q to be the true distribution it is not in this context Q is the version of Emily so what does the first version of family do we want to find some member of the variational family call it um data star that minimize the quote unquote distance to the true to to to to key to that distribution that we are approximating and this distance well it's not quite a distance but it's a clear Divergence so we want to minimize the Divergence between Q um and whatever distribution we are approximating okay and finally here you have in mind that this PW would be for example the posterior from some sample right correct so this I'm sending out this in general but in our problem pow is the posterior and finally we replace um anything we want to do with key of w with to replace um averaging for example so things we want to do with W things we want to do with P is for example making predictions um averaging over it or making uncertainty uh estimated and so on but I'm producing averaging over P of W uh by Q star um of w which is I'm just shorthanding that as the um the member of the variational family that minimize the distance um to PFW so it's just a pictorial picture maybe we have this um this this version of family um and then there is this Cube star and um p is usually um outside of this family um why is that well what why do we why do we want to set up this setup is is would be uh would be useless if there is no additional property we demand of this of this family and the things that we demand of every queue in here is that so we need every distribution um in this family to have easy to compute um easy to compute likelihood or more likelihood um easy to compute queue of data in other words we also want easy to sample sample from them and usually we also demand that um this expression here can be optimized using first order methods so for example optimize using SGD okay so the first the first two the first first two demand is very uh con is very restrictive because
most of the time [Music] a lot of computationally effective um family is uh actually a lot of natural occurring posterior distribution and naturally naturally occurring High dimensional distribution uh that does not have those kind of properties so um this field becomes one that devote itself to designing a good family designing good optimization methods and um and trying to figure out um the kind of balance we to to strike between uh easy to compute and and minimizing this distance basically minimizing the um decent spinky and the best in queue he said it already about the question yeah maybe I'll just spell it out even more clearly since the notation is very similar to what we usually have but somehow very different so usually the KL Divergence is between also q and P but the integral is over X rather than here be uh over W so there's and I guess in that picture often we're drawing like the true distribution being close to some model but he appears not the model the same letter but it's standing for the posterior so we think about the model as being a distribution over X and then we're trying to get it close to the truth which is also a distribution over X but here it's well it's kind of X has become W in some sense so now we have distributions over weights and we're trying to um trying to match one by by some family of now the Q is kind of the class of models in some sense yeah I'm trying to match the um notation of the paper there's another thing that I'm sleeping under the rock well I will explain in a moment but um notice that in the usual notation we have the thing that the the the the true thing um on this side of the Chao Divergence and that's the approximating thing on the other side um uh the reason for that will be apparent um later on and Care leverages is not symmetric so that actually makes a difference okay so um let's discuss some theoretical property of this setup just some really high level um theoretical consequences that needs to be considered when using this using the original inference so um the optimization problem is the following uh we want to minimize over the variational parameters the power versions um between the approximating family and the posterior distribution so I'm substituting P of w to be the posterior now because that's what we are concerned about invasion of neural networks and actually this is um equivalent all right equivalent um okay I'm just going to do that this is equivalent this this optimization problem is equivalent to maximizing
um over a quantity called elbow so that is called evidence lower bound explain the name in only so what is this elbow quantity is equal to actually let me let me explain name first so evidence should be familiar to a lot of us so recall that the posterior distribution can be written in the following form which is one over Z and e to the negative n k n um W where now all the all the notation here are standard running Theory notation there is I forgot a prior um KN um is equal to negative 1 on n okay P true the true distribution now is the tradition of x um over um P model right um and ZN is the um is the model evidence that we are familiar with I'm going to work with the the normalized version of things so I'm going to bother making that distinction and so this quantity called evidence lower bound is related to this evidence um but really um given our contact of single learning theory I think it's a little bit of misnomer because we call that we have another quantity called the free energy which is just a negative log transformation of the evidence which is kind of a more theoretically easy quantity to deal with and the reason I think this is normal is because this elbow is equal to the um okay I'm going to write out the expression because that's important right so it is equal to this this expression which we will Define as the variational free energy but this version of the free energy is well uh it's not like we have um instead of having a fixed posterior distribution this variation of your energy depends on which approximation we are evaluating it on Okay so this quantity is called evidence lower bound because sorry if negative this free energy so um the expression here is just saying that the things that we are evaluating in here is um e to the um uh e to the evidence um e to the negative uh free energy which is the evidence but for the variation of free energy various professional family and the um uh standard theorem from variational calculus would tell us that well this free energy is just always going to be bigger than the actual free energy if you um if you use the actual posterior distribution and therefore implicitly this is like the evidence um VB uh yeah so that the elbow is a lower bound to the uh to the actual evidence yes okay the expression of this um of this optimization objective is also interesting because it involves this term and this is why we sort this to position which is because we want to take um expectation over the approximating distribution
instead of the intractable posterior distribution so to do optimization of a um to to do the optimization to maximize this elbow term so usually we do great in descent and at each style of gradient descent we need to we need to estimate this um we need to estimate this quantity this elbow quantity and to estimate the derivative at the current um point which means that we do that by Computing this expectation which is approximated approximated by taking samples from taking samples from the current Q um Theta t and that is by construction uh well but it's designed to be easy question yeah I have two so I I guess I can think about if I look at the free energy formula and it's a logarithm of an integral and I could just be brave and move the logarithm under the integral sign and then I get something that looks like what you've written there okay so what uh what justifies that of course it's not equal but and secondly uh I don't see immediately why minimizing this KL Divergence is the same as maximizing this quantity right oh uh so ah okay sorry that's uh that's just okay um that is a borrowing calculation from from that to that do you want me to repeat that calculation no it's okay if you just say it's some calculation that's fun so yeah it's it's really just right out write out the expression and then um noticing that there are there are terms that doesn't depend on Theta and drop that and that's your elbow that's it yeah um and and also another important thing to mention in regards to your question is that um version calculus not only give us this inequality but it also say that if we manage to find a q star that is equal to um P um then then uh then well if we go back straight to the original definition uh then the care Divergence is zero and hence um uh F and DB is equal to FN and the elbow is actually equal to the to the evidence so that um it at least have the what is that the reflexivity reflexibility of a property of the metric even though it is not a metric and this quantity um called this this gap between um the approximating the the free energy of the approximating distribution and the actual free energy well it's called The variational Gap um it's a function of n well it's actually a function of N and which which member of the family that we are using for the approximation but we just we we make it a property of the entire family by taking the the optimal distribution okay so so this is the command just now is just saying that this Gap is equal to zero if there exists Theta star such
that Q star is equal to p right okay uh now let's go back to um the first board because now I want to talk about actually uh let me let me have a question on that board I'm not ready okay um yeah yeah assistant the sense in which um look if f b b is greater than or equal to FN um you said because of the standard result in variation of calculus um when I think of variational calculus I'm thinking sort of um you know a functional integral um an integral that's the function of some functional form if you like and differentiating with respects to that function perform and this is very vague um but anyway uh so am I to think of not fixing the family of q but actually varying over all possible cues in that inequality uh yeah yeah you're right so so um I I shouldn't even like in in the setup that I mentioned I shouldn't even mention um calculus of variation um uh because I'm actually proposing a finite dimensional family parameters by Theta so it's it's really um but but you are right the um the the general statement there is that for any other distribution that is not equal to the posterior this quantity that you calculate is just going to be uh greater than um FM other than the actual free energy I see so so it's uh it it yeah so it it is so if we go back to just the the original KL Divergence um uh definition um it's kind of the by construction of the Chao Divergence and sort of care Divergence is constructed to have that and uh to have that property so this this quote-unquote theorem is really uh sort of it's only a theorem and not a tautology if if it is about the evidence lower bound okay thank you nice um next I want to talk a little bit about um this where singular Learning Theory comes in so regular as a singular all right let me talk first about um Regular what happens in this setup for regular by just recalling some um theorems that we know so um for regular model um what do we know we know that um we know that for um for any true distribution um right in full sentence um for regular model for any true distribution feature okay by any that there's some restriction no reg movements and stuff like that um for any true distribution that exists um that exists unique there exists a unique um W node in our parameter space no people down notation um such that um the detail between P true and P of X at that parameter is uh minimized and regularity almost by definition means um that um well it means that there there exists it's um this this Global Minima is uh is
is the most critical point and therefore has uh uh coordinate in which it is um in which the local activity sum square and that imply um that imply that the posterior converge as n goes to Infinity to a gaussian convergent distribution to a gaussian Center around that point so um so this um it's it's kind of a powerful statement that say that this is sort of um a demonstration of how restrictive a regular model um is is that just for any um phenomena out there that you want to study and if you're using a regular model to study it then um okay not just not just everywhere not just at w0 at the global minimum just any minimum that you're studying that you will have a gaussian center center around it Center center around that um local menu um which means that if you want to do variational approximation of of such a posterior well just use a gaussian family um use the calcium family and and match uh um and also the regular also implied that the the maximum likelihood estimator um is well converge to this is asymptotic um consistency welcome will converge we'll actually find the actual minimum so um for regular model we should just use the gaussian family and Center it around the maximum lack of estimator and try to match all covariance well Dimension covariance is not that easy so um that's another thing but um it's just um the situation with singular model is much more different so to illustrate that let's go back to the first board okay um so in the first part there is a diagram there where um we are plotting um the posterior distribution the control part of the posterior distribution not normalized um for a singular model and uh it is a singular model because um let me just write what the model is um at the bottom so the the model is a regression model with parameter a b so the x-axis is a and the y-axis is B um this is just a 10-h regression model A10 edge of BX that might be too small um so um I think this is a example that a lot of us have seen um the the set where things are not identifiable well every point is actually not identifiable but the the set with the the set with a similarity is the coordinate axis it's too small for me to see but is it perhaps y minus a 10 hbx all squared that is a really good point okay yeah so I should just write so this is this is y is equal to 18 h of X Plus gaussian noise BX okay cool yes thank you cool um so um in oh um let's look at the bottom row first in the bottom row uh we have a situation in which um not only is the model singular
meaning it's not identifiable um the the true distribution um uh that generates the data is so w0 is at at this at this point here so it is on the axis which means that it is actually on uh which which means that capital W zero instead of true parameters is has the singularity right in this case um if your um if you're if you're looking at the posterior distribution uh with data set size equal to 50 but it doesn't look like a gaussian even if you crank it up to 5000 it's never going to be gaussian right um but in the first row though that true parameter is away from the set where it is um where where w0 has Singularity and in fact it is the most critical point and you can actually find uh locally uh someone's score as representation of the local likelihood problem is at n Go to 50 it still doesn't look ocean at n equal to 5000 um is two doesn't look gaussian but if you crank it up enough this this tail distribution actually will sort of just get less and less relevant until everything is sort of concentrated around this so it's a very very degenerate um so it's it's like if you zoom in uh by just sampling more then it will become this kind of uh Recon gaussian like that and all the accuracy away from two or three standard deviation just doesn't matter anymore at high end um but this convergence is very very slow um the reason is that that there there is a um actually I sort of Cherry Picked this um example on top where um the situation is for for this particular data set um so uh recall that the posterior distribution itself is random because if you draw a different data set it will change and I cherry pick an example in which the maximum likelihood is actually here right that's kind of um so this this model has um is not identified this point is the same as this point but it's not the same as so it has a 180 degree rotation symmetric but it's actually sort of um the the Mexican language is very very far away um from the true the true parameter and um something that contribute to this kind of behavior is because um for low end the axes is kind of a reasonable approximation to the data and we know for a single learning theory that um for a hypothesis that is singular it has much larger fluctuation functional fluctuation um or singular fluctuation in that which means that this um the the hypothesis um hypothesis represented by the by this by this origin function can cover sort of a lot more um possibilities than um the none singular one okay um right um all guys to say that um
all that is to say that the variational problem for singular posterior is different um it's much more is richer and it's um the it's a it's a theory that actually requires um lots of um that in practice it requires a lot of engineering um thoughts to um like this so you're not saying that the posterior actually becomes gaussian asymptotically you're saying that it's just well approximated locally around the true parameter right you mean in the in the in the top row case right in the bottom roll case um it's got It's just navigation in the top of our case um if you crank and up to up to very very large value then um uh sort of uh we know we know that as as n becomes really really large a parameter that disagree with the data a parameter that produce output that disagree with the data a lot simply doesn't matter like the the way the posterior weight um that just becomes very very small right um which means that we only need to look at a very very small neighborhood of um comparing the posterior to a gaussian around the actual parameter we only need to look at a very very small neighborhood both the posterior is very concentrated around w0 and the gaussian at that at that level of n will be a gaussian with a very very narrow um variance which means then they match the posters do have tell that um that that is not gaussian but they don't matter anymore because the weight of their weight is so very very small right okay when n is large yeah yeah uh but the cartoon picture in my mind is like um if um that's w0 and the two posterior might have like so Wiggles and then as n becomes very large um it becomes like that so it's well approximated by the non-wiggly bits because most of the approximation cares only about that neighborhood does that make sense yes I mean when I you know the regularity being identifiability Plus um you know non-degenerate uh efficient information Matrix means that you can do a quadratic ball expansion um Taylor expansion anything around the true parameter and the sharper it is the better an approximation it is and so I'm completely with you there I just thought for a moment you were saying that all posterior somehow literally become gaussian sort of central limit theorem style um which I didn't follow but I don't think you're saying that no no I'm not saying that yeah I'm like in the bottom in the bottom row case where the um zero live in a single Assad um the posterior is never even even with the with with that um effects that I'm talking about it it
never becomes adoption and what does it become very direct your next Point yeah in the regular case it does sorry I mean in a regular case if your model is running everywhere identifiable in the in the regular case so in a singular case there is distinction between if your true parameter is lying on a singular set or away from it in the regular set in a really regular case there is only one possibility everywhere for whatever true distribution your um you have a quadratic ASP approximation near your minimum point in in a thing in a similar case then there that is not true I agree yeah um so uh maybe we can move on I don't know I I feel like I'm yeah dragging us down all right I'm happy to talk talk about this uh afterwards yeah maybe the end um sure um so um what if I have what does happen in a single thing uh in a singular case though um because we are in in the singular case in neural network deep learning um so in a singular case we know from SLT um this is related to your question uh in this code also I'm like I'm using singular in the sense that it's both it includes both regular and strictly singular um cases but anyway um from from from singular learning theory we know that the the the actual free energy has the following asymptotic expansion I've already normalized my quantities so um I'm studying I'm going to sit straight all the normalization on the rock okay so from SRT we know that um the free energy have the following asymptotic expansion where this quantity is the ROCT as we know and that's the multiplicity and um furthermore the generalization error which I forgot to Define just now but we uh let me just Define verbally so the generalization error is just the Chao Divergence between um making prediction so if we make a prediction using um the full Bayesian model averaging that will produce um a distribution of predictions but compare that to the true distribution there is still some error because um uh because we are not at the true distribution so the care Divergence between those two distribution is the generalization error well um from uh we we know that the expectation over this generation generalization error is equal to um increase in free energy this is a crucial relationship um and this is only true if we are doing prediction using the full Bayesian posterior um not any approximation to the posterior um then the generalization error have this very convenient representation and from that it implies that this is equal to equal to I think it can has I think on
the expansion Lambda on M and crucially that Lambda and that Lambda are the same now the problem with variationally method is that we don't have the theoretical problem um we don't have such uh relationship we don't we don't have this relationship between um if we use um the the distribution we use to make prediction is not the full base and posterior but using a approximate one then this relationship breaks down then we can which means that the asymptotic expansion of the variation of your energy um needs to be decorated smaller and the observation is that Lambda vfe for the the learning coefficient of the variation of very energy um is not equal to the coefficient um the the leading coefficient of the generalization error for variational generalization error what does what okay let me um let's represent this with um with the cartoon so singularly Theory sort of clarify some relationship here which is um we we have that that plot this is a cut this is a kind of cartoon representation um not claiming any quantitative so let's say we have Lambda here the rlct and from from the discussion just now about the the calculus of variation in quality we know that the elbow and therefore the um the version of free energy is always larger than the actual free energy which means that the the its coefficient is asymptotic coefficient has to be um larger as well so it's to the right um but um where does Lambda BGE what has the coefficient of the generalization error go go well let me find out that this place here is the asymptotic it's a leading order approximation of the variational gap the generalized generalization Lambda BGE can be in one of three uh one of these three places um where um if your so in in this situation in here this means that Lambda v g is smaller than Lambda vse here Lambda B GE is greater than Lambda BFE so um in in the middle situation it means that even though your um free your approximation is not as good as you want meaning you actually you never got to approximating the full posterior uh your but if you use um if you use the approximating posterior your um your your um generalization error is not hurt by Lambda bfv not being too very close to Lambda um so this is this is kind of a good case um this is this is a bad case right it means that if we if you if you um if you figure out your uh Lambda vfv your version free energy that is no guarantee that the generalization error has um the same asymptotic efficiency um typified by the the this coefficient
in this case though uh this is this this is this is very very good it means that um it means that replacing your uh the actual Bayesian posterior with um the variational approximated um distribution actually made for better prediction and you can imagine a situation in which um uh in which you design a family that is literally at the true distribution which which means that there's what there's no optimization going on and and the generalizing error is just zero all the time um so the Lambda is VG is zero that could happen but it depends on well it depends on your version of family which leads to the second point which is that all this is um different for different this this cartoon could have different configuration for different um variational family say q1 and Q2 we could have configuration so let's say the Lambda are still the same because we are learning the the learning the same phenomena the same data generating distribution the generalization Gap could be could be different so so we could have um Lambda v f e here and another model in which the generalization Gap is is huge um now it there is um something that as Susan noticed in the literature in that people generally don't distinguish between these two kind of effect um is there a question oh sorry I just unmuted myself it just I don't know if that creates some weird noise that's not the generalization Gap this difference between Lambda and Lambda v f e yeah I I yeah so my verbal is the description just now that is the leading coefficient of the generalization again what's the related to the variational generalization error right like Lambda v f e minus Lambda is just kind of the [Music] um um the accuracy Gap in doing variational inference yeah it's something like that right um it's asymptotically equal to that uh sorry I think I'm still very bad at ROBLOX um how do I know which board is currently the the big one orb cam at the top of the screen sorry yeah there used to be an orb cam but now it just says pockets and then there's an eyeball oh you've yeah you'd like to attach as listener um can you see uh if you look around over there where Ben is jumping yep you should see attaches apparently if only you were 10 years old children isn't it awakenly in a child Susan um and then then you should see it all cam at the top uh no sorry yeah I mean I I didn't mean to really interrupt except just that that this Lambda vfv minus Lambda is the is is it's just the accuracy of the variational approximation um whereas the generalization error is
really like a downstream error when you use the approximate posterior as part of your um posterior predictive distribution yep yeah um sorry yeah Technologies yeah yeah so so I always think so Lambda Lambda BGE the generalization um error is if you substitute your posterior the if you make prediction um instead of averaging over the posterior operation of the approximating posterior then you get a generalization error and asymptotic um asymptotic behavior that is Lambda BGE on n that's right that's right yeah that's right yeah oh sorry uh the the red part I um the red part is is meant to represent Lambda V of e minus Lambda not Lambda BG probably that was clean that's right yeah okay so uh so there is also a situation in which you might have um when you when you try to do a variational um uh this variational optimization you you find you find uh an approximating distribution that is very close um it's very close to to the true posterior meaning the Gap is small as in as in the first um I've been the first operational family and the gap for the second generation family is large but that sometimes doesn't mean that you might have better prediction in the first case than the second case because um the position of the um the generalization error coefficient might be um something like that might be here the other one might be here right so in which case um even though in the second case the the Gap is larger the the generalization is better um so uh this is still an open open problem what is the relationship between Lambda VG and Lambda vme um not having that relationship between generalization error and the free energy as with the full Bayesian case is um is it's not ideal um and at the moment this is still an open problem and it's only studied uh at a case-by-case basis for now sorry uh sorry if I can just interject a bit um uh so thanks I mean this is a really great explanation um it's just as someone who uses variational inference in in the real world um I can say that one of the main challenges of of using variational inference is that there there is no good way of um finding out if you've done a good job right so you have your variation to the post area it's it's really not clear how to know if you've done a good job that the only way that I've seen reliably um in terms of assessment is just to draw mcmc samples and then take that as the ground truth but this is this is not a good evaluation tool because if you if you bother to spend the time and resources to do mcmc then forget about
variational approximation right so that's not really a practical practical way of finding out if you've done a good job and what a lot of people in variational inference they've already noticed that maximizing the elbow which is the same thing as you know minimizing this this red Gap um even if you do a good job the generalization error Downstream for the variational approximation could be really poor as measured by the test log likelihood which is another way of saying this this GN that what what watanabi calls GN is simply the test log likelihood um so yeah there really is a big gap in in the practice of variational inference we optimize using one Criterion then we assess using another one but it's not clear that your optimization Criterion um at least to any control of the of the criteria that you actually care about um so the yeah understanding this relationship between these two lambdas is I think is quite interesting for for the practice of variational inference and could be a you know just a good contribution of what what SLT has to say um in terms of practical matters is there any task um yeah it's an open problem I want to ask is there any tasks in which um that so in deep learning it seems like a test logs is the almost a BL um of uh of of Downstream performance um but uh in in but there are situations in that that is um where people want to use version information are they situations in which are the tasks in which the approximation Gap is more important than um generalization laws for example I would I would think the variance estimate um would would be more related to yeah I know exactly what what you means like controlling this Gap this Lambda v f e minus Lambda you know even if it doesn't control this Downstream prediction are there other quantities it controls right meaning we can look at the difference between the moments of the true posterior and the moments of the approximate posterior so unfortunately I think the answer is no no like the even controlling this this Gap in the red it will not control um the error in the mean or variance approximation there there are actually a good example good counter examples where you you can do very well uh minimizing this this Gap um and your your meaning and variance approximations are really bad so a lot of this could be an artifact of the KL diversions so everything that you're presenting here is is for the kale diversions right um there are other divergences to use and I think I think if you use something like the
There are other divergences to use, and I think if you use something like the Wasserstein distance as a variational objective, that does provide you with good control of the mean and variance approximation. The kind of meta-lesson here is that we want to ask what the actual downstream task is that we care about; the general variational inference theory has been concerned with distributional matching, which need not be best for every situation. Yeah, there's definitely a disconnect between what we optimize in variational inference and what we actually care about, but what we really care about is hard to optimize. This is very common, right: you care about something you can't directly optimize, so you use a proxy and hope that optimizing the proxy leads to a good optimum of the thing you really care about. And the variational inference you're presenting here, in the context of Bayesian neural networks, is probably not the most prevalent application of variational inference in modern deep learning; more often you see variational inference as part of a variational autoencoder, and there you're just kind of happy if the things you generate look good. It's rather subjective, you know: it looks like a cat, it looks more like a dog, and you're happy.

Right, and my comment from before, where variational inference is used in a parameter-estimation context, for example estimating the mean level of cell concentration or DNA concentration or something like that: in that case we care more about the parameter, about matching in w, than we care about prediction, so we would actually care more about that free-energy gap being small than about the prediction. That's right, yeah.

I will move on, because I think I'm in a bit of a hurry for the last few things. In the interest of time, let me summarize a little. Previously there was work using singular learning theory. SLT gave us, well, the first main theorem, Main Formula 1, which says that the K_n function has a standard form; but to get to that standard form you need, under some mild hypotheses, to apply resolution of singularities, and if g is the resolution map, then after pulling K_n back by the resolution map you get
this normal crossing form. The second term here is an empirical process that converges to a Gaussian process, so it has bounded behaviour, and for the rest of this talk I'm just going to sweep it under the rug because it is bounded; it's the first term that determines the leading-order behaviour. What that means, to go back to Russell's question earlier, is that in the singular case we don't have asymptotic normality, we don't have the posterior converging to a Gaussian, but what we do have is the posterior converging to a standard form, using this formula here, where the resolution map is a simultaneous resolution of K and the prior; sorry, the prior is also put into normal crossing form. So that's the standard form of the posterior, and there is previous work by Bhattacharya's group that uses this fact as inspiration to design a good variational family. Recall that we want a family that is easy to optimize, whose best member is close to the posterior, and that is easy to sample from. For the easy-to-sample-from part: if your distribution is completely factorized, i.e. mean field, then it is easy to sample from, and the family inspired by this standard form looks at the standard form and says, let's just assume that part is factorized, so it's equal to u_1^{2k_1} times u_1^{h_1}, multiplied by similar factors for each coordinate, d factors in total, where d is the model dimension. So this is a mean-field family in which each component of the factorization is a gamma distribution, well, a generalized gamma distribution. Computing the normalizing factor involves a gamma integral, though not quite a full gamma integral; it is a truncated gamma integral. Using lots of properties of special functions, particularly the gamma function and the beta function, it was proven that the variational free energy, which is the proxy for how close the family can get to the true posterior, has leading term lambda log n, where lambda is the RLCT.
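As a sketch of the standard form and the mean-field family being described (assumed notation; constants and the smooth positive factor in the prior are suppressed):

```latex
% After the reparametrisation w = g(u) by the resolution map, Main Formula 1 gives
n K_n(g(u)) = n\,u^{2k} - \sqrt{n}\,u^{k}\,\xi_n(u),
\qquad u^{2k} := u_1^{2k_1} \cdots u_d^{2k_d},
% where \xi_n converges to a Gaussian process (the bounded term swept under the rug above),
% and the pulled-back prior is proportional to |u^h| du, with u^h := u_1^{h_1} \cdots u_d^{h_d},
% up to a smooth positive factor. The mean-field family mimicking this standard form uses
% one truncated generalized gamma factor per coordinate:
q(u) = \prod_{j=1}^{d} q_j(u_j),
\qquad
q_j(u_j) \propto u_j^{h_j}\, \exp\!\big(-\beta_j\, u_j^{2k_j}\big)\, \mathbf{1}\{0 \le u_j \le 1\},
% whose normalising constant is a truncated (incomplete) gamma integral.
```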
What this means is that if we use this family and manage to find the optimum, then we have at least matched the leading-order term of the free energy (recall that the free energy is the negative of that), which means the gap at least doesn't grow: we capture at least the leading-order term, the lambda log n term. I won't go through the proof. Now, the problem is that this assumed that everything I wrote on the board is in the coordinate u, in which the posterior is in standard form, and that relies on having re-parametrized our model using the relationship w = g(u), that is, using the resolution map. Well, I think most of us here already know that getting a resolution map for even a simple model is hard, let alone for a large neural network model learning a non-trivial true distribution. So we don't actually have the standard form. Susan's proposal is: the resolution map is a function we don't know, so let's just stick a neural network there and ask the neural network to learn it. So the proposal is to learn g using a neural network; recall that the resolution map is an isomorphism away from the singularities, away from a measure-zero set. Which means we are effectively constructing a new variational family: the old family from just now, pushed forward by this G function. So the new family is the push-forward of that family, which still carries its other parameters, the beta, the k and the h. The relationship here just means that the normal-crossing coordinate and the usual coordinate are related by this capital G function, which is a neural network, and this G function has the property that it is invertible almost everywhere. This is in fact a normalizing flow architecture. Normalizing flow is a popular generative deep learning architecture in which the goal is to transform a very simple distribution into a very complex distribution, where the complex distribution is the one we want.
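In symbols (again a sketch with assumed notation), the proposed family is the push-forward of the mean-field generalized gamma base through the learned invertible map G, with the usual change-of-variables density:

```latex
% If u \sim q_{\beta,k,h} (the mean-field generalized gamma base) and w = G(u),
% with G invertible almost everywhere, then the push-forward family has density
q_G(w) = q_{\beta,k,h}\!\big(G^{-1}(w)\big)\,
\Big|\det \tfrac{\partial G^{-1}}{\partial w}(w)\Big| ,
% and sampling is simply: draw u from the base distribution, then output w = G(u).
```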
The way it does this is that, for example, we can start with a simple Gaussian base distribution, which is very easy to sample from and to generate data from, generate a sample, and then move it across via the transformation to get some really wacky distribution. There are some requirements on this architecture so that computing the transformation and optimizing it is easy, and that is the main constraint on what kind of architecture you can put on G. In the paper we are using something called the affine coupling architecture, which I can explain afterwards if you're interested. So the setup I'm drawing here is the usual setup, in which people have a Gaussian base distribution and then learn this G to approximate a real-world distribution. What we are proposing is that instead of using a Gaussian as the base distribution, we use a generalized gamma distribution; so instead of doing that, we are doing this.

Okay, I'm going to spend the next few minutes discussing the results. Dan, could you put up the next pictures, if that's possible? We might need to move to a new set of boards. Hello Dan, are you there? I'll be AFK for a moment; there are a few more pictures that describe the results. Hello Dan, is that you? Yeah, I'm back, sorry, I had to step out for a second. Shall I put up the diagrams? Yes please, and maybe we can move to a new set of boards; sorry for the trouble. No, no, there was an emergency where Russell's best friend has come over for the Nintendo setup, life or death, so I had to sort it out. Let me see if I can get another one. While Dan is putting them up, I should clarify that I have stripped all annotations, so hopefully you won't accuse me of bad science: the graphs go up and to the right, be impressed.

So, looking at the first board, there's a bit to annotate here. We did some experiments on the setup we mentioned. The high-level goal is to compare: the usual setup uses a Gaussian base distribution, pushed forward to learn a complex distribution, and we are proposing a different base distribution, the generalized gamma, inspired by singular learning theory, by that Main Formula 1. So we are comparing the performance of the different base distributions.
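A minimal sketch of this construction (hypothetical PyTorch code, not the implementation used in the experiments): a generalized gamma base distribution, realized by transforming Gamma samples, pushed through a couple of affine coupling layers. Swapping the base sample for torch.randn recovers the usual Gaussian-base flow.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer: half the coordinates pass through unchanged and
    parameterize a scale and shift applied to the other half (triangular Jacobian)."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, u):
        u1, u2 = u[:, :self.d], u[:, self.d:]
        s, t = self.net(u1).chunk(2, dim=-1)
        s = torch.tanh(s)                       # keep the log-scale bounded
        w2 = u2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)                 # log |det dG/du| for this layer
        return torch.cat([u1, w2], dim=-1), log_det

def sample_generalized_gamma(h, two_k, beta, n):
    """Samples u_j with density proportional to u^h * exp(-beta * u^(2k)) on (0, inf).
    If X ~ Gamma(concentration=(h+1)/(2k), rate=beta) then X**(1/(2k)) has this density.
    (The truncation to [0, 1] used in the theory is ignored in this sketch.)"""
    x = torch.distributions.Gamma((h + 1.0) / two_k, beta).sample((n,))
    return x ** (1.0 / two_k)

# Hypothetical setup: a 4-dimensional weight space and two coupling layers.
dim, n = 4, 8
layers = nn.ModuleList([AffineCoupling(dim), AffineCoupling(dim)])

h     = torch.full((dim,), 2.0)   # assumed base-distribution exponents
two_k = torch.full((dim,), 4.0)
beta  = torch.full((dim,), 1.0)

u = sample_generalized_gamma(h, two_k, beta, n)   # vs. torch.randn(n, dim) for a Gaussian base
w, log_det_total = u, torch.zeros(n)
for layer in layers:
    w, log_det = layer(w)
    log_det_total = log_det_total + log_det
# w are samples from the push-forward family; log q(w) = log q_base(u) - log_det_total.
```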
Another variable we vary is the expressivity of the push-forward, the normalizing-flow neural network, the G_theta network (I really shouldn't use theta, I've overloaded it). So what happens here is that the red series is the Gaussian base distribution and the green series is the generalized gamma, and the different plots are for different true data-generating distributions. The top one, I think, is a singular model; well, everything here is a singular model. Remember that our high-level goal is: given a neural network, can we do Bayesian inference over the weights of that neural network? The method we use is variational inference, and, via Main Formula 1, that means we are using a normalizing flow to approximate the posterior distribution. Quite a bit of level-jumping here.

Sorry, the details of these graphs are maybe a bit much for everyone here; maybe we can just give a summary. I mean, we're just comparing the two base distributions, right? If you didn't know singular learning theory you would just use a Normal(0, 1); that's what everyone uses in the normalizing flow literature. But because we have this cute little insight from SLT we're going to use this generalized gamma, and then push the base distribution through a neural network, an invertible neural network with a simple Jacobian. The thing is, if you let this neural network be extremely expressive, there's going to be no difference between any pair of base distributions. So the experiment is set up so that you don't allow the flow to be very expressive; then you can isolate the effect of the base distribution and see if the generalized gamma one does experience a performance boost. And the conclusion, at least based on the experiments we've done so far, is that yes, you do get a boost by using the generalized gamma. By a boost in performance I mean I get a higher ELBO, meaning a lower Lambda_vfe. But then when we switch to the generalization error there are a lot of tricky parts, because of what Edmund has explained before. So what that means is that the green lines, the ones that use the generalized gamma, on the left-hand side here, which plots the variational free energy, the green line is lower in most cases, except here. But there is
a problem with comparing across different expressivities of the network that tries to mimic the resolution map. Going from left to right, where the right-hand column plots the generalization error, we can see, for example in the first row, that even though the green always wins in terms of having a better approximation to the posterior, meaning lower variational free energy, that doesn't mean it is better at prediction. The next two charts are really more of the same conclusion, but for more configurations of the experiment. So I will end it there; sorry for going a bit over time. If you have any questions, I'm happy to discuss.

So, there was my quagmire from earlier, which is a lower priority, that's my hesitation right now, but we could go. Sure. So my question is: are you saying that for a regular model, asymptotically the posterior becomes Gaussian? That is correct: for a regular model learning any data-generating distribution, the posterior converges to a Gaussian when n gets large enough; that is a theorem. Okay, right. But for a singular model there are sort of two situations, even though the model is singular somewhere: in one case the true distribution is such that the optimal parameter sits at a singularity, and there is the other case. In the case where the optimal parameter is singular, you don't have the quadratic approximation around the true parameter, which is what usually gives you a Gaussian; it is the necessary condition for getting a Gaussian. If you want to walk back with me to the board, that's the bottom situation in this graph: the singular set is here, and the true distribution, the distribution that generates the data, lies on that set at a point with a singularity. In that situation, no matter how high I crank n, locally around that point it's never going to be Gaussian. Sorry, go ahead. Oh, the lag is killing us, isn't it. Okay, I'll stop speaking for a while and you go ahead; I'm not going to talk for a bit. Okay.
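A sketch of the contrast being drawn on the board, with an illustrative (assumed) two-parameter example rather than the exact one used in the talk:

```latex
% Regular ("top") situation: non-degenerate Hessian at the optimum w_0, so locally
K(w) \approx \tfrac{1}{2}\,(w - w_0)^{\top}\,\nabla^2 K(w_0)\,(w - w_0),
% and the posterior \propto \exp(-n K(w))\,\varphi(w) is asymptotically Gaussian,
% with covariance shrinking like 1/n.
% Singular ("bottom") situation, e.g. K(w_1, w_2) = w_1^2 w_2^2 near (0, 0): the Hessian
% is degenerate, no quadratic approximation exists, and the posterior concentrates on the
% whole zero set \{w_1 w_2 = 0\} rather than on a single point, so it never becomes Gaussian
% no matter how large n is.
```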
The top situation is one in which there is actually a good local coordinate, in a very small neighbourhood, a neighbourhood that doesn't cross the singular set, such that the surface has a normal quadratic form there; it is a non-degenerate critical point. Which means that if you crank n high enough, these tails shrink, the tails become less and less important for the approximation, and if you zoom in enough on that point you get a Gaussian approximation.

So the Gaussian, then, for finite but large n, because it's regular, you know, the non-degenerate Hessian and all that, you have a Gaussian approximation, and that approximation, in the limit of infinite n, becomes an arbitrarily good approximation, so the posterior really is Gaussian, but with a finite covariance, not in the infinite limit?

So recall that in that convergence the standard deviation has a one-over-root-n scaling, that is, the covariance has a one-over-n scaling; that corresponds to what I meant by zooming in far enough. But if you scale the covariance by n, then yes, there is a fixed value to which it asymptotes. That is just the usual central limit theorem, for example the central limit theorem approximation to the binomial as the number of binomial trials becomes very large; what I'm saying is not very different from that situation. The binomial is a discrete distribution, it's not... oh, that's actually a good example. The binomial is not only a discrete distribution, it is also zero for a negative number of realizations; it's only positive for non-negative counts. But if you put a normal approximation on the binomial distribution, there is a non-zero probability of getting, say, negative three heads, which is impossible for the binomial but possible for the approximation. Yet if you crank n up enough, the binomial concentrates around its mean, and the normal approximation becomes very sharp there.
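A quick numerical version of the binomial example (an illustrative check, not something from the talk): the normal approximation puts some mass on impossible negative counts, but both that mass and the overall discrepancy shrink as n grows.

```python
import numpy as np
from scipy import stats

p = 0.3
for n in [10, 100, 1000, 10000]:
    mean, std = n * p, np.sqrt(n * p * (1 - p))
    # Mass the normal approximation assigns to the impossible region {X < 0}.
    neg_mass = stats.norm.cdf(-0.5, loc=mean, scale=std)
    # Maximum gap between the exact binomial CDF and the continuity-corrected normal CDF.
    k = np.arange(n + 1)
    gap = np.max(np.abs(stats.binom.cdf(k, n, p) - stats.norm.cdf(k + 0.5, mean, std)))
    print(f"n={n:6d}  P_normal(X<0)={neg_mass:.2e}  max CDF gap={gap:.4f}")
# Both columns shrink as n increases: the approximation is wrong on the negative part,
# but wrong in a diminishingly small way.
```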
Even though it is wrong on the negative part, it is wrong in a diminishingly small way as n becomes larger. Something similar is happening here: far away from that neighbourhood the approximation is wrong, but as n is cranked up the difference matters less and less, in fact exponentially less, as you go far enough from the neighbourhood. Okay, thank you.

Yeah, no real question, just thinking out loud: the neural network used to learn the resolution map, I mean, is it itself singular, with its own resolution? These resolution maps are so intractable and difficult to access, and maybe even to understand, that I wonder what this gives you, whether it gives you any insight into resolution maps in general. Yeah, I'm using these affine coupling layers; it's just a simple architecture that's easy to fit, and I did not really think very hard about how to design the approximation of the resolution map. There's something very hand-wavy in the preprint where I just argue that if my neural network is expressive enough I can learn the resolution map, and then the theoretical guarantees I enjoy for the resolution map will be inherited by the approximation, but I've not thought about it very carefully. I think good future work would be somehow baking in what we know about what a resolution map actually looks like, but I really don't know. To be clear, learning "the" resolution map is in some sense ill-defined, because there is no unique resolution. That's right, yeah. But certainly the map I'm using has an upper-triangular Jacobian, which is really quite restrictive, and there's a part of me that worries that no resolution map has this type of Jacobian structure. Sorry, when you said you can't learn the resolution map because there isn't one, you mean there isn't a unique one? Correct, yeah; so you could certainly learn a resolution map in principle. Lots of questions, but they're sort of... Sorry, I want to say that the logic this is relying on is that if we have a large enough, expressive enough neural network, so that a resolution map is included in the family, then at least we know that in