Feed forward neural networks are a deep learning architecture whose statistical properties require singular learning theory to understand. Watanabe's free energy formula states that the free energy of a compact subset of parameter space asymptotically equals n times the minimum negative log likelihood plus the log canonical threshold times the log of the number of training samples. Markov chain Monte Carlo and Hamiltonian Monte Carlo are used for sampling from the posterior, and a two-layer feed forward ReLU neural network is used to demonstrate a scaling symmetry among the minimizers of the Kullback-Leibler divergence. As theta increases, the posterior starts to prefer a lower-complexity one-node configuration over a two-node configuration, eventually reaching a point where the errors of the two configurations are comparable.

A feed forward ReLU neural network is a function from the product of the input and parameter spaces to an output space, used in deep learning. Watanabe's free energy formula shows that in singular models even non-true parameters can be preferred by the posterior. Neural networks are represented as graphs, with inputs connected to a hidden layer of nodes and an output layer of m outputs. Regression with neural networks is set up as a model-truth-prior triple with independent and identically distributed data. Neural networks are strictly singular models; Watanabe's theory applies to regular models as well, but has more interesting consequences for strictly singular ones.

Feed forward neural networks are a deep learning architecture whose statistical properties require singular learning theory to understand. The prior is typically a normal distribution, but its precise form is not important in the long run. The Bayesian posterior is a probability distribution on parameter space obtained from Bayes's rule: the probability of the data set given the parameters, times the prior, divided by the evidence, which is the partition function of statistical physics. The posterior is given by 1/Z_n times the prior times e^(-n L_n(w)), where L_n(w) is the negative log likelihood, given by the mean squared error between the outputs and the predictions plus a constant.

Watanabe's free energy formula states that the free energy of a compact subset of parameter space asymptotically equals n times the minimum negative log likelihood plus the log canonical threshold times the log of the number of training samples. Model complexity is measured by the log canonical threshold, a positive rational number. The free energy is composed of accuracy and complexity, and the effective number of parameters near the most singular point is typically less than the true number of parameters. Singular learning theory suggests that models with a lower effective number of parameters can generalize better, via a self-organizing principle by which the posterior decides the best effective number of parameters.

Markov chain Monte Carlo and Hamiltonian Monte Carlo are algorithms for sampling from the posterior of a given data set. An example is given of a two-layer feed forward neural network with two inputs, one output, and weights, biases, and q parameters. A two-layer, two-node feed forward ReLU neural network is used to demonstrate a scaling symmetry among the minimizers of the Kullback-Leibler divergence. As theta is increased to pi/4, the true distribution is slowly deformed into one whose network contour plot requires only one node to describe. Samples concentrate where the free energy, and hence the generalization error, is low, balancing the error and complexity of the model.
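The sampling idea can be sketched with a toy random-walk Metropolis chain. This is a simplified stand-in for the Hamiltonian Monte Carlo actually used in the experiments, and the 2-D standard-normal target below is an invented illustration, not the talk's actual posterior:

```python
import math
import random

def metropolis(log_post, w0, steps=5000, scale=0.5, seed=0):
    """Random-walk Metropolis sketch: samples concentrate where
    log_post is high, i.e. where the free energy is low."""
    rng = random.Random(seed)
    w, lp = list(w0), log_post(w0)
    samples = []
    for _ in range(steps):
        # Propose a Gaussian perturbation of the current parameters.
        prop = [wi + rng.gauss(0.0, scale) for wi in w]
        lp_prop = log_post(prop)
        # Accept with probability min(1, exp(lp_prop - lp)).
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            w, lp = prop, lp_prop
        samples.append(w)
    return samples

# Toy target: a standard normal "posterior" in 2-D, started away from the mode.
samples = metropolis(lambda w: -0.5 * (w[0] ** 2 + w[1] ** 2), [3.0, 3.0])
mean_x = sum(s[0] for s in samples[2500:]) / len(samples[2500:])
```

After discarding the first half of the chain as burn-in, the remaining samples concentrate near the high-density region around the origin.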

As theta increases, the posterior starts to prefer a lower-complexity one-node configuration over a two-node configuration, even though it is less accurate. At theta = 1.26, a phase transition occurs in the energy-entropy trade-off between the one-node and two-node configurations. Numerical calculations show that as theta increases, the regions where the errors differ become narrower, eventually reaching a point where the errors are comparable and the one-node configuration is preferred.

A feed forward ReLU neural network is a function from the product of the input and parameter spaces to an output space. It is a simple architecture used in deep learning practice, and tractable in that the true parameters of simple versions can be calculated. Watanabe's free energy formula shows that in singular models, non-true parameters can be preferred by the posterior even though they do not minimize the negative log loss.

Neural networks are represented as graphs, with inputs connected to a hidden layer of nodes. Each hidden layer is the composition of an affine function, parameterized by weights and biases, with the ReLU activation function. The output layer consists of m outputs, and the layers can have varying widths. The resulting network function is piecewise affine. Regression with neural networks is set up as a model-truth-prior triple with independent and identically distributed data.
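A minimal sketch of such a network in plain Python, assuming hypothetical sizes (two inputs, one hidden layer of two ReLU nodes, one output); the helper names are mine, not from the talk:

```python
def relu(z):
    """Vectorized ReLU: applied componentwise to a vector."""
    return [max(0.0, zi) for zi in z]

def affine(W, b, z):
    """Affine map z -> W z + b, with W given as a list of rows."""
    return [sum(wij * zj for wij, zj in zip(row, z)) + bi
            for row, bi in zip(W, b)]

def feedforward(x, W1, b1, W2, b2):
    """f(x, w): affine, then ReLU, then a final affine layer.
    The result is piecewise affine in x."""
    return affine(W2, b2, relu(affine(W1, b1, x)))

# Example with arbitrary weights: the second hidden node is inactive
# at this input, since ReLU zeroes its negative pre-activation.
y = feedforward([1.0, -1.0],
                W1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
                W2=[[1.0, 1.0]], b2=[0.5])
```

Each choice of weights carves input space into regions on which the network is a single affine function, which is the "piecewise hyperplane" picture from the talk.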

A regression model is a probability distribution on the output space given by a neural network function plus a noise term. This model is very standard in statistics, with linear regression being the starting point. Neural networks are strictly singular models, and it is the properties of the function f(x, w) that determine whether a model is regular or strictly singular. Watanabe's theory applies to regular models as well, but strictly singular models have more interesting properties.
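The regression density p(y | x, w) is proportional to exp(-(1/2)||y - f(x, w)||^2), so its log density can be written out directly. This is a sketch with a hypothetical linear `w_fn` standing in for the network; the function names are mine:

```python
import math

def log_density(y, x, w_fn):
    """log p(y | x, w) for the regression model y = f(x, w) + N(0, I),
    including the Gaussian normalizing constant."""
    f = w_fn(x)
    sq = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    m = len(y)  # dimension of the output space
    return -0.5 * sq - 0.5 * m * math.log(2 * math.pi)

# Linear regression as the simplest (regular) instance:
# f(x, w) = w1 * x + w2 with w1 = 0.5, w2 = 0.
lp = log_density([0.0], [1.0], lambda x: [0.5 * x[0] + 0.0])
```

Swapping the lambda for a feed forward ReLU network function gives the strictly singular model the talk is about; the density formula is unchanged.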

Feed forward neural networks are a good starting point for understanding modern deep learning architectures. They are strictly singular, meaning that the Fisher information matrix is degenerate, so singular learning theory is necessary to understand their statistical properties. Asymptotic results about the posterior are independent of the exact form of the prior, which is important to note. The prior is typically a normal distribution, but its precise form is not important in the long run.

The Bayesian posterior is a probability distribution on the space of parameters obtained from Bayes's rule: the probability of the data set given the parameters, times the prior, divided by the evidence. Since the data are i.i.d., the first factor is the product of the model evaluated at all data points, and the evidence term is a normalizing constant. The posterior can be written as 1/Z_n times the prior times e^(-n L_n(w)), where L_n(w) is the negative log likelihood, given by the mean squared error between the outputs and the neural network predictions plus a constant. The evidence term Z_n is known as the partition function in statistical physics and is a normalizing factor.
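A sketch of L_n(w) and the unnormalized posterior, assuming one-dimensional outputs for brevity (the function names are mine, not the talk's):

```python
import math

def nll(w_fn, data):
    """L_n(w): one-half the mean squared error between outputs and
    predictions, plus the Gaussian constant."""
    n = len(data)
    half_mse = sum((y - w_fn(x)) ** 2 for x, y in data) / (2 * n)
    return half_mse + 0.5 * math.log(2 * math.pi)

def unnormalized_posterior(w_fn, data, prior_w):
    """prior(w) * exp(-n * L_n(w)); dividing by Z_n would normalize it."""
    n = len(data)
    return prior_w * math.exp(-n * nll(w_fn, data))

# Toy data fit exactly by the identity map, so only the constant remains.
data = [(0.0, 0.0), (1.0, 1.0)]
post = unnormalized_posterior(lambda x: x, data, prior_w=1.0)
```

Computing Z_n itself requires integrating this quantity over all of parameter space, which is exactly why sampling methods such as MCMC are used instead.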

The free energy of a compact subset of parameter space is the negative log of the partition function restricted to that subset. Watanabe's free energy formula states that as the number of training samples n increases, the free energy asymptotically equals n times the minimum negative log likelihood plus the log canonical threshold times log n. Model complexity is measured by the log canonical threshold, a positive rational number.

The free energy of a model is composed of accuracy and complexity. Twice the RLCT can be interpreted as the effective number of parameters near the most singular point of the subset, and is typically less than or equal to the true number of parameters. Because of the trade-off between error and complexity, a model with higher error but lower complexity can have lower free energy than one with lower error but higher complexity.
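The trade-off can be made concrete with the asymptotic formula F_n ≈ n · L_n(w0) + λ · log n. The error and RLCT values below are invented purely for illustration:

```python
import math

def free_energy(n, min_nll, rlct):
    """Watanabe's asymptotic free energy: n * min L_n(w) + lambda * log n."""
    return n * min_nll + rlct * math.log(n)

n = 1000
F1 = free_energy(n, min_nll=0.500, rlct=3.0)  # accurate but complex region
F2 = free_energy(n, min_nll=0.505, rlct=1.0)  # less accurate, simpler region
# At n = 1000 the simpler region wins (F2 < F1); as n grows, the
# n * L_n term dominates and the more accurate region takes over.
```

This is exactly the phase-transition behavior seen in the experiments: which region the posterior prefers depends on the balance between the n-scaled error term and the log n-scaled complexity term.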

Singular learning theory suggests that models with a lower effective number of parameters can generalize better, even if they are less accurate. This happens through a self-organizing principle: the posterior can decide the best effective number of parameters to have, and this varies across the space of parameters. This differs from regular models such as polynomial regression, where the number of parameters d is fixed; in singular models it is lambda that varies across the space of parameters.

Markov chain Monte Carlo and Hamiltonian Monte Carlo are algorithms for sampling from the posterior of a given data set. These samples should concentrate in regions of high posterior density, which is equivalent to low free energy, low generalization error, and hence good models. By the trade-off, the free energy is the sum of error and complexity, appropriately scaled by n and log n. An example is given of a two-layer feed forward neural network with two inputs, one output, and weights, biases, and q parameters. The truth can be deformed from a distribution that requires two nodes to one that requires only one node to fully express; the set of true parameters is the set of minimizers of the Kullback-Leibler divergence.
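The two-node model from the example has the form f(x, w) = q1·ReLU(⟨w1, x⟩ + b1) + q2·ReLU(⟨w2, x⟩ + b2) + c. A sketch, with parameter values chosen arbitrarily just to show that setting q2 = 0 switches the second node off:

```python
def relu(z):
    return max(0.0, z)

def two_node_net(x, q1, w1, b1, q2, w2, b2, c):
    """f(x, w) = q1*ReLU(<w1,x>+b1) + q2*ReLU(<w2,x>+b2) + c,
    with 2-D input x and weight vectors w1, w2."""
    dot = lambda w, v: sum(wi * vi for wi, vi in zip(w, v))
    return q1 * relu(dot(w1, x) + b1) + q2 * relu(dot(w2, x) + b2) + c

# Two active nodes versus the degenerate "one node" configuration (q2 = 0),
# which is the lower-complexity region the posterior comes to prefer.
y_two = two_node_net([1.0, 1.0], 1.0, [1.0, 0.0], 0.0, 1.0, [0.0, 1.0], 0.0, 0.0)
y_one = two_node_net([1.0, 1.0], 1.0, [1.0, 0.0], 0.0, 0.0, [0.0, 1.0], 0.0, 0.0)
```

The deformation by theta in the experiments interpolates between a truth that genuinely needs both nodes and one that the q2 = 0 submodel can express exactly.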

A two-layer, two-node feed forward ReLU neural network is used to demonstrate a scaling symmetry among the minimizers of the Kullback-Leibler divergence. The true network contour plot requires two nodes to describe it, with two activation boundaries where each node goes from being inactive to active. As theta is increased towards pi/4, the true distribution is slowly deformed.
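The scaling symmetry rests on the identity ReLU(α·z) = α·ReLU(z) for α > 0, so the reparameterization (q, w, b) → (q/α, α·w, α·b) leaves a node's output unchanged. A quick numerical check (all values arbitrary):

```python
def relu(z):
    return max(0.0, z)

def node(q, w, b, x):
    """One hidden node: q * ReLU(<w, x> + b)."""
    return q * relu(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Rescaling (q, w, b) -> (q/alpha, alpha*w, alpha*b) with alpha > 0
# leaves the function unchanged, since ReLU is positively homogeneous.
q, w, b, x, alpha = 2.0, [1.0, -0.5], 0.3, [0.7, 0.2], 5.0
out1 = node(q, w, b, x)
out2 = node(q / alpha, [alpha * wi for wi in w], alpha * b, x)
```

This one-parameter family of parameters all realizing the same function is what degenerates the Fisher information matrix, and it is why the experiment's posterior plots are drawn over the normalized effective weights rather than the raw parameters.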

The posterior plots show that for sufficiently low theta, two nodes are preferred, since a one-node distribution is very inaccurate. However, two small regions of concentration, around the origin and around (0, 2) or (0, 1.5), indicate a configuration with only one node, where one of the weights is zero. This suggests that the posterior gives some weight to a lower-complexity one-node configuration even though two nodes are strictly necessary for maximum attainable accuracy.

At theta = 1.26, a phase transition occurs in the energy-entropy trade-off between the one-node and two-node configurations. The one-node configuration is less accurate but has lower model complexity, so the posterior starts to prefer it. This trend continues until theta = pi/2, where the one-node configuration is dramatically preferred to the two-node one. This shows that the posterior can prefer a one-node configuration even though it is less accurate.

A numerical calculation compared the errors of samples from two different regions of the posterior: a degenerate region around the origin, and a non-degenerate region in between. The error of every sample from each of the two regions was computed. As theta increased, the gap between the errors narrowed, eventually reaching a point where the errors were comparable, at which point the posterior preferred the lower-complexity one-node configuration.

Okay, you can take it away when you're ready. Great, good morning everybody. So this morning I will be talking about neural networks and the Bayesian posterior. Our aim for this talk is going to be to use a concrete example of neural networks, which are strictly singular models and therefore interesting to singular learning theory, or interesting examples of where singular learning theory is useful. Using these we will introduce the posterior and the free energy, thus explaining the intuitive understanding of Watanabe's main free energy formula, which is: free energy equals n times the minimum negative log loss, min L_n(w), plus the RLCT times log n. So our aim will be to explain this formula with the plots that you see are already up on the boards over there, which I spent a long time on for my thesis. The upshot of this will be that in singular models, non-true parameters can nonetheless be preferred by the posterior. That is to say, in regular models, where complexity is only measured in terms of the total number of parameters in the model, we should usually prefer the model which has the best accuracy, which is to say the lowest negative log loss. But in singular models this free energy formula will tell us that sometimes parameters that are not the minimizers of the negative log loss can nonetheless be preferred by the posterior. Cool, so I'm going to hop over to the next board just for the sake of having everything on the same one. Okay, so first of all: feed forward ReLU neural networks. I will briefly introduce this architecture of neural network, which is quite simple compared to architectures used in modern deep learning practice, but ReLU is a very common activation function, and these offer a good starting point. They are also tractable, in that we can actually calculate the set of true parameters of simple versions of these models,
which is a tick. We can't necessarily apply all of Watanabe's theory to them, and we can't calculate the RLCT particularly easily, but they are a good starting point. So, definition: a feed forward ReLU neural network (I'll abbreviate this) is a function from the space of inputs times the space of parameters, where the input space reflects the number of inputs into the statistical model and W is the parameter space, to an output space, where m is the number of outputs of the

model, which we represent as a graph. A typical neural network graph will look like the following: we have a few inputs, which then connect to a hidden layer of nodes, where these connections represent the composition of an affine function with the ReLU function. I'll write that down a bit more precisely in a second, but for the sake of the graph, we can have a number of hidden layers in there, l minus one, where each successive layer is another composition of an affine function with the ReLU. Then we have an output layer, which will be m outputs. These layers can have different widths, and that is part of the network architecture. So we represent the neural network as this graph, which induces a function that looks like the following, where these l's denote affine functions. Each affine function is parameterized by weights w_l, which are effectively the gradients, and biases, denoted b_l, giving rise to an affine function of some vector z. The domain of each affine function has the dimension of the layer it comes from, and the dimension of its output is that of the layer it goes to. And here, for those that haven't seen it before, ReLU(x) is defined to be x when x is greater than or equal to zero, and zero otherwise, which is clearly non-linear. What this means is that these network functions ultimately arise as piecewise hyperplanes, and we will see an example of this when I come to the experiments at the end. I'll also note that the ReLU function is vectorized in the above definition, which is to say that ReLU applied to some vector is just ReLU applied componentwise. Great, so these are our feed forward ReLU neural networks, and they are going to be the concrete example that we will look at
today. So I'll hop over to the next board to start explaining how we can set this up as a regression model. All right: regression with neural networks. We will, as Watanabe does, package this up as a model-truth-prior triple, where each of these will be denoted by the following notations. To start with the truth: we suppose that we have independent and identically distributed (i.i.d.) data D_n of input-output

pairs. So, for example, the inputs could be the pixels of a picture and the output might be the animal in that picture: a dog, a cat, a monkey, whatever. This data is supposed to be drawn from an unknown true distribution, which is what we wish to model, denoted q(y|x). I won't mention this in great detail here, but we do suppose that we know the input distribution q(x); it is the conditional distribution of the outputs given the inputs that we suppose we don't know, and that is what we wish to model. Then the model: given some feed forward ReLU neural network function f(x, w), the regression model is given by the following. It is a probability distribution on the space of outputs, equal to the neural network function plus some noise, distributed with, say, a standard normal distribution. As a density, that simply pops out as being the exponential of negative one half times the squared Euclidean norm of y minus the neural network's output, where that norm is taken over the output space, because we recall that y is an element of R^m. I like looking at that first form of it a bit better; the second one is obviously easily interpretable, but the first one really tells you what the regression model is saying: we do think it's this neural network, but we add some standard noise to it, to keep it within Bayesian statistics, I guess. So, for any given parameter w, it is a probability density on output space. I'll note here as well that the regression model is very standard within statistics, and the starting point for anyone learning about it would be linear regression: you suppose that f(x, w) is mx + c, or w_1 x + w_2. What is really interesting in our context is that it is the properties of the function f(x,
w) that determine whether the model is regular or strictly singular. A linear regression model is not strictly singular; it is just a regular model, and so Watanabe's theory applies to it, but it doesn't really have anything particularly interesting to say about it that standard statistics doesn't already tell us. However, feed forward neural networks are in fact strictly singular models, which I will state in a second. So it is the properties of f(x, w) that we are particularly interested in. The final part of this triple

is the prior, which is, according to Casella and Berger, a subjective distribution on the parameters w based on the experimenter's prior beliefs about what they think the parameters should be. A standard one is to say that the prior is just a standard normal distribution. I think it is worth pointing out here, and I'm going to say typically, because I don't know all of the literature quite well enough to make a full assertion about this, but typically, asymptotic results about the posterior are independent of the exact form of the prior, which I think is important to note. What that tells us is that we don't necessarily care so much about what the prior is as the number of training samples goes to infinity; it's mainly important when we have finite samples, which is always the case in practice, but nonetheless this is the picture that you want to have in your head. Cool, all right, I'll hop over to the next board. Yeah, so I can see a comment there, which I think is from Matt: some things matter, for example the prior should not be zero in important places. Yes, so at a theoretical level the prior certainly matters for things like that. I guess my emphasis is that the precise form of the prior, what the experimenter chooses before they have seen any experimental data, is not necessarily important in the asymptotics, beyond theoretical considerations about where it is zero and that kind of thing. If you're taking a prior on all of the real numbers, for instance, with a normal distribution, then it doesn't matter whether you take a standard deviation of 1 or of 10, etc.; these are all irrelevant in the long run. So why do we care about these feed forward ReLU neural networks? Well, as I said before, they are a good starting point for understanding modern deep learning architectures, but in particular, under the regression model they are strictly
singular, which is to say they have a degenerate Fisher information matrix, the Fisher information matrix being what Edmund spoke about last week. Once you do a few calculations and simplify it down a bit, it has the following form, and it is degenerate. A strictly singular model is one in which the Fisher information matrix is degenerate, or at least that is the interesting case. So, as I say, the upshot of this is that singular learning theory is necessary to understand the statistical properties of

feed forward ReLU neural networks, and most other neural networks as well, things like the tanh neural network and that kind of thing. Great, so I'm not going to bother proving this today; I can send a link to it later, because really what I want to do today is give you the intuition for the main formula. To get to the main formula we now have to go through the Bayesian posterior, which is a probability distribution on the space of parameters that I spoke about last week in the lead-up to Markov chain Monte Carlo. The Bayesian posterior given our data is written as follows: the probability of the data set given the parameters, times the prior, divided by what's called the evidence. I should say that, unsurprisingly, this posterior is called the Bayesian posterior because it uses Bayes's rule to get there. We can simplify this: because each data point that we have from the truth is i.i.d., that first term, the probability of the data given our parameter of choice (this is the model here), is simply equal to the product of the model evaluated at all of our different data points. And since the denominator, the evidence term, does not depend on w, it is essentially just a normalizing constant, so that the posterior itself is a probability distribution. That evidence term is given by an integral over parameter space of what is on top: the model times the prior. This means that we can rewrite the posterior; I'm thinking whether I do it on this board or the next one, I might start on this one. The posterior is given by the following formula: one over Z_n, times the prior, times e to the negative n L_n(w), where the new object L_n(w) is the negative log likelihood, given by the following formula, which is (apologies for the crammed writing down here) the mean squared error between the outputs that we know and what the neural network predicts, plus a constant. So
L_n(w) is one of the more important objects in all of this setup, because, as you can see, it is the mean squared error, and thus is sort of the first thing that we wish to... we don't necessarily wish to minimize it, as we will see, but we want it to be pretty good, right? You want the model to fit the data pretty well. And Z_n there is essentially just the evidence term from above, but we call it the partition function, a name coming from statistical physics. It is, again, the normalizing factor,

and is the partition function. Great. Okay, so the posterior tells us which parameters w, or which regions of parameter space, are most likely to fit the data of our data set D_n, and so regions, compact subsets of parameter space, that contain good models will have greater posterior density. I will make sense of this on the next slide, but that's the picture that I want you to have in your head: we're looking for regions of high posterior density in our search for good models. Great, so let me move on to the next slide. Okay, so this brings us to the free energy. The free energy is simply an adjusted, or effective, Hamiltonian of the posterior density. Really we just define this for theoretical reasons, but you should have in your head that the free energy is essentially equivalent to posterior density, but sort of flipped. The free energy of a compact subset W of parameter space is given by F_n(W), the negative log of the partition function over that subset, which is to say the negative log of the integral of the model times the prior over that subset. I will also remark that not only is it convenient for theoretical reasons to consider the free energy in this form, but the generalization loss, which I won't define precisely here simply because it is a bit of a tangent (you've got to go through the Bayes predictive distribution to do it in the way Watanabe does), is the average increase in free energy with n, the number of training samples. That is to say, the free energy is very much related to the generalization loss of models in the region W. So this brings us to Watanabe's free energy formula. The key point of Watanabe's work, which we are going to discuss in great depth in future seminars but which I will simply brush over today and introduce the intuition for, is that model complexity in singular models is measured by the real log canonical threshold, denoted lambda, which is a
positive rational number. The theorem, which is main formula number two in Watanabe's 2009 book, states that in singular models, as the number of training samples, denoted little n, tends to infinity, the free energy asymptotically satisfies the following formula: n times the minimum of the negative log likelihood, plus lambda log n. Here I should say that we're assuming, as above, that W is some compact subset of parameter space. There are a few ways of describing these two terms, but this

first term here is essentially the error term of the best parameter in the subset that we are considering. That is to say, the free energy is comprised of the accuracy and the complexity. The second term I think of as complexity; in a physical sense one can also think of the first term as being the energy of the system and the second term as being the entropy. Recall that the negative log likelihood is the mean squared error between the predicted output and the measured output from our data set, and so the free energy combines minimizing that mean squared error with the complexity of the most singular point in the subset that we're considering. So, some remarks. What is this RLCT? I think the easiest way to interpret the RLCT is by saying that two lambda is the effective number of parameters near the "most singular" point (which I'll put in quotation marks) of our subset. That is to say, model complexity is essentially the number of parameters in the model, but the point is that in singular models we can have models whose effective number of parameters is much less than the true number of parameters afforded to the model. For regular models, lambda is equal to d over 2, where d is the number of parameters afforded to the model, but in singular models the RLCT is in general less than or equal to d over 2. So I'll go to the next board. The upshot of this formula is the following. Suppose we have two subsets: W1, which has low error, which we denote by this (the subscript w0 here is just shorthand for the minimum attainable error over the subset), so suppose it has low error but high complexity, meaning the RLCT, the effective number of parameters, is high; and another subset W2, which has higher error but lower complexity. It might not fit the data as precisely as models from W1, but it has a
lower complexity, which is to say it might generalize the data better. Then the formula says that, because of this trade-off between error and complexity, it is possible for W2 to have lower free energy despite being less accurate, that is, despite having higher error. I will remark, and I need to be careful with my words here, that saying this precisely does depend on a bit of a conjecture from Watanabe, and really the story that I'm telling here is more about what you're about to see in

the plots. Watanabe's work currently only deals with the case where these regions contain some true parameter, and thus minimize the negative log likelihood, minimize the error; but we can see in the experiments that I'm about to show you that, at least in what I've considered, it seems to be true when the region doesn't necessarily contain a true parameter as well, and that's the intuition that I want to build. So this fact is a key feature of strictly singular models: the effective number of parameters is more flexible in singular models than in regular models, meaning the posterior can concentrate around models that generalize better even if they are less accurate. So we're nearly up to the experiments; let me just explain what we're going to be looking at, neural network posteriors. [Audience:] A quick question. So, it's not really unique to singular models that a lower number of parameters might give higher error but better generalization. For example, that's the case with polynomial regression: you get really crazy fits that match the data perfectly well with a high-degree polynomial, you're fitting to the noise, and so you're better off with a lower-complexity model even at the cost of some training error. So what's the difference here? What is singular learning theory saying that's more than that? [Speaker:] Yeah, it's a good question. The way that I like to think about it is in terms of this term flexibility, which at the moment sounds a bit imprecise, but it is some sort of self-organizing principle, in a sense. Essentially, my interpretation of what singularity gives you is the flexibility to have a model with a lower effective number of parameters, and that varies across the space of parameters, across W, whereas you can only get flexibility in something like polynomial regression if the experimenter is sitting there at their computer iterating over: I'm
going to choose a polynomial of degree 2, degree 3, degree 4, degree 5. So I like to think of it as this kind of self-organization, where the posterior itself can decide what it thinks the best effective number of parameters is, and that differs across the space of parameters, whereas that's not true in polynomial regression or any other regular model, which is to say that d is always fixed, whereas lambda varies across the space of parameters in singular models. [Audience:] I think that makes some sense. So in polynomial

regression, you're right, model selection I think is what we call that, the practitioner changing the degree of the polynomial, and then they're left with a choice of which one to use. So are you suggesting that Watanabe's theory puts forward a principle for model selection? [Speaker:] Yes, yeah, I would say so. Great, does that answer your question? [Audience:] Yes, thank you. [Speaker:] No worries. And perhaps Dan will have some thoughts on that at the end as well. [Dan:] I would have interrupted if I thought your answer wasn't good. [Speaker:] Cool, the fact I haven't heard from you yet is very nice. Okay, so recall from last week that Markov chain Monte Carlo, or also Hamiltonian Monte Carlo, which I talked about in supplementary 2, is an algorithm for sampling from the posterior p(w | D_n). There's an equivalence to have in your head here: we expect the samples that we get from Markov chain Monte Carlo to concentrate in regions of high posterior density, which is to say low free energy, which is to say low generalization error (understood in the correct sense), which is to say good models, which is to say low error and low complexity. But as we have seen from the trade-off, where the free energy is equal to error plus complexity, there might be a trade-off between the two, so the samples don't necessarily have the lowest error or the lowest complexity; what matters is the sum of those, appropriately scaled by n and log n. All right, so in our example (I might go on to the next board), we will have a model that is a two-layer feed forward ReLU neural network with two nodes, two inputs, and one output, which is a function of the following form. Here these q's at the front are also weights, but in my thesis I have notated them as q's simply for ease of notation; we don't need to worry too much about them here, and the c there is also a bias.
So this is the form of our model, where the weight vectors are two-dimensional and the q's, the b's, and the c are all one-dimensional. Our truth will take the same form, but we will deform it from a distribution that requires two nodes to one that requires only one node to fully express it; that is to say, our truth is realizable by the model. Thus the set of true parameters, namely the minimizers of the Kullback-Leibler divergence, which, when you carry out a calculation,

shows you that it is the set of parameters for which the model equals the truth, is non-empty. So there do exist parameters that minimize the Kullback-Leibler divergence.

Okay, the final thing to say concerns how to read the plots. I've spoken about these experiments a good few times now, and every time I get to this bit I wonder how to simplify some of the complexity in what these plots are actually showing. But I will just put it to you that, to read the plots, we have a posterior over weights with the first dimension on the x-axis and the second dimension on the y-axis, where these weights are normalized: we actually only care about what I denote ŵ, the weight normalized by the scale factor out the front. This is essentially saying that the posteriors are over the effective weights, and we can do this because these neural networks satisfy a scaling symmetry which means that the function (not the posterior) is invariant under this kind of scaling transformation. That invariance, I should say, is what gives the degeneracy of the Fisher information matrix; this scaling symmetry is the essence of the singularity structure of these two-layer feedforward ReLU neural networks. We will also have the two nodes, the two effective weights I should say, superimposed onto the same plots, which we can do because there is a permutation symmetry in the nodes of our model. I can mention a bit more about that when we're actually looking at the plots, but I reckon now is a good time to run over to the experiments.

Okay, so you can see here on this first board we've got the model, which is what I wrote on the other board, and the true network, which is a two-layer, two-node feedforward ReLU neural network that, when theta is zero, looks like this first plot here. So this is the function itself; this is a contour plot of the true network
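The scaling symmetry mentioned above can be checked numerically. Assuming each node has the form q * ReLU(w.x + b) (consistent with the model written earlier, though the exact parameterization is an assumption), ReLU's positive homogeneity, ReLU(a*z) = a * ReLU(z) for a > 0, means the function is unchanged if we scale (w, b) by a and q by 1/a:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def node(q, w, b, x):
    # A single ReLU node: q * ReLU(w . x + b)
    return q * relu(np.dot(w, x) + b)

rng = np.random.default_rng(0)
x = rng.normal(size=2)
q, w, b = 1.5, rng.normal(size=2), 0.3
alpha = 2.7  # any alpha > 0 works

original = node(q, w, b, x)
rescaled = node(q / alpha, alpha * w, alpha * b, x)
# The outputs agree: the map from parameters to functions is
# non-injective along these scaling orbits, which is the degeneracy
# of the Fisher information matrix referred to in the talk.
```

This is why the posterior plots can be drawn over the normalized effective weights ŵ: each scaling orbit collapses to a single point.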
and so we can see here that it definitely requires two nodes to describe it, because we have two activation boundaries, which are these lines here where a node goes from being inactive (when ReLU is applied to a negative number) to active (when the argument is positive). So we start with something that requires two nodes, and then we slowly deform this true distribution as follows: you can see here we increase theta to pi/4, and it is still a distribution that

requires two nodes. But then at precisely theta = pi/2, the true distribution actually only requires one node. So what we are asking here is whether the posterior prefers parameters that have one node active with the second weight zero, or whether it prefers a more complex configuration where both weights are non-zero. Intuitively, we look at this first plot down here, the degenerate one ("a degen"), and think that it has fewer effective parameters: it only has one effective node, which is to say three effective parameters, w_11, w_12, and the bias of that node; everything else can be zero. The second distribution, by contrast, expresses precisely the same function but with a more complicated model: it is using two nodes to describe it. And so our question is: which does the posterior prefer? You should be thinking to yourself, based on the free energy formula, that it should prefer the one with lower model complexity, which is to say we expect it to prefer the first one here, "a degen", where one node is active and the other is zero.

The other question we might ask ourselves is: in between, is there some theta value where we still require two nodes to precisely express the true distribution, but the posterior actually prefers one node? That is to say, does it ever prefer a one-node configuration because of its lower complexity, despite two nodes being strictly necessary for the maximum attainable accuracy, which is to say the lowest error?

So without further ado, I think we can hop over to the posterior plots over here; just tell me when you want to move between them. Yeah? Yep, great. So here the red dots indicate the two arrows that you can see in that first theta-equals-zero board over there. Here theta is 1.08, just because not much happens between theta = 0 and theta = 1.08; a lot of the interesting stuff happens from here on in. So this is saying that for sufficiently low theta, the posterior
absolutely prefers having two nodes, because a one-node distribution here is very inaccurate. If we then scroll forward to the next one, we start to see that the posterior still prefers having two nodes, but (I am without a pen here, so hopefully everyone can picture what I'm saying) these two smaller regions of concentration, about the origin and around (0, 2) or (0, 1.5), refer to a configuration with only one node: one of the weights is zero, hence the origin, and one is

two, so one plus one, which is the other node. And so we can see here that the posterior still prefers a two-node configuration, because the accuracy more or less requires it (a one-node configuration would still have too much error), but there is already a small amount of posterior concentration on the one-node configuration.

If we flick to the next one, we start to see what this energy-entropy, or error-complexity, trade-off looks like. Here the trade-off is pretty similar between the two: the regions where the red dots are, which are the two-node configurations, have free energy approximately equal to that of the one-node regions. The one-node configuration is still less accurate, but the complexity is trading off in such a way that its free energy is actually starting to be lower than that of the two-node configuration. If we keep going, we see the same trade-off occurring again, and if we keep going once more, we really start to see that the posterior is starting to prefer the one-node configuration. At this theta value it is still not as accurate as the two-node one, yet the one-node configuration is nonetheless being preferred by the posterior. So that actually answers our second question: does it ever prefer a one-node configuration even if it is less accurate, even if it doesn't contain a true parameter? According to this plot, yes it does.

If we continue on, we just see the same trend, and if Dan scrolls right through to the end here in a nice slow fashion, we get to theta = pi/2, where we see that the posterior dramatically prefers the one-node configuration to the two-node one. That answers our first question, which configuration does the posterior prefer: it prefers the one with the lower free energy, which is the one with lower model complexity, which here is the one-node
configuration. So if we go over to the final board here, we can see this trade-off in flight. These plots were designed for my thesis, not for the talk, so some of the labelling might seem a bit abstract. Essentially what you want to think of here is the relative density of the two configurations; recall that "degen" denotes the one-node configuration and "non-degen" the two-node one. Matching what we saw in those posterior plots, we see that at theta of about 1.26 there was an exchange, a phase transition,

where the one-node configuration became more preferred by the posterior. And this is the error down here: if we map that to the lower plot, we see that the transition roughly coincides with the point where the errors of the two regions became comparable. Theoretically the error of the one-node configuration was always going to be worse, but because we were taking a finite sample of data points, and because the regions where the two configurations differ narrowed each time we increased theta, there did come a point where the errors were comparable to one another, which meant that the posterior started to prefer the one-node configuration with the lower complexity; you can see that happens about here. For the subsequent theta values, the errors are very similar to one another.

I reckon that might just be... yeah, and it's about time as well. Sorry, Matt, go for it. "In that last plot, are you plotting some kind of theoretical error of the sample, where you force it to use either one or two parameters?" So this last plot here is a numerical calculation of the error of samples from the two different regions of the posterior that we saw on the other page. "Okay, as in, you built them out of a whole lot of different runs of this training algorithm, or sampling?" Yes. "So you could pick one where, say towards the end, it did happen to choose the two-parameter solution, and then you computed the error of that?" Sorry, what was the last bit? "My fundamental question is: how did you get these errors? How did you get the data for this plot? Right, I think I understand: you went into individual runs from which you built the posterior, some of them happened to be either in the degenerate region or the non-degenerate region, and you used the empirical error from those runs and put that on this plot." Precisely, yep, absolutely spot on. So just for ease of explanation, in fact, I might just do it on here.
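The procedure just agreed on in the Q&A (assign each posterior sample to the degenerate or non-degenerate region, then compute each region's empirical error) might be sketched as follows. The radii, the function names, and the region geometry are hypothetical placeholders, not taken from the thesis:

```python
import numpy as np

def classify(w_hat, r_origin=0.5, r_lo=1.7, r_hi=2.3):
    """Assign a normalized effective weight to a region (radii hypothetical).
    'degen' (one-node): near the origin (the switched-off node) or inside
    the annulus where the single active node sits.
    'non-degen' (two-node): the band in between."""
    r = np.linalg.norm(w_hat)
    if r < r_origin or r_lo < r < r_hi:
        return "degen"
    if r_origin <= r <= r_lo:
        return "non-degen"
    return "other"

def region_errors(samples, errors):
    """Average empirical error of the posterior samples in each region."""
    buckets = {}
    for w_hat, err in zip(samples, errors):
        buckets.setdefault(classify(w_hat), []).append(err)
    return {region: float(np.mean(errs)) for region, errs in buckets.items()}
```

Comparing the two averages over a sweep of theta values would then reproduce the kind of crossover plot being discussed, where the errors of the two regions become comparable.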
So essentially I did something that wasn't quite a clustering analysis, but something similar. If we have the posteriors on this plot here, one region corresponded to an annulus that looks like the following: the degenerate region consisted of models that came from this annulus and from about the origin, whereas the non-degenerate region was essentially everything in between. And so the errors, and these density plots at the top here, were obtained by taking all of the samples in these two regions and calculating