WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed. The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.

Synopsis


Learning theory studies how models can be used to predict the behaviour of part of the environment. This seminar looks at degeneracy: the situation where multiple parameters produce the same model. It uses a pendulum example to explain the concept and discusses how precision, temperature, Occam's Razor, singular learning theory, and model selection are all relevant. Lastly, it discusses how knowledge of the system can suggest a non-degenerate parameterization, and why searching in a larger, more degenerate parameter space may nonetheless aid learning.

Short Summary


Learning is the process of finding a model that can accurately predict the behaviour of a part of the environment. It involves adjusting a model to the information obtained from measurements in order to align the model with those measurements. An example is a pendulum experiment, in which the initial angular displacement (x) and the angular displacement after time t (y) are measured. The model's parameters (lambda, g and t) must be known in order to make a prediction, and they can be learned by searching over a parameter vector w. The goal of learning is to infer the underlying process that explains the observations; the resulting knowledge is a device that produces more accurate approximations as work is expended on it.
The transcript explores the concept of degeneracy in learning theory: the situation where multiple parameters produce the same model. It uses the pendulum example to show how different combinations of lambda, g and t can determine the same model, and why geometry is therefore relevant to learning theory. It also explains the relationship between the precision of parameter estimates and the energy (or loss) of the process used to find the true parameters, where energy measures how far the model's predictions are from the true process. Near a true parameter the search process resembles a simple harmonic oscillator with noise, and the true parameter is estimated by taking the mean of the positions of the oscillator.
The transcript then discusses the relationship between precision and temperature, Occam's Razor and its application to true parameters, singular learning theory, and the degeneracy of the mapping from parameters to models. It also discusses the importance of paying attention to the meaning of parameters and how knowledge of the system can suggest a non-degenerate parameterization. Lastly, it highlights the role of model selection and the potential benefits of having more dimensions in the parameter space.

Long Summary


Learning is the acquisition of knowledge or skills through experience. It is not the same as communication or programming, although they are related. We consider learning as an agent in an environment making measurements of a computable process. This process generates signals that the agent can measure and receive. A computer playing chess, for example, might do so through hard coding or through learning in the sense meant here, as in systems like AlphaGo and AlphaZero; there is a continuum between the two, but they should not be conflated.
Learning is the process of searching for a model that can accurately predict the behaviour of a part of the environment. The models are parameterized by points of a space W, and learning is a dynamical process on W: each point visited is a model, and the search is for a model that is true. The process q itself is best thought of as the mechanism behind the measurements observed, rather than as the sequence of all possible measurements, and the goal is to infer the underlying process that explains the things seen.
The transcript discusses the concept of learning and how it relates to acquiring knowledge. Learning is the process of adjusting a model to the information obtained from measurements in order to align it with those measurements. The end point of this process is the true parameter w star, which, when plugged into the algorithm that is the model, produces predictions of measurements for q that are the same in distribution as q itself. Knowledge is the end result of the learning process, better represented by the computational device that produces more accurate approximations as work is expended on it than by the point w star itself. Multiple points in the parameter space may predict q exactly.
The transcript describes a pendulum experiment in which the initial angular displacement (x) and the angular displacement after time t (y) are measured. The probability of measuring y, given x and the parameters, is normally distributed about a mean f(x) determined by solving Newton's equations. The example is used to illustrate the idea that multiple points in parameter space can determine the same model.
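As a hedged illustration, here is a minimal sketch of the generative model just described. The exact formula for f is not written out in the summary, so the small-angle solution below (pendulum of unit length, initial angular velocity lambda times x) is a reconstruction from the transcript, and the noise level sigma and the sampling range for x are assumptions rather than values fixed in the seminar.

    import numpy as np

    def f(x, lam, g, t):
        # Reconstructed small-angle pendulum displacement (unit length, initial push lam * x):
        # theta(t) = x * cos(sqrt(g) * t) + (lam * x / sqrt(g)) * sin(sqrt(g) * t)
        return x * np.cos(np.sqrt(g) * t) + (lam * x / np.sqrt(g)) * np.sin(np.sqrt(g) * t)

    def sample_measurements(n, lam, g, t, sigma=0.05, seed=0):
        # Draw x uniformly from [-pi/4, pi/4] (the assumed experimental protocol)
        # and add Gaussian measurement noise to the true displacement.
        rng = np.random.default_rng(seed)
        x = rng.uniform(-np.pi / 4, np.pi / 4, size=n)
        y = f(x, lam, g, t) + rng.normal(0.0, sigma, size=n)
        return x, y

    x, y = sample_measurements(5, lam=0.3, g=9.8, t=1.0)
    print(list(zip(np.round(x, 3), np.round(y, 3))))

Sampling (x, y) pairs from this distribution plays the role of the measurements the agent receives from the process q.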
The transcript discusses the process of predicting future measurements of the pendulum given a list of (x, y) pairs. The pendulum's motion is governed by a formula that depends on the parameters lambda, g and t. These must be known in order to make predictions, and they can be learned by searching over the parameter vector w = (lambda, g, t), which lives in a compact parameter space W. The model can be viewed either as the formula f or as the distribution it determines; the two determine each other.
The transcript discusses the dynamics of the pendulum and the fact that the parameters of the model (lambda, g and t) determine the distribution of y-values associated with a given x-value. It is noted that the mapping from parameters to models is not injective, meaning that different parameters can determine the same model. The mass of the pendulum does not appear as a parameter because it cancels out of the calculation.
The goal of the experiment is to determine the true values of the three parameters lambda, g and t, but the formula shows that multiple combinations of these parameters produce the same model: for any a > 0, replacing (lambda, g, t) by (a lambda, a^2 g, t/a) leaves the model unchanged, so the true parameters form a one-dimensional subset of parameter space, sketched as a curve along which a runs from zero to infinity. The conclusion is that multiple parameters can reproduce the same function, and that a formula is not synonymous with the function it computes.
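The scaling degeneracy can be checked numerically. The sketch below reuses the reconstructed small-angle formula for f (an assumption, as above) and verifies that rescaling (lambda, g, t) to (a lambda, a^2 g, t/a) leaves the model unchanged for any a > 0.

    import numpy as np

    def f(x, lam, g, t):
        # Reconstructed small-angle pendulum displacement (unit length, push lam * x).
        return x * np.cos(np.sqrt(g) * t) + (lam * x / np.sqrt(g)) * np.sin(np.sqrt(g) * t)

    x = np.linspace(-np.pi / 4, np.pi / 4, 7)
    lam, g, t = 0.3, 9.8, 1.0
    for a in [0.5, 2.0, 10.0]:
        # (lam, g, t) -> (a*lam, a^2*g, t/a): both sqrt(g)*t and lam/sqrt(g) are unchanged,
        # so the model is identical even though the parameters are not.
        same = np.allclose(f(x, lam, g, t), f(x, a * lam, a**2 * g, t / a))
        print(f"a = {a}: identical model -> {same}")  # prints True for every a > 0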
The transcript explains the concept of strictly singular models: models in which multiple parameters determine the same model. This is illustrated by the pendulum and Newton's equations, and it is emphasised that this is not an exotic situation. The transcript also sketches why geometry might be relevant to learning theory and why this singularity or degeneracy matters. Lastly, the idea of mixing up parameters is raised, since in effect there may be only two independent parameters rather than three.
The transcript discusses why this degeneracy is significant for the learning process. Although one could use a single parameter to encapsulate two dependent ones, the degeneracy still affects how knowledge is used and how estimates of the true parameters are gradually improved. It also affects search procedures such as gradient descent: since such procedures have finite precision and may never terminate, in many cases the true value is never known exactly unless the problem can be solved analytically.
The transcript discusses the relationship between the precision of parameter estimates (for lambda, g and t) and the energy of the process used to find the true parameters. The energy measures how close the model's predictions are to the true process, for example via the KL divergence. Near a true parameter the energy behaves like a quadratic form, so the search process (e.g. gradient descent) looks like a simple harmonic oscillator moving around in a potential well. The transcript also notes that there can be more than one true parameter.
The transcript considers two true parameters, w star and w dagger: near w star the error looks like a parabola, while near w dagger it is flatter (like x to the power 2k). The search process behaves like a simple harmonic oscillator with noise, and the precision is how tightly the search is localised around the true value. The true value itself is unknown; only the trajectory of the search (the blue part of the diagram) is visible, so the true parameter is estimated by, for example, taking the mean of all the positions visited by the oscillator.
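A minimal sketch of this picture, with invented numbers: noisy gradient descent on a one-dimensional quadratic energy bounces around the true parameter like a buffeted oscillator, and averaging the visited positions estimates the true value. The energy scale, step size, noise level and number of steps are illustrative assumptions, not values from the seminar.

    import numpy as np

    alpha, w_star = 1.0, 0.7                        # assumed energy scale and true parameter
    grad = lambda w: 2 * alpha**2 * (w - w_star)    # gradient of E(w) = alpha^2 * (w - w_star)^2

    rng = np.random.default_rng(0)
    w, lr, noise = -1.0, 0.05, 0.1                  # illustrative start, step size, noise level
    trajectory = []
    for _ in range(5000):
        # Noisy gradient step: the noise stands in for sampling error in SGD, so the
        # iterates oscillate in the potential well instead of converging exactly.
        w = w - lr * grad(w) + noise * rng.normal()
        trajectory.append(w)

    estimate = np.mean(trajectory[1000:])           # discard the initial transient, then average
    print(f"true w* = {w_star}, estimated w* = {estimate:.3f}")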
Delta W is a measure of precision and Delta E is a measure of temperature. Near w star, where the energy is quadratic, Delta E is approximately alpha squared times Delta W squared, so to halve the box containing w star the energy must be decreased by a factor of four: to know the truth with twice the precision, the energy or loss must be reduced four times over. Near w dagger, where the energy looks like x to the power 2k, the same calculation gives Delta E approximately equal to alpha squared times Delta W to the power 2k, so halving the box requires reducing the energy by a factor of 2 to the power 2k.
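A small numerical sketch of this relation, assuming the local behaviour Delta E = alpha^2 * Delta W^(2k): halving the precision box divides the required energy by 2^(2k), a factor that grows quickly with k.

    # Delta E ~ alpha^2 * Delta W^(2k): halving Delta W divides Delta E by 2^(2k).
    for k in [1, 2, 5]:
        factor = 2 ** (2 * k)
        print(f"k = {k}: halving Delta W requires reducing Delta E by a factor of {factor}")
    # k = 1 recovers the factor of four from the quadratic case near w*.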
The principle of Occam's Razor states that a simpler solution is better. At a degenerate true parameter such as w dagger, the truth is present at a more coarse-grained level: fewer decimal places are needed to specify the model locally, so local changes can be specified with fewer bits. By Occam's Razor, such a true parameter can therefore be expected to be better than an isolated one such as w star.
The transcript introduces singular learning theory, in which it is a theorem that some true parameters generalize better than others; this is because the mapping from parameters to models is more singular or degenerate near some true parameters than near others. Watanabe's theory explains the link between algebraic geometry and the foundations of statistical learning theory, and this seminar series examines that link and its implications for learning.
The transcript discusses whether degeneracy in the parameter space can simply be eliminated by changing the model. It is not necessary to know algebraic geometry for this material, and one can indeed reparameterize to remove degeneracies. However, this can make it harder to search for the true parameter, and there may be good reasons to search on a larger, more degenerate space; the success of highly singular deep learning models illustrates this. Seeing a non-degenerate parameterization in the first place also presupposes knowledge of how the system, here the pendulum, actually works.
The transcript stresses the importance of paying attention to the meaning of the parameters in a model, since they can correspond to independent factors with their own degrees of freedom. If physically independent parameters such as g and t are mixed into a single parameter, a change in the true distribution becomes hard to interpret, whereas keeping them separate allows such changes to be understood as changes in independent degrees of freedom. It also raises the question of how one decides in the first place which variables should parameterize the model.
Model selection is a big part of statistics, and the tools discussed provide principled ways to choose between models with different numbers of parameters, with the Bayesian information criterion as the classical example. Singular models provide flexibility: their structure effectively lets the search process range over models with different numbers of parameters, which is not the case for regular models. It is suggested, tentatively, that having more dimensions in the parameter space allows more ways of moving off a singularity and so may aid learning.
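As a sketch of the classical tool alluded to here, the Bayesian information criterion BIC = k ln n - 2 ln L trades off fit (the maximised log-likelihood ln L) against the number of parameters k for n data points; the numbers below are invented for illustration. Note that the BIC is derived under regularity assumptions that singular models violate, which is part of what singular learning theory revisits.

    import math

    def bic(num_params, num_data, max_log_likelihood):
        # Bayesian information criterion: lower is better. It penalises parameter
        # count while rewarding the maximised log-likelihood of the fitted model.
        return num_params * math.log(num_data) - 2 * max_log_likelihood

    # Hypothetical comparison: a 3-parameter model vs. a 10-parameter model fitted
    # to the same 500 measurements (the log-likelihood values are made up).
    print(bic(num_params=3, num_data=500, max_log_likelihood=-120.0))
    print(bic(num_params=10, num_data=500, max_log_likelihood=-115.0))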
Adding parameters to a model does not guarantee better results: making a model artificially more singular than the true generative process should not be expected to help. On the other hand, real-world phenomena such as natural language are far more complicated than any model one could compute with, so there is little danger that adding very large numbers of parameters to deep learning models makes them more complicated than reality; in some regimes adding parameters, and thereby probably making the model more singular, can be expected to help. The mathematics of resolution of singularities goes some way towards explaining why, but a simple intuitive storyline is hard to give.

Raw Transcript


okay so let me start what is learning uh so learning i guess is an informal term and we're going to drill down and give a more formal interpretation of it and then i'll start talking about the mathematics that might be associated to that but to start with a dictionary definition which will refine learning could be said to be the acquisition of knowledge or skills through experience now that definition is a bit too broad because you you can take say experience to mean just acquiring information from the environment right so if somebody flips a coin and then tells you the result that's information and you could say you'd learned the outcome that's a english phrase that wouldn't make people bat their eye but that isn't really what we mean by learning in the present situation right that's just getting information now if you were told it's a biased coin and you don't know the proportion of heads or tails and you see many samples that is you get the results of many individual coin flips each one of those coin flips is some information and you would say that you're learning what the bias is over time that is a proper use of the term learning in the sense we're going to mean it but not the first so learning and communication are not the same thing although they're closely related of course and it's hard to imagine learning happening without some form of communication or information exchange within an environment now you could also say that if you program a computer to play chess that the computer has learned to play chess that's also another informal usage of this word uh but that's not the sense of learning that we really mean although you could have a computer learn to play chess as we know from alphago and alpha zero and so on many systems before that so there's a continuum of notions between somehow hard coding something and learning in the sense we're going to mean it but we don't want to just conflate these two things so learning is not the same thing as sort of programming although again they're somewhat related okay so here's the definition we're going to still informal but the definition we're going to to use so we're going to consider an agent so this is the thing that does the learning in an environment this is the place that the thing to be learned about lives making measurements [Music] of i'll say computable process q so q lives in the environment it's generating some sort of signals that we can measure receive as an agent and i say computable because
presuming that the agent somehow runs with bounded computation and can only manifest computable functions then somehow a process out in the world that's not computable really is hard to imagine as subject to to being able to be learned but that's kind of a you know controversial point of discussion for later so learning is the process of the agent searching for a model of q some part of the environment what's a model well once we get into more mathematics we'll be more precise about this but for the moment i'm just going to say a model is an algorithm that is you can you can do something you can expand work and compute predictions for what the process q out there in the world will do and then you can compare your predictions to the actual measurements and you say that the model is true if it reproduces at least in distribution measurements from q okay now i can take questions at this stage but i'll start drilling down a bit more into to what i mean so we're going to assume wait so it's cute something that is not necessarily a deterministic process that's right yeah for example the biased coin flips so the process uh suppose it depends how you think about it you could say the process is kind of like the generating mechanism behind the measurements that you observe the events that you observed or it is kind of the sequence of all possible measurements that that might happen and also that do happen uh i think it's better to think about it as the mechanism behind the measurements so if if the world was a simulation it would be the code that is running that generates the measurements and you're trying to infer what that underlying process is that explains the things you see please tell me if i'm standing in the wrong spot as a speaker i'm trying to manage both the ipad and the recording and so on so it's quite possible i'm getting it wrong so we assume the possible models so far there's no sort of geometry or even anything continuous necessarily but we assume the possible models are parameterized by points of the space which is going to always be called w now learning was the process of finding or searching for a model right and if models are parameterized by points of a space that means that learning is some kind of dynamical process on w moving around on w each step is a model looking for a model that is true and here's my little sketch for that's my head here's w he is some interesting thing in the environment i make measurements and maybe i start with a very bad
model i'll call that w1 and then i can steadily refine that model maybe the true model or chord w star lives down here but just because i take measurements doesn't mean i instantly sort of know how to reconcile those measurements with my model how to reproduce them with my model right maybe i do it's a very simple model and it's a very simple process but generally there's some process of sort of adjusting yourself to the information in order to align your model with the measurements and that process is learning at least in the sense i'm going to use it the word okay so w1w2w3 dot dot in some sense converging hopefully to w star that is learning so this is the truth in the sense that if you were to plug w star those numbers into your algorithm that is the model it would produce predictions of measurements for q that are exactly the same in distribution as q itself so this is the environment over here okay so in that diagram knowledge acquired by the agent it's different to just the information of the measurements right and the difference is exactly the learning process the information of the measurements it just sits there on paper that's not knowledge knowledge is the end result of this process of learning whereby you reproduce the measurements or at least in distribution do so using the information that you obtain by searching for a model okay so that's knowledge knowledge acquired by the agent is a parameter w star or since you never except a very simple examples you never probably actually really reach w star so maybe knowledge is not a point but it's the process by which you asymptotically approach the point or maybe even better the the gadget or computational device which when work is expended on it produces more accurate approximations that's probably the best i can do so as a parameter w star whose associated model correctly predicts q just a quick question yeah um should i imagine that inside queue is like uh a secret point in this parameter space that's generating the things that are being measured to a first approximation yes but my next point is going to be that there may be multiple points in the space of parameters which predict exactly q so the point about computability is going in the direction of what in the technical language is called realizability which is that i mean you could certainly imagine that your model is so stupid that no parameter actually reproduces q right and certainly you know many very complicated processes out in the world
can can never be matched exactly by somehow model that will ever be able to compute with but assuming it is possible to precisely match it with a model we do sort of have in mind that you've discovered the secret dna behind that process by finding w star that's right okay i'm going to give an example in which we can see this process and we can see the point i just alluded to which is that maybe multiple points in parameter space reproduce q and what would that mean so let's let's consider a pendulum and the measurements we make are the following kinds of pairs x y consisting of some initial angular displacement x i'll draw a picture in a second uh excuse me while i decrease the sum total of bitey organisms on the planet okay so that's the length of my pendulum one unit there's a mass at the end m i'm going to call this angular displacement uh i'll call it theta so the forces at work on this pendulum are gravity i assume that string tension cancels out the part that's uh parallel to this string and so what's left is uh this mg sine theta so many of you have done this exercise probably all of you okay so that's my system q the measurements i'm making of it are the initial angular displacements of the initial thetas and i'm going to say we're certain about those so we start the experiment at some angular displacement and the other measurement we make is we make is the angular displacement y after time t and that's subject to measurement error or i'll just call it noise which we assume is normally distributed okay so the probability that we measure angular displacement y given we started at x and given the various parameters that are involved so we assume these are fixed but perhaps not known so that's a normal distribution by assumption with some mean f that depends on x g and t right so we've got this is theta here's the truth that is if you were to actually perform this experiment this is where it would end up or perhaps it's better to say that if you solve the equations of physics this is where you would say that the pendulum must be after time t so that's an indication of the probability distribution that you'd actually sample from if you made the measurements okay so here f is the sort of actual displacement according to newton all right so let me write down the actual formula for what f is uh it's an exercise to do the do the physics if you wish [Music] but it's interesting to see in the context of the story i'm about to tell what it looks like so solving
so so making the assumption theta is small is the usual usual way you do this and you solve that d e and you'll find that f is given by the following formula oh i forgot to add one slightly artificial wrinkle okay so i'm assuming that there's initial velocity so assuming initial angular velocity uh v zero that depends on the angular position so that is you give the pendulum a little bit of a push and the magnitude of the push depends on how far away it is and let's say for simplicity that it's proportional to x okay so then you have this formula i'll call that double star i guess you can still see this on orbcam so i'll indulge myself and put a star over here as well for the distribution itself okay what's what's the point of all this [Music] we can attempt to estimate i mean what would it mean to predict future measurements well we can solve the equations and we can get this formula but we're given a list of x's and t's right the pendulum's swinging and somebody's communic maybe it's in a different room we're not allowed to mess with it but somebody tells us the experimental setup and gives us a list of x g pairs to predict new ones well one way we could do it is we could solve these equations and get this formula because we know it's a pendulum we know the laws of physics but to compute this formula we need to actually know what g t and lambda are it's not enough just to have the x right somebody tells you the x or maybe there's an agreed upon list of x's going into the future i don't care really about x [Music] but you need to know these parameters in order to make the prediction all right so this is the learning process right learning this this list here is the parameter vector is the point of w so w if you like you could take it to be for various reasons we kind of want it to be compact so let's i don't know pick a large integer and make w this space okay so our model which is either f or the distribution star they each determine each other so it doesn't really matter which we think of as the model but our model is parameterized by these three numbers and the question is which three numbers reproduce the distribution that we see okay are there any questions at this point sorry to interrupt yeah okay good uh i'm a little confused what the measurements contain because on the first slide i see that they're like x y pairs but then i think you said like xt pairs or xg i don't know if you misspoke at some point but what are the measurements worth taking uh the measurements are xy pairs
okay the because if i imagine a pendulum moving i can imagine getting different y values but yeah there's right there's a bit of us the x value is there's a bit of an artificial component to this which is that you only get one y value at a particular time t so somebody tells you they give you a pair of numbers 0.1 comma 0.2 and they say okay i released the pendulum at angular position 0.1 and then after some time which is fixed i use it for all the experiments but i won't tell you what it is that's capital t it wasn't y but i only get that one position of that okay per experiment and then i get another pair uh i mean if you like i probably get many pairs with the same x which would tell me you know they might not be the same because it's not deterministic right because there's noise in the measurements so even if you you could get two pairs point one comma point two and point one comma point two one uh because you're sampling from a distribution due to the noise in measurement and you know other sources of noise does that make sense yeah why is x part of the measurement and not just like because it's one of the parameters like why why aren't the measurements just y values yeah i think i think that's that's a good question that's part of the story here that i probably could have done more clearly let's let's agree that there's there's a known distribution of x values like say they're uniformly sampled from so let's say we agree that the experimental protocol experiment protocol is that x is uniformly sampled from i don't know minus pi on four to pi on four or something so you just you get the samples you know what the distribution of the x is going to be so that's that's known but you don't know given x what the distribution of y's is except from the samples because you don't know the value of these parameters lambda g and t you'll notice the mass dropped out here and that's because as you know if you've done this calculation the mass cancels out so that's another part here that maybe could have been more elegantly presented but really this model depends on these three parameters cool that's the second okay so what do i want to say now ah so here's a very interesting thing to notice look at the way that the parameters g and t appear in this formula so the map from parameters that is these triples to models is not injective i'll comment on the exact form of this non-injectivity in a moment but let me just make this point with a diagram first well someone tell me what what
what's something i can do given a model how can i change the parameters and not change the model you could increase t by like the period that the pendulum is yeah that's a good one that too yes that's another source of non-injectivity but what's one that actually interacts the parameters like do something to one parameter and something to another parameter and not change the model so make root g times t equal to constant right so i can i can double t and reduce root g by a factor of two that is reduce g by a factor of four and i won't change the model well except i'll have to do something to lambda okay so i'll move back over to the first board and we'll discuss the exact the exact form of this degeneracy or singularity okay so what knowledge is to be gained from this experiment well we we want to know the true values of these three parameters i'll call them lambda star g star and t star but in the formula we just wrote down so f if lambda star g star t star worked i.e the distribution of y values given x values for these parameters was exactly q then so too would the following triple or the following collection of triples if i multiply lambda by a g by a squared and t by 1 on a for any positive a all of those give the same model right so they're also true parameters okay so which which one is it then or maybe a better question is who cares they're all producing the same model i chose the parameters i chose the models that's not out there in the world right so why should anybody care that there's more than one parameter which reproduces the truth does that have any actual significance okay so let me let me plot w roughly speaking so on one axis i'll put t on another axis i'll put both lambda and g because they're scaling roughly the same so here's w star our true parameter and here's a one-dimensional subset of also true parameters so this direction is as a goes to zero and this direction is a goes to infinity if you'd like this kind of ends out here at um zero zero infinity okay so they're all true okay so this is i think a pretty convincing example to show you that it's not weird to have a situation where multiple parameters or remember these are our sort of by algorithm here i mean this formula right i mean this formula is not synonymous with the function f the function f is properly understood to just be the set of all input output pairs computed by this formula the formula itself is is a means of computing that function okay so we're identifying parameters
with those formulas or algorithms multiple parameters determine the same model so that's the first point i i think it's important to understand and by model that's sort of a function or distribution f or p now i'm writing it like this because i think a very important point here is actually the same point as uh as the point in say theoretical computer science that there's a distinction between algorithms and functions and given a function there may be multiple algorithms which compute it uh for example there are multiple sorting algorithms that will put a list of integers in ascending order or there are multiple algorithms for computing multiplication of integers and so on so given a function there'd be maybe multiple ways of actually computing that function sort of in practice so to speak and that's that's a kind of fundamental point both in computer science and in learning okay [Music] so this kind of model where there may be multiple models and even sort of spaces of models like there's a one parameter space here this kind of model uh where there are multiple models determining this sorry multiple parameters the determining the true model is called strictly singular and this is the terminology of sumio watanabe whose book algebraic geometry and statistical learning theory [Music] is the primary reference for this seminar and these kinds of models are what singular learning theory is about the singularity here refers to the degeneracy that is the fact that there are multiple parameters determining the same model now the main point i want you to get from this example is this is not a weird situation right you couldn't imagine a more simple system to measure in the world than a pendulum you couldn't imagine a more canonical model than newton's equations and yet here we are we have a strictly singular model so you know if you think this is a weird thing then you're wrong now i want to get to the end of this by giving you at least the brief idea of why geometry might be relevant to learning theory and to singular learning theory in particular so that's one thing i want to do i also want to maybe give you some idea of why this singularity or degeneracy that i'm illustrating here matters right maybe you could just say well fine just somehow mix up these parameters if there's not really three independent parameters but something like two which is what's sort of indicated in this picture right because you can imagine these sort of level sets and maybe there aren't three independent
parameters but really only two sort of moving away from this locus sort of perpendicular to it you could imagine that well maybe g and t aren't really independent so i should just invent a single parameter that sort of encapsulates both of them that's right that is something you can do but nonetheless the fact that this degeneracy exists does in many examples have a significance it's not it's not some mere phenomena that appears just in the mathematics we use to model things well uh it both appears in the mathematics we use to model things but it also strongly affects things like the learning process right because learning is not it's not just a question of can you fit the model to the true distribution the question of how you learn things and the nature of knowledge and how it gets used does depend on your model and these singularities do affect that that's one of the implications of watanabe's work okay so i want to quickly sketch that and we'll spend the rest of the sessions developing all of this more formally and illustrating or proving that some of the things i'm asserting are true okay so i want to go back and talk more about the learning process so keep that example in your mind so we're we get we get samples from this pendulum and we're trying gradually to improve our estimates of what lambda g and t are that is we're sort of trying to get closer to this green line i drew in the in the plot starting at some point in w far away from it potentially so as i said earlier in practice we never know we may never know a true parameter to infinite precision perhaps you could look at that formula over there and solve it for lambda g and t or maybe you do gradient descent to find values of lambda g and t that reproduce the measurements uh or you know there's a bunch of other ways you might try and compute the true parameter but if their process if you're not just solving it analytically if you're doing say gradient descent well maybe it never actually stops right there's some finite precision to your experiment to your computation and you can get better and better approximations to the true parameter but in many cases you'll you'll never actually maybe know like the true value unless you happen to be able to solve it analytically the point i want to make is that the degeneracy or singularity i'm not going to explain really what this word means today right just think about it as being synonymous with the degeneracy that is the lack of injectivity of that map
so this degeneracy in the mapping from parameters to models well it affects the relationship between increasing precision so getting more decimal places in your estimates of lambda star g star and t star and i'll just call it decreasing energy i don't mean energy of the pendulum i mean kind of energy of the sort of process by which you're finding your true parameters you could think about that as the loss if you're doing gradient descent i.e how close your predictions are to q so again this will be something we defined carefully later but for now let me just say it like this so you receive these measurements about q from the environment you've got a model if you have some parameters in your model you can use it to make predictions and then you can compare them to the measurements you're getting from the true process and then you have multiple ways but you pick a way of comparing those predictions to the truth right the the kl divergence for example um but you have some numerical measure of how close your predictions are to q and you want to tweak your parameters so that that measure of closeness decreases and that measure of closeness is what i mean by energy okay so there are two things at play here there are the precision of your estimate of the true parameter and how good is your current estimate that is how close are the predictions to the truth from the environment okay um so i'm just going to call this e so that's the measure of how close the predictions are using w as your parameters w stands for triple lambda gt okay well let's do a little calculation so suppose for a moment that near this particular true parameter and there may be more than one e behaves like a quadratic form i can hear background speaking i think that's billy probably mute so suppose for a moment that near w star we have a formula like this alpha is just some parameter we don't really care about it but this is just to say that the energy has a bowl shape near the true parameter so in particular the search process if it was gradient descent would look something like a simple harmonic oscillator moving around in a potential well near w star maybe i'll draw that on the next book okay so here's a picture of billy will be sad if i don't use a straight line for that so here's my energy and here's the coordinate that indicates moving around in the space of parameters here's one true parameter here's another one okay so let's call so w star and w dagger are true parameters we know more than one exists just pick two
of them uh well i guess i'm conflating the general story with this particular example right now i'm talking in general so this is really a story about any model not this particular one with the pendulum uh but okay pick two pick two true parameters now my assumption is that e looks something like a parabola there but let's suppose it looks flatter than a parabola near w dagger so this is this is sort of like x squared and this might be like something like x to the 10. right if you zoom in between zero between minus one and one higher powers of x look flatter they approach a square well potential as the power goes to infinity okay so here's w star let me put on this diagram the two things i talked about precision and error so here's zero error here's a little bit of error e0 plus delta e now why do i care about a little bit of error well as i said the search process is maybe actually what we mean by knowledge not w star itself so you can't ignore the process and what does that process look like near w star well it's a simple harmonic oscillator it's going to i mean this is what i'm drawing here is a discrete approximation to a simple harmonic oscillator that you might get from doing something like gradient descent okay now there's noise involved then your search process so say this is stochastic gradient descent so you're getting samples from the environment and you're using them to look for w star but there's some noise in the sample so it's never going to be literally a simple harmonic oscillator it's more like our pendulum where it's being buffeted by small motions of the air and then such things okay so your your search process is going to bounce around in here and most of the time it will be below some energy level but it will never maybe actually get to w star so what what does it mean to get an estimate of w star well you could take the mean for example of all your positions of your simple harmonic oscillator that would give you some estimate of w star okay so that's one part of the story the search process the error and now let me indicate on the diagram what i mean by precision so precision is how close to w star well it's a bit of a fiction to draw w star on this diagram right you don't know where w star is unless you can solve the system exactly you don't even know that there's this bowl all you see is the blue part right so you're looking at this blue part and you think ah there's a bowl there and w star is kind of in the middle but you don't literally know where w star is
maybe you only have some idea that it's in this interval right so delta w is what i mean by the precision okay that's the setup now i want to give the relationship between delta w which is a measure of the precision and delta e which is a measure of the temperature or the degree to which your approximation process is really sort of um well close to giving you a particular value okay so by assumption right the energy at w star plus or minus delta w is approximately alpha squared times delta w squared so this is delta e right i drew these so that these meet that's the uh maybe that's not clear that's the relationship between delta w and delta e all right okay so what does the simple calculation say it says that to if you want your estimates to sort of cluster cl more closely together that is you want to decrease delta w so to halve the box containing w or w star rather well you must decrease delta e by a factor of four or to put it differently to know the truth with twice the precision you need to reduce the energy or loss by four times and you might think about that as reducing the temperature of the room in which the pendulum is swinging right so some work is required to do that you do some work you can get your approximation process to cluster more closely to the truth and the difficulty is related to what factor you need to reduce the energy by and in the case we're near the truth the energy looks like a quadratic form the factor is four okay i have one more diagram to draw and i think i'm kind of on time are there any questions about this picture before i draw the more interesting one at w star is this uh this kind of reminds me of the very first medium lecture we did you were talking about like the power law or something is this are these the two uh this is the same concept here [Music] well i guess they're somewhat related but i don't think it's fair to call them the same idea no okay i'll try and keep it on the same board i guess uh so let's suppose that near w dagger the energy the measure of how far away from the truth our distribution using the parameters w is uh suppose it looks like uh x to the 2k k here is some integer but i just i just mean it's an even power of x and i want the power to be a parameter well doing the same calculation that is thinking about some search process that's bouncing around here below with a given energy threshold [Music] you do the same calculation you'll find that in this case delta e is alpha squared delta w to the 2k
right not surprising well so what well that means that halving the box you have to reduce the energy by a factor of 2 to the k right if you put a half over here to cancel it you need to put 1 over 2k over there well seems like a bad thing right it's harder to get your estimates to cluster around the truth but you sort of need to stand on your head a bit and see that differently okay so rather than thinking about it as being i need to reduce the energy more to get my estimates to get a more decimal places in my estimates of w dagger think about it like this there are many parameters near w dagger that work just as well that is changing the parameter like changing the tenth decimal place in the parameter barely changes the model or maybe doesn't change it at all the way i've drawn it here the true parameter is isolated we saw already in the example with the pendulum that the true parameter often is not isolated you can make variations in the true parameter that literally don't change the model at all not just don't change it much okay so what's the significance of this bowl here at w dagger being flat that is there being many nearby parameters that don't change the model much well the way to read that or to understand it is to think of it as saying that the the truth is present at a more coarse-grained level at w dagger rather than at w star uh maybe i'll just erase here for a moment so if you're near w dagger you can use less decimal places and it won't matter right that is the the model you're using is effectively using a less high or lower resolution version of rn than you are if you're working at w star so maybe i'll write this the truth at w dagger [Music] is present at a more coarse grained level i.e using less decimal places so you need less bits to specify it well the principle of occam's razor is that a simple solution is better so you might expect that since i mean it's it's complicated i don't want to pretend this is simple uh of course if you want to actually write down both w star and w dagger you need the whole space and you need to write down how many decimal places however many decimal places you need to write down so when i say less bits i mean locally so locally near w dagger i change in a model comparable to a similar one at w star that needs 10 bits to specify near w star may only need five bits near w dagger so local changes can be specified with less bits and so by occam's razor you might expect the truth w dagger to be better a simple model is a better model
and a simple model is one that needs less bits to specify and finally this better is a direct result of the map from w to p being more singular or more degenerate near w dagger uh if you think about uh i mean the mapping from x to x squared is less degenerate than the mapping from x to x to the 2k maybe that's something we can discuss later it's not meant to be immediately clear you might think about the weierstrass preparation theorem if you if you want to see the sense in which i mean that okay so here's the storyline uh many models are singular that is there are multiple points in the space of parameters that determine the same model and determine the truth that degeneracy matters because the learning process happens on the space of parameters not out in the world it's in your space of parameters maybe those parameters are the parameters that make up the neural network in your head or in your computer so it's it's maybe in some sense made up but it's how you learn you learn in a machine and a machine has numbers in it and those numbers are the parameters so the learning process happens on w and the learning process is strongly affected by this degeneracy and the way in which it's affected is that near some true parameters the learning process behaves differently and has a different relation to information and the number of bits needed to specify a solution than it does at other true parameters so not all true parameters are created equal some are better by the occam's razor principle and that betterness is related to the fact that the mapping from parameters to models is more degenerate near some true parameters than others and all of this is what singular learning theory is about so singular learning theory proves that much of what i'm saying is correct of course i'm stating it in somewhat imprecise language but you can prove that more singular true parameters generalize better that's a theorem of watanabe which hopefully we will prove okay and this suggests a strong role for geometry in learning because the nature of the singularity at w dagger is strongly related to the behavior of generalization of models that live near w dagger and all of that is explained by watanabe's theory okay so the purpose of this seminar is to examine this link between algebraic geometry and the foundations of statistical learning theory following watanabe's book and i think i'll stop there for questions thank you i'm just seeing some questions in the
chat okay is algebraic geometry prerequisite uh no i mean this this material is going to get pretty technical pretty fast because it's difficult to avoid that but in principle you don't need to know any algebraic geometry i have a question about the parameter space yep um when we were talking about like having multiple possible true models like in the parameter space and it sort of occurred to me that like maybe that only arises because you like chose too many interdependent variables i don't know if it's possible in this case but if we define the function f just in terms of a then there would be a single true parameter right that's right yeah so you can change you can change your model to eliminate these degeneracies that's right uh but then keep in mind that there's a there's a trade-off there right so if you change your model in that way it may no longer be the case that it's easy to i mean that may not be the model in which you actually want to learn so the search process for the true parameter is its own thing and it will work better with some parameterizations rather than others so you can you can mod out so to speak by these degeneracies uh but then you you may not effectively be able to search on that quotient and this is it's it's this is less subject to mathematical theory at the moment than the other part it's it's hard to say what exactly i mean by that but you could say the success of something like deep learning is is a very good illustration of that deep learning models are wildly singular so you might think well it would be much better to have a simpler smaller parameter space which is less degenerate and search on that but in practice no there are there are good reasons to search on a space which is larger on which the mapping has more degeneracy maybe the search process can work better there exactly why that is remains a bit mysterious but um yeah in principle you you can't always just eliminate the degeneracy uh by by making these identifications at least not without making something else somewhere something else more complicated i think there's also a practical answer to that which is uh that uh that the fact that you can see um a non-degenerate uh parameterization uh relates to the fact that you already have knowledge of of how the pendulum works but if you are just a machine a general kind of machine attempting to learn multiple things like a deep learning model uh you might not have that luxury right there's still a sort of like uh like if you reduce this alter just
the parameter a like there's still the job of finding the correct value for a that you would still have to learn that's right yeah and also um the the the property of being degenerate or degenerate or and also singular is a function also of what the truth is as well so if you use this model to learn other truth um it might not it might no longer be uh singular or degenerate right right right um thinking about dan what you were saying about for some reason sorry learning let me interrupt that question and just add something to your previous question otherwise i'm going to forget sorry to interrupt you um i think it's it's worth paying attention to the meaning of the parameters in this model these aren't just randomly chosen parameters right when you are studying a pendulum there is gravity and the coefficient g is completely different to the parameter t that relates to how long you paused before you took the measurement they're independent factors now when you have a model in which they are independent factors you have singularities you could change the model to mix the two physically that would be nonsense right to consider root gt as like being some fundamental physical thing doesn't make any sense but you could do it statistically it could be a single parameter in your model but then actually you're not modeling the real world very well because somebody could change t or the variations in t have a completely different origin to variations in g so if your true distribution were to change you would be completely screwed if you mixed root g and t you would be completely confused by a small change in t you would think maybe it you couldn't tell that apart from a change in gravity whereas if the parameters are independent and actually their own degrees of freedom then there are changes in the true parameter which you would be able to understand as being actually independent so there's a sense in which the real world does have independent degrees of freedom that you you can't just sort of you can't just identify them or quotient out by relations between them in the mathematics without paying a price somewhere okay so that's the point i want to make about that but what was your other question uh i'll i'll just reserve that question and i've got a question and like reply to what you just said um i then wonder like how like i imagine in many contexts like you have a problem to solve and you sort of like understand all of the variables that probably like determine what the model should be
and then so it's clear what the parameter space should be but i wonder like uh like all of this process of doing learning here seems like first you have to figure out what the right parameter space is and then you do learning um like is there any like uh theory of like figuring out the right parameters to use or like changing parameters like adding or eliminating parameters uh adding or eliminating parameters is a more subtle thing i guess but like model selection which model you should choose is a big part of statistics and the tools we're discussing uh provide means of in principle making those choices right if i were to say should i take a model with 100 parameters in this particular form or a different model with 10 parameters in this form there is a there are principled means of answering that question i mean maybe in practice you can't actually compute the answers but the principled comparisons exist uh yeah i don't know if you want to add to that admin uh yeah so model selection is definitely what we'll be looking uh we will look into model selection uh using singular learning theory as well as um the usual uh statistics way of saying whether or not less for example the bayesian information criterion is kind of a usual way of trading off um more parameters versus how accurate your your um your model is at the moment with certain parameters for example so yeah that's definitely a concern of statistics i mean you could say that and you might go ahead i was just going to say that one way of thinking about it might be that um singular models sort of give you flexibility in the parameter space is maybe a word you could use so um a priori you don't necessarily know whether a 10 parameter model or 100 parameter model is going to best model your truth and the singular structure sort of allows that search process to happen across that span of um of different number of parameter models which is not okay not the case in regular models um so my reserved question was you you were talking about uh i was like for context i was talking about what about the the parameter space that's just like one dimensional and you were saying that uh learning tends to happen better on like this sort of three-dimensional parameter space than that like one-dimensional space um is that like somewhat because like with more dimensions you can sort of move around off uh the degeneracy or the singularity i don't know what you call it but like in the if you were like if the only degree of freedom was moving
along the line i don't know it strikes me as intuitive that like being able to move off the line and around in different like dimensions would sort of allow you to take more paths to like have more paths available to get to these to like a true parameter model yeah i guess i might have some intuition like that but it's it's very hard to i i don't know that i can really give any story like that that is that is not just mere what um yeah right so i don't really i guess there are many stories like that you might tell i don't know anything really convincing that i can tell to you which explains like in some simple way why having more parameters might be better i i will say something like this which is that um you know you could just artificially make your model more singular right i mean we could have in our model t prime and instead of t put t t prime there right add a parameter and just put in the product that'd be more singular but you you would be surprised if that really helped you any right now that's because the model you're taking there is more singular than the truth in some artificial way right the truth is that formula [Music] more so than it is the function the truth is the the actual generative process of physics right physics really has so we think gravity in it and time and those are different things sometimes and so that formula is is sort of actually what is going on and if you choose a model which is artificially more singular than what is really going on maybe it doesn't help so you can't just add parameters and sort of expect to magically do smarter things i guess scan was asking a similar question in the chat however many real world phenomena are way more complicated than any model you'll ever write down say english language and that's part of the reason why just adding insane numbers of parameters to deep learning models you never quite overshoot like that you don't have to be worried about overshooting and making your model more complicated than reality reality is more complicated than any model you'll be able to compute with in your data center so don't worry about that uh so in some sense you can expect in some regimes that merely adding parameters and making therefore your models probably more singular but maybe not might help but the storyline for why that is i we can tell you the mathematics which says to some degree why it is but the the storyline is sort of i don't know uh hinges on a deep understanding of resolution of singularities or something so it's not