WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed. The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.

Synopsis


This seminar lists open problems in singular learning theory, with the one hidden layer tanh network as the central test case. SGD is being studied to understand its relation to the Bayesian posterior and its connection to deep learning, and the relationship between classical computer science algorithms and the singularities of analytic or algebraic varieties is being explored. Research into one hidden layer tanh networks considers deeper networks, other non-linearities, biases, prior choice, and phase transitions. These could provide insight into why deep learning works across many true distributions and help bridge the gap between programs and learned algorithms.

Short Summary


Deep learning theory is stuck partly because people are impatient, but there is still value in understanding the one hidden layer tanh test case. Toric methods could be used to extend and improve upon the work of Aoyagi and Watanabe, and the renormalizability condition and a systematic resolution of the singularities of these networks remain to be worked out. Liam Hodgkinson believes that the n goes to infinity limit with d fixed is not the most relevant one for deep learning theory; of the three asymptotic regimes commonly considered in statistics, the regime where n and d go to infinity together at a fixed ratio has not yet been considered in SLT and has a random matrix theory flavour. These asymptotic results are meant to inform what happens in the finite case, where d is often much larger than n.
How the number of parameters d of a deep learning model is increased matters: adding width, depth, or different kinds of modules (for example an attention block versus a fully connected layer) changes the RLCT lambda in different ways. Universal approximation results suggest that if the width is kept around the sum of the input and output dimensions of the target function, the depth must grow as the modulus of continuity of the function increases. Lambda is a function of the true distribution: it is related to the codimension of the set of true parameters and is upper bounded by half the number of orthogonal directions. A systematic study of the relationship between SGD and the Bayesian posterior for a one layer tanh model is needed to understand why deep learning works across many true distributions. Empirical plots of trajectories with injected noise (Langevin dynamics) could be used to explore this relationship.
SGD is being studied to understand its relation to the Bayesian posterior and its connection to deep learning. Research is also being done to understand power laws and the possible medium-n behaviour of learning curves for renormalizable models, which could be related to the power law behaviour seen when training Transformer models. To understand this, one layer tanh networks are being studied, taking into account renormalizability conditions, upper bounds for the energy and the RLCT, and lower bounds or equalities. This could provide hints for further research.
The speaker discussed further problems beyond one hidden layer tanh networks, such as deeper networks, other non-linearities, and biases. These are open problems, with no clear idea yet of how to handle deeper networks. The choice of prior, often made to keep the mathematics tractable, and how it affects the calculations was also discussed, as was studying other non-linearities such as swish via a Taylor series expansion. Jet schemes and arc schemes are mathematical tools for understanding the singular structure of a problem, and Balasubramanian's paper on statistical inference and statistical mechanics on the space of probability distributions has an asymptotic expansion whose higher order terms have geometric content. James Clift, Tom Waring and James Wallbridge have explored a related line of investigation connecting programs to singularities.
Recent research has explored connections between classical computer science algorithms and the singularities of analytic or algebraic varieties, which could provide a way to understand the algorithms learned by neural networks. Phase transitions, algorithms inferred from data, and renormalization versus resolution of singularities are all relevant to bridging the gap between the old world of programs and the new world of algorithms inferred from data. Work from Anthropic and Chris Olah's group shows phase transitions when small Transformers are trained on algorithmic tasks, and Watanabe's work provides only a partial answer for non-renormalizable or non-essentially-unique cases. There is also the question of how the thermodynamical dictionary between statistical mechanics and learning theory extends to singular models.

Long Summary


Deep learning theory is stuck in part because people are impatient, but there is still value in understanding the one hidden layer tanh test case. To better organise this, a list of general theoretical problems has been made. (To listen in the virtual world, one walks up to the orb and holds down E, or clicks and holds it; video recordings will be posted, but the boards cannot currently be exported as a PDF.) There is no shame in thinking about one hidden layer tanh, as it is something we understand, albeit not completely.
Toric methods could be applied systematically to one hidden layer tanh networks to redo, extend and improve upon the work of Aoyagi and Watanabe on computing RLCTs. The renormalizability condition, in the sense of the asymptotic learning curve paper, needs to be resolved for these networks, as does finding a "high road" to resolving their singularities; such methods should be more general than just this example. A minimal example of the kind of combinatorial computation these methods enable is sketched below.
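As a loose illustration (not taken from the seminar) of the Newton-polyhedron style of computation that toric methods make possible, consider a KL divergence K that is locally a monomial in the parameters w; the notation below (K, w, lambda, m) is the usual SLT notation.
    K(w_1, w_2) = w_1^2 w_2^2,
    zeta(z) = \int_{[0,1]^2} K(w)^z dw
            = \int_0^1 w_1^{2z} dw_1 \int_0^1 w_2^{2z} dw_2
            = 1 / (2z + 1)^2.
The largest pole of zeta is at z = -1/2 with order 2, so the RLCT is lambda = 1/2 with multiplicity m = 2, whereas a regular two-parameter model would have lambda = d/2 = 1. The data of the poles can be read off from the Newton polyhedron of K, which is the combinatorial flavour referred to above.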
Liam Hodgkinson, a new faculty member, believes that the n goes to infinity limit with d fixed is not the most relevant one for deep learning theory. Three asymptotic regimes are commonly considered in statistics: n going to infinity with d fixed, d going to infinity with n fixed, and both n and d going to infinity together at a fixed ratio. The last regime, which has a random matrix theory flavour, does not appear to have been considered in SLT and may be more relevant for deep learning. These asymptotic results are meant to inform what happens in the finite case, where in practice d is often much larger than n.
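In symbols, with n the number of samples and d the number of parameters, the three regimes mentioned are (the labels in parentheses are glosses, not the speaker's):
    (1) n -> infinity, d fixed                 (classical asymptotics, the Watanabe setting)
    (2) d -> infinity, n fixed                 (overparameterised limit)
    (3) n, d -> infinity with d/n -> c > 0     (proportional regime, random matrix theory flavour)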
Different ways of increasing the number of parameters d of a deep learning model can have very different effects, and the resulting behaviour of the RLCT lambda as a function of d differs accordingly. Universal approximation results show that if the width is kept around the sum of the input and output dimensions of the target function, the depth needs to increase as the modulus of continuity of the function grows, that is, as the function becomes rougher or more finely detailed. How this relates to the n goes to infinity limit is not clear.
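For reference (a standard definition, not spelled out in the seminar), the modulus of continuity of a function f on a domain X is
    omega_f(delta) = sup { |f(x) - f(y)| : x, y in X, |x - y| <= delta },
so a slowly decaying omega_f as delta -> 0 corresponds to the "rougher" functions that, in the width-limited universal approximation results alluded to above, require greater depth.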
Lambda (the RLCT) is a function of the true distribution: it is related to the codimension of the set of true parameters W0 and is upper bounded by half the number of directions orthogonal to W0, hence by d/2. The speaker wonders why deep learning works across many true distributions and suggests that a systematic study of the relationship between SGD and the Bayesian posterior is needed, even for a one layer tanh model in very simple examples. There are already empirical plots of Langevin trajectories (gradient descent with injected noise), which could be used as a starting point for exploring this relationship.
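For context, these are the standard asymptotics from singular learning theory in which lambda appears, in the usual notation (F_n the free energy, L_n the empirical loss at a true parameter w_0, m the multiplicity, G_n the Bayesian generalization error):
    F_n = n L_n(w_0) + lambda log n - (m - 1) log log n + O_p(1),
    E[G_n] ~ lambda / n   (as n -> infinity, realizable case),
    lambda <= d/2, with equality for regular models.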
SGD appears to be affected by singularities, and quantities the RLCT encodes, such as the codimension, affect Langevin gradient descent dynamics. To check whether SGD approximates the posterior, one can run it many times, keep track of where it ends up, and compare a scatter plot of the endpoints with the posterior for a simple two-parameter one layer tanh model. To connect SLT to deep learning, one also needs to relate quantities such as the Bayesian generalization error to what practitioners actually measure; questions as basic as how test loss scales with n remain unanswered and require experiments. Edmund is currently working on this.
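A minimal sketch of the kind of experiment described, assuming a two-parameter one hidden layer tanh model f(x; a, b) = a * tanh(b * x) with the true function zero; the sample sizes, noise level, learning rate and use of NumPy here are illustrative assumptions, not the speaker's setup.
    import numpy as np

    rng = np.random.default_rng(0)

    def model(w, x):
        # two-parameter one hidden layer tanh network: f(x) = a * tanh(b * x)
        a, b = w
        return a * np.tanh(b * x)

    def run_sgd(x, y, steps=2000, lr=0.01, batch=32):
        # plain minibatch SGD on squared error, returning the final parameter
        w = rng.normal(size=2)
        for _ in range(steps):
            idx = rng.integers(0, len(x), size=batch)
            xb, yb = x[idx], y[idx]
            a, b = w
            err = model(w, xb) - yb
            grad = np.array([
                np.mean(2 * err * np.tanh(b * xb)),                # d(loss)/da
                np.mean(2 * err * a * xb / np.cosh(b * xb) ** 2),  # d(loss)/db
            ])
            w = w - lr * grad
        return w

    # synthetic data from the "true" function f = 0 plus Gaussian noise
    n = 500
    x = rng.uniform(-2, 2, size=n)
    y = rng.normal(scale=0.3, size=n)

    # run SGD many times and collect the endpoints; a scatter plot of `ends`
    # can then be compared by eye with samples from the Bayesian posterior
    ends = np.array([run_sgd(x, y) for _ in range(200)])
    print(ends.mean(axis=0), ends.std(axis=0))
Posterior samples for the comparison would have to come from a separate sampler (for example MCMC on the same loss); that part is omitted here.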
The sixth problem concerns power laws and the possible medium-n behaviour of learning curves. As n approaches infinity, the Bayesian generalization error of a renormalizable model looks like lambda/n, but there may be an emergent learning curve that looks like lambda/n^gamma for some gamma not equal to one, and it is an open question whether this is related to the power law behaviour seen in training Transformer models. Understanding this for one layer tanh networks, subject to the renormalizability condition being resolved and to replacing some upper bounds for the energy and RLCT of other phases by lower bounds or equalities, could provide hints towards more general directions.
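In formulas, the contrast is between the proven asymptotic and the conjectured emergent curve (the constant c and exponent gamma are placeholders):
    E[G_n] ~ lambda / n                 (n -> infinity, renormalizable case)
    E[G_n] ~ c / n^gamma, gamma != 1    (conjectured emergent behaviour at medium n)
Empirically, gamma could be estimated as minus the slope of log(generalization error) plotted against log n.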
The speaker then listed problems beyond one hidden layer tanh networks: deeper networks, other non-linearities, and biases, the last being an open problem in SLT that Spencer has started on but not yet solved. There is no clear idea at present how to do deeper networks; the hope is that the work is mainly in the one hidden layer tanh case, with a kind of inductive step for each additional layer. A question was raised about whether a model p, true distribution q and prior have typically been fixed in these problems.
The speaker discusses the choice of prior and how it affects the mathematics. It is often chosen to make the mathematics tractable; as long as the prior is positive on the set of true parameters it does not change the leading asymptotics, but it can make a big difference to whether the integrals, and the replacement of the KL divergence by an algebraic function, are tractable. A Gaussian prior would be natural except that it is useful for other reasons to work on a compact domain. The speaker also suggests looking into other non-linearities, such as swish, which is close to ReLU and would therefore make the results look more serious, since nobody actually trains networks with tanh. It is unclear how different the mathematics looks when using a Taylor series expansion of swish as compared to tanh.
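For comparison (standard series, not derived in the seminar), swish(x) = x * sigma(x) with sigma the logistic sigmoid, and the low-order Taylor expansions are
    tanh(x)  = x - x^3/3 + 2x^5/15 - ...
    sigma(x) = 1/2 + x/4 - x^3/48 + x^5/480 - ...
    swish(x) = x * sigma(x) = x/2 + x^2/4 - x^4/48 + x^6/480 - ...
so, unlike tanh, swish is neither odd nor zero at low order in the even terms, which is one place where the algebra of the associated K(w) could look different.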
Jet schemes and arc schemes are, roughly speaking, ways of packaging the singular structure of a variety that can be more tractable than a full resolution. Balasubramanian's paper on statistical inference, Occam's razor and statistical mechanics on the space of probability distributions has an asymptotic expansion for the free energy whose higher order terms, at least in the regular case, have geometric content (curvature and so on). Something similar could potentially be done in the singular case, giving a geometric interpretation of terms currently hidden in the O(1) part of the formulas. This is more a geometrically well-motivated question than one of obvious statistical importance, but it is interesting for both geometers and those interested in statistical learning theory. A related line of investigation, begun with James Clift, Tom Waring and James Wallbridge, concerns the dictionary between programs and singularities described next.
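For orientation, a hedged paraphrase of the two expansions being contrasted; the exact constants and the form of the higher corrections should be checked against Balasubramanian's paper and against Watanabe for the singular case. Here hat-w is the maximum likelihood estimate, H the Hessian of the empirical loss, phi the prior:
    regular case:   F_n ~ n L_n(hat-w) + (d/2) log(n / 2 pi) + (1/2) log det H(hat-w) - log phi(hat-w) + O(1/n)
    singular case:  F_n = n L_n(w_0) + lambda log n - (m - 1) log log n + O_p(1)
The O(1/n) corrections in the regular case are the curvature terms referred to above; the question is what geometric content the O_p(1) term carries in the singular case, and whether jet or arc schemes help extract it.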
The knowledge learned from data is increasingly well viewed as an algorithm, for example the algorithm encoded by the weights of a neural network. Tom Waring's thesis explains how to go from a code for a Turing machine on a universal Turing machine to a synthesis (learning) problem in which the original machine corresponds to a point of the set of true parameters, that is, to a singularity of an analytic or algebraic variety. This embeds programs and singularities into the same place, and could provide a dictionary for talking about the algorithms learned by neural networks, by understanding what structures on programs, such as subroutines and composition, mean at the level of singularities.
Phase transitions, algorithms inferred from data, and renormalization versus resolution of singularities are all relevant to bridging the old world of programs with the new world of algorithms inferred from data. Recent output from Anthropic and Chris Olah's group shows clear phase transitions when small Transformers are trained on algorithmic tasks, suggesting a potentially deep link between phase transitions and the emergence of symbolic, algorithmic structure in these systems. Watanabe's work provides only a partial answer for non-renormalizable or non-essentially-unique cases, and there is a cryptic comment in his work that symmetry breaking will be important. There is also the question of the connection between learning systems and the thermodynamical dictionary.
People have been applying statistical mechanics to learning theory for decades, but the link has mainly focused on the regular case and not the singular case. There are still open questions about how to use the thermodynamical dictionary between statistical mechanics and SLT, even in the regular case, and even more so in the singular case. There is potential to use SLT to view various things in physics in a more refined way, and to use the knowledge of singular models to improve algorithms, for example in variational inference.

Raw Transcript


all right so I think we'll move on to talking about some open problems uh yeah this is not meant to be an exhaustive list um but these are going to be the ones that seem to keep coming up in our discussion so uh it's maybe worth making a list foreign so I'm going to put the list in context by making the following statement which is that deep learning theory is stuck because people are impatient foreign in the field but certainly I think it is one of the problems but we are patient so should we tell Susan how to attach this oh sure yeah sorry Susan I should have given you a bit of a bit of a heads up how things work so uh the easiest way to listen is to if you if you see this orb up here at the front this kind of ball on the ground yeah so if you walk up to that you'll see a little prompt near your feet which says attaches listener do you see that yep okay so if you hold down e or just click it and hold um you'll oh got it so now you're you should hear me more clearly and if you click orb cam up the top of your screen you'll get a view of the boards that's kind of a convenient so this is how people are often watching yeah how do I go back to muting myself when I'm in this orb Camp yeah unfortunately you'll have to go out of orbcam mute and go back in sorry unless you got arcade I guess yeah we can't hook a hotkey into the mute yet they haven't that's not uh possible yeah unfortunately yeah that's a highly desired feature okay we're patient uh now that's one way of saying long live one layer 10 h oh just curious like what are the contents of the board be available like or should we like if should I be taking notes or there'll be a video recording that will be posted um the the boards there's no way to export them currently as like a PDF uh but yeah so the the video is the accessible thing outside the world and these boards stay here so you can always follow that link that link you used to get here will always point to this place and the boards will always have whatever they have when I finish oh great okay got it thanks yep so the first series of problems I'm going to list are problems the sort of General theoretical problems but the kind of uh test case in which to study them is one hidden layer 10 h and there's no shame in thinking about one hidden layer 10h so we understand this case well uh what we don't completely understand it but we understand something about it and there's still a lot of value to be gained from thinking more about it I think we shouldn't be in a hurry to move
Beyond it although obviously eventually that's what we aim to do I think one should be not afraid to be patient and just steadily work through this very Central and basic example so with that context so the following list of problems sort of refer to one hidden layer 10 h they're not in any particular order so the first one I'm going to list is to systematically apply toric methods so toric methods means um toric varieties or toric ideals this is a part of algebraic geometry toric varieties uh I won't give the definition it's not really the place to do that maybe some other time the sotoric varieties a it's a kind of combinatorial part of algebraic geometry uh toric varieties are easy to resolve in the sense of resolution of singularities or at least the the resolution process has a kind of combinatorial flavor it's sort of can be visualized in terms of polyhedra and subdividing faces and stuff like that uh the the varieties that arise as true parameter sets of true parameters in one layer 10h networks are tauric varieties or at least can be treated by these methods so there's an opportunity to kind of systematically adopt this point of view to redo and perhaps extend what aoyagi and Martin Abe did to compute rlcts of one hidden layer 10h Networks and I think this is a an excellent exercise to do as preparation for going Beyond one layer so if we want to understand deeper 10 H networks or other non-linearities I think getting a really solid grasp on this part we sort of understand now modulo some details how to go from analytic to algebraic for one hidden layer 10h that's great as a result of Spencer's thesis and uh and work that matters done and work admin has done so that's part of the story but there still remains I mean I have not read and understood the area paper uh and I don't really want to read that paper either because I'm not convinced that the way they're doing it is kind of the most elegant way so I think we should find the the uh how to say it the high road to uh resolving the singularities of one hidden layer 10h networks and I don't think that's currently in the literature and these historic methods should be more General than than this example as well so feel free to ask questions as I go through this list and we can go back and add details jump around but I'll just proceed if nobody asks me questions the second is one that we've discussed but we haven't resolved yet which is renormalizability in the precise sense of the um asymptotic learning curve paper that
Edmund was speaking about a month or so ago so the renormalizability condition for one layer 10h Networks I guess it's as I said all of this is implicitly kind of do it for one layer 10 H and then see what we can learn to do more General cases this is the unrealizable case obviously okay um three phases and phase transitions uh we're interested in this for varying true distributions phase transitions that happen as n goes to Infinity and I want to put here other limits um and maybe I'll add this is the next problem other limits not just n goes to Infinity I was talking to Liam yesterday Liam hodgkinson who's a a new faculty member in our school and he was pointing out that in his point of view the end goes to Infinity limit as the wrong one I'm not sure I completely agree with him but certainly we do believe that it's interesting to consider limits that look like for example uh the ratio between the data set size and the number of parameters constant and taking both of them to Infinity at the same rate so this kind of limit where we're varying the number of parameters we really haven't thought about at all and this is uh yeah I think probably quite important in understanding uh well we're not sure that the limit that we've studied the N goes to Infinity asymptotics are really the ones that are relevant to to deep learning theory it may be a helpful starting point but it may not actually we might be distracting ourselves by not considering other limits so this n and D both going to Infinity that's quite at a certain ratio it's quite a common um you know asymptotic regime in statistics the traditional one is just and going to Infinity with d fixed that this is also um it's also very interesting and probably more relevant um there's also like the D going to Infinity with n fixed Richie these are kind of the three asymptotic regimes that are are are considered and then both of these go into Infinity kind of has this like almost like a random Matrix Theory flavor to it but yeah I I don't think watch Nabi has considered at all this uh n and D both going to Infinity yeah do you have any feeling about uh it's a Liam's point of view seems to be quite strong but he's just convinced the end goes to Infinity limit D fixed is just not interesting uh um I suppose not convinced but that is somewhat valid because uh I mean we we want to use these asymptotic results to inform what happens in the finite case and if you're if your your practice is such that you know D is much larger than n
then then it's not clear large and asymptotics really buy you a lot because the behavior will be quite different to the limiting result you get when um n and D are both going to Infinity hmm um analyzes of this this other this you know n d both going to Infinity maybe we can ask Liam for some examples of of people doing this in deep learning successfully yeah I think that'd be a good idea isn't Lambda than UD anyway in SLT yeah uh I mean it's hard to send Lambda to Infinity when the first three well I don't know the first one or two um open questions are how to calculate the damn thing in the first place but yeah it seems like Anna Verlander is enticing yeah I guess you could say oh sorry go ahead um uh that's the point that sorry it depends on what we I think it crucially depends on what do we mean by increasing D for example are we adding um so it so if in the one layer case that's kind of kind of um uh very clear it's just add more notes but it seems like the the the relationship between D and Lambda is um is such that how you add extra modules like if you if you are adding an extra uh Transformer block attention Block versus adding a extra full layer of fully connected layer I mean that changes Lambda in very different ways um so yeah so it it does things I think that was an echo of your own voice Edmund oh okay right um yeah it doesn't like um D might be the correct um been to increase but if we almost want the resulting expression to be uh Lambda as a function of D or something like that yeah I think it's you're right that it does make a big difference whether you send the width to infinity or or the depth that fixed width to Infinity uh so we know from some of these Universal approximation results that roughly speaking if you keep the width to be something like the sum of the input Dimension and output dimension of the functions you're trying to approximate then you need to make the function deeper as the modulus of continuity of the function gets bigger or larger I forget which meaning that the the kind of rougher or the more fine the detail in the function the the more depth you need so you can imagine that I don't know how that relates to n going to Infinity though you kind of Imagine KN getting simpler as n goes to Infinity right so maybe maybe that doesn't that connection doesn't make a whole lot of sense but yeah it does does seem like uh not any way of increasing D is really the same limit different ways of adding nodes or parameters at different limits
uh yeah regarding Lambda I mean I guess Lambda is kind of the co-dimension right proportional to the co-dimension so it's kind of w0 sitting there and then you have the number of orthogonal directions divided by two is the thing that upper bounds Lambda so if you if you're adding lots of Dimensions that don't do anything then yeah you are increasing Lambda probably uh but yeah you're not in some sense maybe in some real sense you're not really changing the geometry so uh yeah I don't know I don't know how to think about that while we are in this context I kind of also want to point out something that I um kind of have a thinking area before which is the the the problem that we wanted to um solve or the phenomena to explain is why does deep learning work well for uh for different true distributions um and it's almost like across many true distributions and I kind of want to point out that Lambda is a function of the true distribution as well um it's so um yeah yeah um so it seems like there's a lot of subtext hidden in the um sentence and and then the quantity Infinity yeah I understand that this renormalizability condition is is needed for the sort of power law discussion or at least in our current point of view cool um okay I'll move on to the next problem I guess unless there's further comments about those four is the renewalizability condition for the real sorry renowned renormalizability for the realizable case like understood is that why he wrote a realizable implies renormalizable in this definition yeah yeah okay so the the question is only a question in the unrealizable case yeah uh sir problem five is SGD versus the posterior for one layer 10 h uh this is something that really we should have done some time ago I guess uh one could do this for much more complicated models but even for one layer 10h I don't understand what the relationship is between SGD GD and the Bayesian posterior even in very simple examples um so I feel like there's a kind of systematic study of this that is a kind of um foundational thing that needs to be done at some point um and then don't you have some empirical results on this where you just took like a two-dimensional one-layer tannage and and looked up you know the behavior of SGD as as like a process um versus yeah empirical yeah sorry again uh it's uh it's not okay so it's more like the launchable um path instead of SGD power so there's a noise artificial injected in that and it's more like I have empirical plots of trajectory but
um and I can say that they are affected by singularities but um and and obviously the sort of the rlct talk that the kind of things our city encodes like the code dimensions and things like that does affect um Behavior the launch upon um grilling descent Dynamics but beyond that Beyond saying that yeah it obviously changes I don't know how it changes but I mean there should be some way to verify empirically if SGD or you know take a simpler version of SGD you're like driven um uh process like it if if it's a good approximate sampler from the you know for the posterior like if you just run SGD many times and just keep track of where it ends up we'll just do a scatter plot and see if that looks like the two-dimensional one layer tan H posterior an even simpler thing that I don't know the answer to Is what if you just graph test loss against n I mean what does that mean exactly for STD is a little unclear but you know you train for some fixed time and then uh I don't even I mean if if SGD looks like the posterior then you might expect that and if there's a whole bunch of question marks around the relation between okay maybe we should just step back a minute and explain why this is a basic question so and an important question for any application of SLT to deep learning is we have various quantities in SLT like Bayesian generalization era that we have theorems about and you know asymptotics as n goes to Infinity well it could be that those results just have nothing to do with deep learning at all because the N goes to Infinity limit as the wrong limit and the Bayesian generalization error is the KL Divergence between the truth and the predictive distribution but that's just irrelevant to the test loss which is what people mean by generalization era when they're actually doing experiments in deep learning so to make these connections we we either have to suss out which parts of that need to be replaced by say other limits or some other kind of theory um or you know find ways of justifying these connections but we're really missing even the basic planks in that uh just testing basic propositions so I don't know for example even if you just kind of graph test loss against n does that look like something over n or not does it look like something over into a power or not I don't even know so um yeah Edmund's been working on this so I should say that many of these questions are things that you know we've discussed before and various combinations we're working on various subsets of these so it's not
like these are completely Virgin Territory but we I don't think we've answered any of these uh yet did you want to say more about this one Edmund uh no except I should get on just running those experiments yeah well that was it wasn't quite uh so Evan you know I think Liam hodgkinson might be a really good person to talk about but you know even just setting up some good experimental settings um yeah I I think just even for now like the the extra one you just suggested and the other experiment which is um uh sort of plotting STD test loss against um B so increasing D not somehow fixing um somehow fixing Lambda versus increasing Lambda flat out so the idea is that Tesla should be um should decrease as Lambda decreases not as B decreases um and that's kind of an evidence that the singularity is having effect on on SGD which which is like no one dispute that I think and that's sort of the uh the anari natural gradient like this has learned that since a while ago but um maybe sort of a direct influenza Lambda irct might be interesting but yeah I I yeah I should I have in my to-do to talk to um yeah I I think he's a smart guy well worth talking to so the sixth problem uh is parallels so trying to understand the possible uh kind of medium in Behavior I suppose so we know that as n goes to Infinity asymptotically for a renormalizable model that the Bayesian generalization error looks like uh Lambda on n but there's a sense in which maybe we can believe in a kind of emergent learning curve that looks something like Lambda on into the gamma for some gamma that's not equal to one and it's a question mark whether this is related to the uh Power law behavior that people see in training say Transformer models so what I'm proposing here is not to understand it for Transformer models that's too ambitious but we've discussed modular some of these considerations I was just pointing out with SGD versus the posterior and so on uh we sort of maybe have a bit of a lead on understanding how to think about power laws with exponents different to one for one layer 10 H networks subject to various other things being resolved for example the renormalizability condition um and replacing some upper bounds for kind of energy and rlct for these other phases by um kind of upper bounds and lower bounds or equalities there's a bunch of question marks here but this um in the category of things that understanding it for one layer 10 H networks would be hints towards uh good directions to pursue more generally this
seems high on my list of importance as we have discussed before okay that was it for one hidden layer 10 H networks sort of problems oriented towards that did anybody want to add to the list I've got some more General things um anybody want to add to the list for the problems that are kind of specific to one hidden layer 10h foreign so let's see [Music] other problems deeper Networks other non-linearities I could have had somewhat embarrassingly to that list biases which is an open problem an SLT unless Spencer was busy and didn't tell me uh is it solved problem Spencer uh no not yet I've done a little bit sort of following on from that uh suggestion Matt had um a little while ago um it's just like it could be so I should yeah try and write some stuff up for that okay let's let's clear then that Spencer has solved this so yeah it's not so um yeah so deeper networks uh say a deeper of one deeper 10 H networks is um completely open there are some upper bounds for the rlct stated in one of watanabe's papers I I'm not sure any of us have really understood to what degree they're justified in that paper I don't know if that's your state of knowledge as well Edmund I I kind of buy maybe what he's saying there but uh I'm far from understanding it and I I think it's not close to being uh yeah it's it's really just an up and bound probably a rather coarse one so I don't think at the moment there's um really a clear idea how to do deeper networks I'm hopeful that if we were sitting here really understanding how to do one hit in layer 10h with the toric methods that how to progress to deeper networks might might seem somewhat routine but um that that isn't uh that isn't the case obviously you might say well okay we're going to do one layer and then two layer and then every four years we'll increase the number of layers that would be a bit futile but that isn't what you'd hope to get out of this exercise you'd hope to understand how to do each additional layer by somehow uh with some technique reducing I mean more or less hoping the work is mainly in the one hidden layer 10 H case and then a kind of inductive step where you uh you really only have to do these two parts you do one hidden layer and then going from n to n plus one and that would be the ideal I don't know how realistic that is you could ask a very naive question about one layer 10 it works yeah it just occurred to me um when all this all these problems that we're doing have we like typically picked like a p q and W like I
think the true function is assumed to be zero and there's now I can assumed posterior and like an assumed distribution for the posterior and P and Q I mean prior because I guess canonical prior sorry yeah yeah the prior kind of gets chosen to be whatever it is that makes the mathematics tractable but uh okay often that's not crazy things I don't know maybe you would disagree with that Spencer but um yeah yeah I think with what I was doing I was just like I think what a love is done to do is just have like a uniform period on on like a some finite um like and at least in terms of the approaches we've taken to uh for example replacing the KL Divergence by an algebraic function the choice of Prior can be quite it's not a it's not a small thing like from some point of view you know as long as the prior is positive on w0 it doesn't matter to the asymptotics but from the point of view of actually finding the equivalent polynomial it seems like it makes a big difference or at least the mathematics looks rather different the integrals you do could be attractable or not depending on the form of the prior so I don't it seems like it both matters a lot and doesn't matter at all so I don't know quite how to uh how those two fit together regarding the true distributions that makes a difference yeah okay thank you I was wondering if it was like a gaussian that everyone knew and I just didn't um that would be a natural prior to choose except that her for other reasons it's very useful to take the The Domain in which your inputs live to be compact so that that rules out taking a guessing yeah uh other non-linearities so you know one could try things like swish I guess that was also on Spencer's to-do list at some point um I don't know how how hard that will be that seems unclear why would why would one care about other non-linearities while nobody actually trains neural networks with tanh so that gives the results on tan h a kind of antiquated feeling I suppose from a theoretical point of view a mathematical point of view it's not clear to me the Deep phenomena are really so dependent on the non-linearity and so maybe it's not too important but it is certainly something that if you can get it to work for swish then it's pretty close to relieu and and therefore Maybe looks more serious so I think it's worth investing some effort in I mean I don't know how different the mathematics looks when you have the Taylor series expansion of swish as compared to 10h maybe it's not so hard
to adapt what we did to swish maybe maybe it's very difficult yeah thanks Russell I'll see you next time okay so uh two uh jet schemes or Arc schemes so some of these problems are more in the direction of uh here's the kind of thing in algebraic geometer would would want to think about um so it's not really accurate to say it like this but roughly speaking you can think of jet schemes and Arc schemes as being a way of understanding the singular structure which is a bit more tractable or packaged in a different way than the full resolution and I my intuition is that the uh so if you look at the the Bala subramanian paper that a few of us are familiar with I forget the title uh that's right it's got Occam's razor in it yeah okay that's that's enough thanks uh statistical inference open razor statistical mechanics on the space of probability distributions so a model yeah thanks so much so in that paper there's an asymptotic expansion for the free energy which has the familiar first two terms that we we know from say Bic which I've reproduced in the wbic uh but there's higher order terms in the a regular case possibly also in the kind of minimally singular case or whatever we want to call it uh probably still works I'm not sure but certainly in the regular case you can see higher order terms uh which have a geometric content so they have an interpretation in terms of curvature and so on um and my impression is that you could potentially do something like that in the singular Case by making use of more sophisticated ideas from jet schemes Arc schemes it's just another word for another name for jet schemes so this this would maybe give a geometric interpretation of some of the terms that are kind of hidden in the op1 in the current way that we write those formulas I don't I don't know how much we care like like it's from a geometry point of view it's interesting from a statistical learning point of view I don't even know that anybody cares about the higher order terms in Bala subramanians paper uh someone can correct me if I'm missing the plot there but it just doesn't seem that important uh so maybe this isn't it's not obvious to me it's important from the point of view of the application the statistical learning theory but it's kind of a geometrically well-motivated like pure mathematically interesting question so I think I I like it um yeah this is a kind of line of investigation that I started with James Clift and Tom Waring James woolbridge which is uh which is kind of sitting there waiting
for a bit more attention which is there's a there's a kind of potential for a dictionary between so Watanabe is saying in the introduction to his book that you know knowledge is a knowledge to be learned from data is a singularity and we sort of understand to some degree what that means but the knowledge that's to be gained from data is increasingly well viewed as an algorithm right uh now maybe in classical statistics that's not how you would think about the parameters that you learned from data the parameters in a model that is the way it is but if you're learning a neural network then maybe the neural network encodes or certainly does encode an algorithm but maybe that's a good high level way of thinking about it so for example there's increasing evidence that the algorithms that are learned by Transformer models are interpretable so you can look at the weights and see oh yeah okay so it's basically doing this and this is a list of operations that you would understand as an algorithm that a human might execute in that context to perform that task now the the algorithm is represented in in some list of Weights but it's not just some crazy linear algebra that has no clear like symbolic content uh one should be careful reading too much into these things potentially But as time goes by it's um it's becoming more justifiable to think about the results of learning in some of these large models as being algorithmic and character okay so how seriously to take that connection between algorithms in the sort of classical computer science setting and uh the singularities that are the knowledge that is learned from data when the model is a neural network and so one way to investigate that connection is to uh to find a way of embedding Both Worlds into the same place so what we did with um what Tom Waring did in his thesis for example was explain how to go from a turing machine use it or rather a code for a turing machine on a universal turing machine how to use that to produce a learning problem a synthesis problem such that the original turing machine corresponds to a a point in w0 of that model so a singularity of of that um analytic or algebraic variety now if and we understand the structure of programs we know what subroutines mean we know what composition means if we can understand what those structures on programs mean at the level of singularities then it may give a dictionary for talking about the kinds of algorithms that we learn in neural networks so this program's a singularities idea is a way
of Bridging the old world of programs Turing machines or code with the new world of programs algorithms that are inferred from data so this is not a specific problem necessarily but there's a there's a list of problems associated to this that are in this pas1 and pas2 are a pair of handwritten notes that some of you have um if you're interested in that I can can see more that that's sort of relevant to the the earlier discussion about phase transitions so there are some very interesting plots if you look at the recent output of anthropic and Chris Olas group there are some very interesting plots where they show as they train small Transformers on kind of algorithmic tasks they're a clear phase transitions when these algorithmic structures appear in the weights so there's a very interesting potentially quite deep link between phase Transitions and the emergence of kind of symbolic algorithmic structure in these systems and ultimately this is kind of the reason I care about phase transitions and maybe I'll list one more um which is again an area of Investigation rather than a specific problem like the earlier ones which is renormalization versus resolution of singularities and this uh this we've been discussing in the abstraction seminar so maybe I'll just defer to that seminar for future investigation of that topic wait that's all I had um Edmund what would you add Susan what would you add One technical funds which is um what happened for non-renaline non-renomizable and or even none essentially unique cases um we have kind of a partial answer in one of the paper by Watanabe for uh for restricted case right yeah there's also this cryptic comment about symmetry breaking and what another's work where he says this will be very important and then that's the last we ever hear on that topic from him oh yeah no worries Susan I can if you have problems you want me to add to this I can I can come back and put them in here later um yeah that's a good one is that that's kind of a question about how Universal the equations of state are I guess right like what is the proper class of um models is that right yeah speaking of that through a more conceptual one would be um put it a little bit um I think this is what I know this equation of state but so um on less frequently uh what is the connection between learning um systems and um physics yeah let me say thermodynamical dictionary I'll just clarify that with with singularities because of course there is a dictionary that relates statistical
learning theory to to thermodynamics but from our point of view it's it's incomplete foreign says she's going to maybe present some problems to do with variational inference in SLT later that would be excellent yeah thanks I think there is kind of a broader heading um which is I guess translational work as in um uh how does the knowledge about singularities in um learning problem how does that translate into improving um algorithms and um so so the version of inference like how do we how do we use So like um students work as has that flavor of now that we know um singular models have so-and-so structure and it seems like um knowing that could improve algorithms foreign people want to ask about the problems we discussed earlier is the thermodynamical dictionary like the thing between statistical mechanics and SLT um yeah yeah it's in Liam's thesis and it's in some of my notes but this is an old topic I mean people have been talking people have been applying statistical mechanics to learning theory for many decades uh so it's certainly not something we came up with however that that link has kind of over emphasized what we would call the regular case right where there are um you can I mean that's what balasubramanian's paper is is about for example this is an exposition of that linkage which is not invented by him and he's adding to it in some ways uh but a lot of that has presumed the regular case and so suffers in the same way as much of this does from a lack of attention to the singular case so there are still open questions about how to think about this dictionary I would say even in the regular case I'm not convinced that people really have thought through completely like what uh or at least I don't understand um how to think about everything I know in thermodynamics that seems appropriate to translate how to think about that in a learning theory context so maybe there's work to do even there but certainly in the singular case it's uh there are we could probably do a whole seminar and just what we don't understand that's kind of in the singular case specifically um yeah that goes the other way as well that's kind of what um maybe they could go under the heading of what Edmund just said so there are various things in physics that you could you could just see from the point of view of SLT you you could think about them you know a more refined way which is uh maybe a way of saying that everybody should bow before hiranaka and study his works see um yeah maybe maybe it's worth separating