Edmund introduced strictly singular statistical models, whose KL divergence has singularities in its zero set. The RLCT lambda is calculated by resolving these singularities, which is difficult in practice. The usual shortcut is to replace the KL divergence with an equivalent polynomial, where two non-negative analytic functions are equivalent if each is bounded by a constant multiple of the other. Watanabe's remark claims that such a polynomial exists for regression models but gives no proof. The talk gives four conditions under which the polynomial upper bound holds, the last of which relies on the Hilbert basis theorem.

KL divergence measures the distance between the model and the true distribution at a given parameter. Edmund introduced strictly singular statistical models, for which the zero set of the KL divergence (the set of true parameters) contains singularities. The RLCT lambda is calculated by resolving these singularities, which can be difficult in practice. To make calculations easier, people replace the KL divergence with an equivalent polynomial, which captures all the information needed to calculate the RLCT. Two non-negative analytic functions h and k are equivalent if there exist positive constants c1 and c2 such that c1 h(w) <= k(w) <= c2 h(w) for all w.

For regression models, where both the model and the true distribution are built from a regression function, the Kullback-Leibler divergence is the mean squared error between the model's regression function and the true function. The Hilbert space L2(X, q) is the set of functions on X that are square integrable with respect to the input distribution q. Watanabe's Remark 7.6 states that if the difference between the regression functions can be written in a certain series form, the Kullback-Leibler divergence is equivalent to a polynomial, but Watanabe does not prove the remark.

Suppose the difference between the regression functions can be written in series form. If the map tau, which sends a square-summable sequence of real numbers to the corresponding series in the Hilbert space of functions square integrable with respect to the input distribution, is well defined and bounded, and if the sequence of polynomials f_j(w) is square summable for every w, then the Kullback-Leibler divergence is bounded above by a constant times the infinite sum of the squared polynomials. The proof follows directly from the assumptions. Turning the infinite sum into a finite one takes a further argument.

The Hilbert basis theorem states that every ideal in a polynomial ring is finitely generated, so the ideal generated by infinitely many polynomials equals the ideal generated by finitely many of them. A separate lemma shows that the map tau is well defined and bounded when the series of squared norms of the functions e_j converges. To replace the infinite series with a finite one, the argument uses the Hilbert basis theorem, as Watanabe's remark suggests, to write each polynomial f_j as a combination of a finite generating set. The Cauchy-Schwarz inequality then bounds the infinite series by the sum of the squares of the generating polynomials multiplied by a series of squared coefficient polynomials.

The talk establishes a polynomial upper bound for the KL divergence between two regression models under four conditions: the difference between the regression functions can be written as a series; the linear map tau is well defined and bounded; for each w, the sequence of polynomials lies in the sequence space l2; and for each k, the series of squared quotients is bounded by a constant for all values of w. Spencer has searched the related literature, and Aoyagi's papers cite a similar lemma by Watanabe without a clear source. The remark matters for practical calculations because it brings more tools from algebraic geometry into singular learning theory.

Liam talked about how the free energy and generalization behaviour of a statistical model are determined by the real log canonical threshold (RLCT), lambda. This talk focuses on how lambda is calculated, which starts from a given model and true distribution and their Kullback-Leibler divergence. The talk also explores why replacing certain analytic functions and their zero sets with polynomials is useful in singular learning theory.

KL divergence measures the distance between the model and the true distribution at a given parameter. Edmund introduced strictly singular statistical models, for which the zero set of the KL divergence contains singularities. The RLCT lambda is calculated by resolving these singularities, which can be difficult in practice. To make calculations easier, people replace the KL divergence with an equivalent polynomial, which captures all the information needed to calculate the RLCT.

Regression models are constructed from a function that maps an input to an output. The model's conditional distribution is given by an exponential, similar to a normal distribution, based on the distance between the output and the value of the regression function. Two non-negative analytic functions h and k are considered equivalent if there are positive constants c1 and c2 such that c1 h(w) <= k(w) <= c2 h(w) for all w. Equivalent functions have the same real log canonical threshold (RLCT), so the RLCT of a model can be calculated from a simpler function that is equivalent to the KL divergence instead of from the KL divergence itself. However, the literature rarely explains how to find such a function or proves that it is equivalent.
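The equivalence condition can be explored numerically before attempting a proof. The sketch below is only an illustration: the functions h(w) = w^2 and k(w) = 1 - cos(w) on [-1, 1] are hypothetical choices (not from the talk), and a grid check suggests constants c1 and c2 but does not prove that they exist.

```python
import math

def h(w):
    # Candidate simpler function: a polynomial vanishing only at w = 0.
    return w * w

def k(w):
    # A non-negative analytic function with the same zero set on [-1, 1].
    return 1.0 - math.cos(w)

def ratio_bounds(k, h, grid):
    """Return (min, max) of k(w) / h(w) over grid points where h(w) > 0."""
    ratios = [k(w) / h(w) for w in grid if h(w) > 0.0]
    return min(ratios), max(ratios)

# Sample [-1, 1]; the common zero w = 0 is skipped by the h(w) > 0 filter.
grid = [-1.0 + 2.0 * i / 2000 for i in range(2001)]
c1, c2 = ratio_bounds(k, h, grid)
print(c1, c2)  # empirical candidates for c1 * h <= k <= c2 * h on the grid
```

If the ratio stays between two positive numbers on the whole parameter set, those numbers are candidates for c1 and c2. A ratio that approaches zero or blows up near the common zero set indicates that the two functions are not equivalent there.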

For regression models, the Kullback-Leibler divergence between the model and the true distribution is the mean squared error between the regression function and the true function. Regression models are of interest because they include neural networks. Watanabe's Remark 7.6 states conditions under which the Kullback-Leibler divergence can be replaced automatically with an equivalent polynomial: the difference between the two regression functions must be written as a series whose terms are products of a polynomial in the weights and a function of the input, with the functions of the input linearly independent.

The Hilbert space L2(X, q) is defined as the set of functions on X that are square integrable with respect to the input distribution q. The inner product of two functions g and h is the integral of their product against q, and this inner product induces a norm. If p and q are regression models, the Kullback-Leibler divergence is the squared norm of the difference between the two regression functions. Watanabe's remark states that if the model can be written in the series form, there exists a finite number J such that the Kullback-Leibler divergence is equivalent to a polynomial, namely the sum of the squares of f_0, ..., f_J. However, Watanabe does not prove the remark.
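The inner product, the norm, and the resulting expression for the KL divergence can be approximated numerically. This is a minimal sketch under assumptions that are not from the talk: X = [0, 1] with q uniform, a midpoint quadrature rule, and a hypothetical one-parameter regression function f(x, w) = w x with true function 0. It follows the talk's convention that K(w) is the squared norm of the difference; depending on how the exponential in the model is normalised, a constant factor may appear, which does not affect equivalence.

```python
def inner(g, h, n=10_000):
    """Midpoint-rule approximation of <g, h>, the integral of g(x) h(x) q(x) dx,
    assuming q is the uniform density on [0, 1]."""
    step = 1.0 / n
    return sum(g((i + 0.5) * step) * h((i + 0.5) * step) for i in range(n)) * step

def norm_sq(g, n=10_000):
    # Squared norm induced by the inner product.
    return inner(g, g, n)

def kl_regression(f, f_true, w):
    # Squared L2(X, q) distance between f(., w) and the true function.
    return norm_sq(lambda x: f(x, w) - f_true(x))

f = lambda x, w: w * x   # hypothetical regression function
f_true = lambda x: 0.0   # hypothetical true function, realised at w = 0

print(kl_regression(f, f_true, 0.5))  # close to 0.5**2 / 3
```

For this toy model K(w) = w^2 / 3 exactly, so K is itself a polynomial and its zero set is the single true parameter w = 0.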

The speaker set out to prove the equivalence between K(w) and the polynomial, which requires both an upper and a lower bound. This initially seemed straightforward but became harder over time. If the functions e_j in the series were orthonormal, the lower bound would follow immediately, but this is not the case in actual examples, and the lower bound remains open. The upper bound can be proved once a few extra conditions are added.
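The orthonormal case can be written out as a short calculation. This is a sketch that assumes the series converges in L^2(X, q), so that the inner product can be expanded term by term, and that K(w) is the squared norm of the difference between the regression functions:

```latex
\begin{align*}
K(w) &= \Big\| \sum_{j=0}^{\infty} f_j(w)\, e_j \Big\|_{L^2(X,q)}^2
      = \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} f_j(w)\, f_k(w)\, \langle e_j, e_k \rangle \\
     &= \sum_{j=0}^{\infty} f_j(w)^2
      \;\ge\; \sum_{j=0}^{J} f_j(w)^2 .
\end{align*}
```

The second line uses \langle e_j, e_k \rangle = 1 when j = k and 0 otherwise. Without orthonormality the cross terms do not vanish, which is why this argument does not carry over to actual models.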

Suppose that the difference between the regression functions can be written in series form. Consider the map tau from the space l2 of square-summable sequences of real numbers to the Hilbert space of functions square integrable with respect to the input distribution, which sends a sequence to the series with those coefficients on the functions e_j, and suppose tau is well defined and bounded. If the sequence of polynomials f_j(w) is in l2 for all w, then there exists a constant c such that the KL divergence is less than or equal to c times the infinite series of the squared polynomials. The proof of this lemma is short and follows directly from the assumptions. The infinite series still has to be reduced to a finite sum.
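The proof of the lemma can be sketched in one chain of (in)equalities. The notation is an assumption made for this sketch: \tau(a) = \sum_j a_j e_j for a sequence a = (a_j) in \ell^2, boundedness means \|\tau(a)\| \le C \|a\|_{\ell^2}, and the constant in the lemma is then c = C^2.

```latex
\begin{align*}
K(w) &= \big\| f(\cdot, w) - f_T \big\|_{L^2(X,q)}^2
      = \big\| \tau\big( (f_j(w))_{j \ge 0} \big) \big\|_{L^2(X,q)}^2 \\
     &\le C^2 \, \big\| (f_j(w))_{j \ge 0} \big\|_{\ell^2}^2
      = C^2 \sum_{j=0}^{\infty} f_j(w)^2 .
\end{align*}
```

The first equality is the regression form of the KL divergence, the second is the series assumption, and the inequality is the boundedness of \tau applied to the sequence (f_j(w)), which lies in \ell^2 by the third assumption.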

A second lemma states that the map tau is well defined and bounded if the series of squared norms of the functions e_j converges; the proof uses the Cauchy-Schwarz inequality. To replace the infinite series of polynomials with a finite one, the argument follows Watanabe's remark and uses the Hilbert basis theorem, which states that every ideal in a polynomial ring is finitely generated.

The Hilbert basis theorem implies that the ideal generated by infinitely many polynomials is equal to the ideal generated by finitely many of them, so each polynomial f_j can be written as a combination of the generators f_0, ..., f_J with polynomial coefficients a_{jk}. Substituting these expressions into the infinite series of squared polynomials and applying the Cauchy-Schwarz inequality to each term shows that the series is at most the sum of the squared generators multiplied by the series of the sums of the squared coefficients a_{jk}(w). That is, it is at most the sum over k of f_k(w) squared times a series that must still be shown to be bounded.
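Written out, the reduction from the infinite series to a finite sum looks as follows. The indexing is an assumption made for this sketch: the generating set is taken to be f_0, ..., f_J and the coefficient polynomials are written a_{jk}.

```latex
\begin{align*}
f_j(w) &= \sum_{k=0}^{J} a_{jk}(w)\, f_k(w), \\
\sum_{j=0}^{\infty} f_j(w)^2
  &= \sum_{j=0}^{\infty} \Big( \sum_{k=0}^{J} a_{jk}(w)\, f_k(w) \Big)^2
   \le \sum_{j=0}^{\infty} \Big( \sum_{k=0}^{J} a_{jk}(w)^2 \Big) \Big( \sum_{k=0}^{J} f_k(w)^2 \Big) \\
  &= \Big( \sum_{k=0}^{J} f_k(w)^2 \Big) \sum_{j=0}^{\infty} \sum_{k=0}^{J} a_{jk}(w)^2 .
\end{align*}
```

The inequality is Cauchy-Schwarz applied to each term as a dot product of the vectors (a_{j0}, ..., a_{jJ}) and (f_0, ..., f_J). If the double series on the right is bounded by a constant uniformly in w, the infinite series is bounded by a constant multiple of the finite polynomial.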

Under four conditions, the KL divergence between two regression models has a polynomial upper bound of the form in Watanabe's remark: the difference between the regression functions must be written as a series; the linear map tau must be well defined and bounded, which is guaranteed if the series of squared norms of the functions e_j converges; for each w, the sequence of polynomials must be in the sequence space l2; and for each value of k, the series of squared quotients must be less than some constant for all values of w. Checking the fourth condition is difficult, and doing so on a few different models makes up the rest of the thesis so far.
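The four conditions can be checked by hand on a toy model and the resulting bound compared numerically. Everything in this sketch is a hypothetical example, not one of the models from the thesis: X = [0, 1] with q uniform, true function 0, f_j(w) = w^(j+1), e_j(x) = x^j / j!, and parameters restricted to |w| <= R with R = 0.5. The series then sums to w exp(w x), the ideal generated by all the f_j is generated by f_0(w) = w alone (so J = 0 and a_{j0}(w) = w^j), and the fourth condition holds because the sum of w^(2j) is at most 1 / (1 - R^2).

```python
import math

R = 0.5      # hypothetical parameter range |w| <= R < 1
TERMS = 30   # truncation of the infinite series (assumed negligible tail)

def e_norm_sq(j):
    # ||e_j||^2 = integral over [0, 1] of (x^j / j!)^2 = 1 / ((2j + 1) (j!)^2)
    return 1.0 / ((2 * j + 1) * math.factorial(j) ** 2)

def kl(w, n=2_000):
    # K(w) = integral over [0, 1] of (w exp(w x))^2, by the midpoint rule.
    step = 1.0 / n
    return sum((w * math.exp(w * (i + 0.5) * step)) ** 2 for i in range(n)) * step

# Condition 2: the series of squared norms of the e_j converges, and its sum
# bounds the squared operator norm of tau (by Cauchy-Schwarz).
S_e = sum(e_norm_sq(j) for j in range(TERMS))

# Condition 4: sum_j a_{j0}(w)^2 = sum_j w^(2j) <= 1 / (1 - R^2) for |w| <= R.
C = S_e / (1.0 - R * R)

grid = [-R + 2 * R * i / 200 for i in range(201)]
worst = max(kl(w) / (w * w) for w in grid if abs(w) > 1e-12)
print(worst, C)  # largest observed K(w) / f_0(w)^2 versus the constant C
```

In this toy case the ratio K(w) / w^2 also stays away from zero on the grid, so the polynomial w^2 appears to be equivalent to K(w), but that lower bound is specific to the example and is not given by the four conditions.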

Spencer has looked extensively through the singular learning theory literature from 1999 to 2002, although a few Watanabe papers remain to be read. Aoyagi has collaborated with Watanabe on calculating LCTs, and those papers often cite a vague lemma by Watanabe to justify the replacement. It is unclear whether this lemma is due to Watanabe or is folk knowledge in algebraic geometry.

In algebraic geometry, it is unusual to construct bounds of this kind. Edmund has looked at related ideas in analytic geometry, but they do not seem to fill the gap. Spencer focuses on a technical remark in Watanabe, which is useful for calculating the RLCT of a model by replacing the KL divergence with a polynomial. This remark is significant because it brings in more tools from algebraic geometry, and it is necessary to justify the title of Watanabe's book. It does not affect the soundness of the theory, but it is important for practical calculations.

Okay, we're ready. Go when you're ready.

Sure, all right. Hey everyone. Last week I talked about some background algebra, rings and the Hilbert basis theorem. This week I'm going to get into what makes up about the first half of my thesis, so a lot of it will be setting up the problem. The title of the series of talks is "From analytic to algebraic". To elaborate on that: on the analytic side we'll be talking about analytic functions, so functions whose Taylor series, calculated anywhere, converge, and analytic sets, which are the points that make either an analytic function or a collection of analytic functions equal to zero. On the algebraic side we're talking about polynomial functions and algebraic sets, which are the points in space that make either a polynomial or a collection of polynomials equal to zero. My thesis is really about trying to take certain analytic functions and their zero sets and, if possible, replace them with polynomials, all within the context of singular learning theory. During this talk I'll go through why it could be useful to be able to do that.

I'll remind everyone of some of the things Liam talked about last week, which set up the problem we'll be looking at today. We're given a statistical model, which is a conditional probability distribution parameterized by a weight called w, so at each value of the weight you get a different conditional distribution between the input x and the output y. We're also given a true distribution, which is also a conditional distribution and which captures the correct relationship between the input and output. Given a model and true distribution, Liam talked last week about how the free energy, and in turn things like the generalization behavior of the model, is determined by a constant that Liam introduced, called the RLCT or real log canonical threshold, written lambda. Because of this, actually calculating lambda is one of the main ways you can try to understand a model, and more specifically how that model generalizes as it learns from new data. So I'll run through the rough outline of how lambda is actually calculated. Lambda is calculated from the Kullback-Leibler divergence.

I think a couple of weeks ago Edmund introduced this quantity, but I'll just remind everyone what it is. This function measures the distance between the model p and the true distribution q at each particular parameter. It's this integral where you integrate over the input distribution and the true distribution, and then you have this logarithm. Also remember that Edmund, a couple of weeks ago, introduced strictly singular statistical models, defined through the Fisher information matrix being degenerate at some points. For strictly singular models, this function K(w) is often a complicated analytic function, and it has singularities in its zero set. So the set of points which make K(w) equal to zero, which is also known as the set of true parameters, contains singularities. I won't go through the precise definition of a singularity, but I think that will be covered in future talks.

I'll jump over to the next board. The RLCT lambda is calculated by resolving these singularities. That's quite a vague description of the process, but I believe in Ken's talk later today he'll be talking about resolutions of singularities, which should clarify what I actually mean by that. From what I understand, the rough idea is to find a coordinate transformation which simplifies the form of the KL divergence. The problem with this, and one of the reasons that using the techniques of singular learning theory on actual models is quite difficult, is that in general this is hard. The process of resolving singularities comes from Hironaka's theorem on the resolution of singularities, which is hard to understand, and even in simple examples the process can be quite difficult, especially with just some arbitrary analytic function that you're trying to resolve the singularities for. So in practice people, including Watanabe, do something else. Instead, Watanabe and others who do singular learning theory calculations replace K(w) by an equivalent polynomial. Instead of taking this analytic function K(w) and directly computing the RLCT for it, people generally first try to find a polynomial which captures all the information needed to calculate the RLCT. So I'll talk about what I mean by

equivalent, specifically. Take two non-negative analytic functions, which I'll call h(w) and k(w). We call them equivalent if there are two positive constants c1 and c2 such that you have the two following inequalities: c1 times h is always less than or equal to k, which is always less than or equal to c2 times h. For anyone who's taken a course in metric spaces, this is a similar idea to Lipschitz equivalent metrics or Lipschitz equivalent norms. The reason we care about this idea of equivalence is that if the two functions h and k are equivalent, then they have the same RLCT. So if you want to calculate the RLCT for a model, you can do that by finding a simpler function which is equivalent to the KL divergence, and then calculate the RLCT using that function instead. I won't prove this lemma here. Although this lemma is used by quite a lot of people, it's actually hard to find a proof of it. But I do know that Tom Waring, one of Dan's master's students who recently finished, has proven this lemma in his thesis, so if you're interested I think that can be found on Dan's website.

Okay, so then the question is: given a model, how can you find an equivalent polynomial? This is the question my thesis has been looking at. As I said previously, in the literature people like Watanabe often use this technique of replacing the KL divergence by a polynomial. The issue is that no one ever really justifies what they're doing: they don't prove that it's equivalent, or talk about the process of how they found that polynomial. Often people just state that a polynomial is equivalent and go on from there. So if you're trying to do your own calculations, looking at different models, it can be very difficult knowing how to actually find one of these polynomials.

I'll go over to the next board and introduce the types of models which we'll be studying. Liam actually introduced these models last week, so I'll just remind everyone what regression models are. Roughly, a regression model is one which is constructed from a function which takes an input and maps it to an output. The model's conditional distribution is given by this exponential, which is like a normal distribution, based on the distance between the output

y and the output of this function, which is called the regression function. That function is parameterized by the weight w. We also assume that the true distribution is given by the same type of model, where the probability distribution is constructed from what we call the true function, written f with a subscript T. What the true function is meant to do is take an input and correctly map it to an output. For example, if the task was image classification, the true function would be a function which takes an image and always correctly identifies what's in that image. A couple of slides ago I wrote out the expression for the Kullback-Leibler divergence, but when we're looking at regression models the KL divergence becomes much easier to interpret. If p and q are given as above, then the KL divergence ends up being just the mean squared error between the regression function and the true function (there's meant to be a minus sign in between those two), where the average is calculated with respect to the distribution which generates the inputs, q(x). So when the models are given in this form, the KL divergence really measures the distance between those two functions. I also won't prove this lemma, but I do know that a proof can be found in Liam's thesis, which is also on Dan's website.

One of the reasons to be interested in regression models is that they include things like neural networks. Watanabe, in his book, hints at a general method for taking regression models like this and finding an equivalent polynomial. He states that in a remark, which I might go over to the next slide to write out. It's what he calls Remark 7.6. This remark states conditions which, according to the remark at least, let you automatically replace the KL divergence with an equivalent polynomial. What Watanabe states in this remark is that if p and q, the model and true distribution, are regression models, and if you can write the difference between the two regression functions as a series of the following form, where each term is a product of a function of w and a function of x, specifically where the functions of the weights, the f_j, are polynomials and the e_j are linearly independent

so then, if you can do this, Watanabe says that by the Hilbert basis theorem there exists some finite number J so that K(w), the Kullback-Leibler divergence, is equivalent to the following polynomial. It's just a finite sum of squares of some of the f_j, which come directly from the series written above. This remark looks quite useful: it says that if you can write your model in this form, you directly get this equivalent polynomial. The issue is that Watanabe doesn't actually prove this remark in his book, or anywhere else as far as we can tell. This has really been the starting point of my thesis. We started off trying to prove his remark, and as we went along we realized it was actually quite difficult. As far as I can tell it doesn't hold in the generality stated here, and there are quite a few things that we're stuck on, but we have been able to make a little bit of progress, which is what my thesis has been about so far.

So I'll introduce the setup for how we try to analyze this problem. We define the Hilbert space L2(X, q), X being the input space for the variable x and q being the input distribution q(x). We define this space as the set of functions on X which are square integrable with respect to the input distribution q. Specifically, that's functions g such that if you integrate g(x) squared multiplied by the input distribution, this is finite. This is a Hilbert space, equipped with an inner product: if you have two functions g and h, the inner product between those two is just the integral of their product against the input distribution q. I'm going to go back to the first board. From that inner product we get a norm. Given a function g, the norm squared of g (I'll write out the space it's on, because there'll be a couple of other norms which come up) is the integral of g squared against q(x). What this implies is that if p and q are regression models then, since the Kullback-Leibler divergence was just the mean squared error between the two regression functions, when viewing everything as happening in this Hilbert space the KL divergence is the norm squared of the difference between those two functions. Okay, so the starting point of us

trying to understand this remark was that it originally seemed fairly reasonable, and perhaps even straightforward to prove, but the more we thought about it, the harder it got. I'll outline why it seems reasonable in the first place that it might be true. Firstly, to show equivalence between K(w) and the polynomial that Watanabe writes, we need to show both an upper bound and a lower bound: the upper bound being an expression of the form K(w) less than or equal to c1 times this finite sum of those polynomials squared, and the lower bound being a similar statement, with another constant times that polynomial. Take the assumption that the difference of the regression functions, specifically f(x, w) minus the true function, can be written as this series of polynomials and linearly independent functions. If we assumed that these e_j weren't just linearly independent but actually orthonormal with respect to the inner product we introduced just before, then the lower bound would fall out almost immediately. The reason is that the KL divergence, as we saw, is the norm of the difference between the regression function and the true function, squared, and that's just the inner product of this difference with itself. Substituting in that series on both sides of the inner product, because the e_j are all orthonormal (taking the inner product of one with itself gives you one, and the inner product of two different e_j gives zero), this just equals the sum from j equals zero to infinity of the f_j(w) squared. That's automatically larger than any finite sum. So if the e_j were orthonormal, the lower bound would fall out straight away. The issue is that in any actual examples of models these e_j(x) functions aren't orthonormal, so this doesn't work. It turns out that the lower bound is actually the much harder direction to prove, and that's one of the main things we're still stuck on.

I'll go to the next board. We have been able to make some progress on the upper bound, so I'll talk about what we've been able to find. As I mentioned before, Watanabe's Remark 7.6 is a very general statement and doesn't really have many conditions. We've been able to find that if we add in a few extra conditions, or things that we need to check, then

we can prove that the upper bound actually works. I'll start by introducing the following lemma, and then go through the argument to work out those conditions in a few different steps. The first lemma is this. Suppose, firstly, that the difference of the regression functions can be written in the series form that Watanabe introduced. The second condition: consider the function tau which goes from little l2, the set of sequences of real numbers which are square summable, to the Hilbert space we introduced before, of functions which are square integrable with respect to the input distribution q. This function takes a sequence of real numbers and maps it to the sum, over all indices, of those real numbers used as coefficients of the e_j(x) functions. Suppose this is well defined and bounded (I should write "if" out the front). If anyone has taken Dan's metric and Hilbert course they'll be familiar with the idea of bounded linear functions between normed spaces; the rough idea is that the output in some sense shouldn't be any bigger than the input. So suppose that holds for the function tau. Lastly, look at the polynomials in the series at the top of the page, the f_j(w). For each value of w you get an infinite sequence of real numbers. Suppose this sequence is in little l2 for all w in W, meaning specifically that the sum over all j of the f_j(w) squared is finite for each value of w. Then we can conclude that there exists a constant c so that K(w) is less than or equal to c times the infinite series of those polynomials, all squared. This upper bound that we obtain from the lemma still isn't quite a polynomial, because it's an infinite sum of polynomials. So after this we'll look at how we can try to turn that into a finite sum.

The proof of this lemma is very simple; it really falls out of the assumptions quite directly. We have that the KL divergence, as we saw before, is the norm of the difference between the two regression functions, or more specifically the norm squared in that space. And this series that we've got at the top here, if we look at it for a second, we can see that it's just tau applied to the sequence of (sorry, my drawing tablet's going a bit weird) the sequence of these polynomials

so this norm down the bottom equals the norm of tau applied to that sequence. By the assumption that tau is bounded, that just means there's a positive real number c so that this is less than or equal to c times the norm of the sequence of polynomials, where this norm is the little l2 norm, and specifically that just equals c times the sum over all the polynomials squared, which completes that lemma. If we look at the three conditions that this lemma requires, I think the third one looks fairly straightforward to check: it's just checking the convergence of a series. The first one's a bit weird, actually knowing what models can be written like this, but we won't go into that today. The second condition, I think, looks the hardest to check, at least at first. It looks confusing to actually understand this map. But I'll go to the next board and introduce a condition which guarantees that the function tau is well defined and bounded. So we have a lemma that tau is well defined (I should say, by well defined I mean that the function it outputs actually is a square integrable function) and bounded if the following series converges: if you sum, from j equals zero to infinity, the norms of all the linearly independent e_j(x) functions, taking the norm in the Hilbert space we introduced and squaring them, and this series converges, then tau is automatically well defined and bounded. I won't prove this lemma, just to save a bit of time, but it's actually not too difficult to show. From what I remember it just involves using the Cauchy-Schwarz inequality.

[Audience member] I think you didn't finish writing the statement of the lemma.

Okay, so the map tau is well defined and bounded if this series converges. Yes, yep, okay.

So where we've got to so far is that we've been able to show when K(w) is bounded by this infinite series of polynomials. But as we require a polynomial in the end, we want to be able to replace that with just a finite sum of those polynomials. In the statement of the remark Watanabe talked about the Hilbert basis theorem, and that is the main tool we'll use to try to write the bound as a finite sum. I'll just remind everyone: last week I talked about the Hilbert basis theorem, and what that theorem says is that

every ideal in a polynomial ring is finitely generated. For us, we have an infinite sequence of polynomials, and what the Hilbert basis theorem says is that the ideal generated by all of those infinitely many polynomials is equal to the ideal generated by some finite number of them. Specifically, for every index j there exist polynomials, which we write using a, so a_{j0}(w), a_{j1}(w), all the way up to the final index in the finite generating set, capital J, so that the j-th polynomial can be written as a sum of these a polynomials times the generating polynomials. Specifically, you can write f_j(w) as a_{j0}(w) times f_0(w), plus a_{j1}(w) times f_1(w), all the way up to the last generating polynomial. Okay, so I'll move over to the next board again to continue this argument.

[Audience member] You said this already, but this step of reducing from infinite to finite is the principal reason we want these coefficient functions f_j to be polynomials, right? Because the Hilbert basis theorem certainly doesn't work if you just have analytic functions.

True, yeah, that's a good point.

Okay, so we have this infinite series of the squares of those polynomials, and we can plug those expressions directly in. That gives us the following series, with each term running from the f_0(w) term to the last polynomial in the generating set, all squared. What you can notice is that each term of the series can be written as a dot product between two vectors: we can put all the a polynomials in one vector, and take the dot product with the vector of all the f polynomials, and that gives us the expression above. What this lets us use is the Cauchy-Schwarz inequality, which tells us that the norm of the dot product is
smaller than the product of the norms of each individual vector, I guess with some squares in there. Specifically, that means this series is less than or equal to the series where each term is instead the sum of the $a_{jk}$'s squared multiplied by the sum over that generating set of polynomials, each squared. So immediately we can take this factor and pull it out the front, so that equals $\sum_{k=0}^{J} f_k(w)^2$ times this series on the right, whose terms I could write as $a_{j0}(w)^2 + \cdots + a_{jJ}(w)^2$. So then, if this series on the right, and this is a big if...

And the sum goes to infinity there, right?

That's a good point, yes. So that's an infinite series on the right, but if it happens to be that this series not only converges for each value of $w$ but happens to be bounded by some constant that works for all values of the weight $w$, then you get an upper bound by a finite polynomial, specifically the polynomial that Watanabe stated in his remark. Showing that this series, outlined in green, converges is I think probably the hardest part of the calculation to actually check. But as we'll see in the next couple of weeks, there are certain models where we can do a direct calculation and show that it does converge, and show that an upper bound actually works.

So I'll just go back to the first slide to conclude everything. I'll just write out a list of conditions which guarantee that the polynomial upper bound on the KL divergence works. To conclude: firstly, we have the condition from Watanabe's remark that the difference between the regression functions can be written as that series. The second condition, sorry, before, the condition was that that linear function tau is well defined and bounded, but then previously we saw something which guarantees that: specifically, if the series of the norms of those linearly independent functions of $x$ converges, that guarantees that tau is well defined and bounded. The third condition is that at each value of $w$ the sequence of polynomials is in the sequence space little $\ell^2$. And then the fourth condition, which is the most difficult to actually check, is for each $k$ equals zero up to $J$, so this $k$ index runs over the generating set of polynomials that we got from the Hilbert basis theorem. So if, for each of those values of $k$, the series of those quotients, where we
square them all, if that is less than infinity, or, well, specifically if that is less than some constant $c$ for all values of $w$, then we can conclude that the KL divergence for those regression models is less than or equal to $c$ times this polynomial. And as I think I mentioned earlier, showing this fourth condition is quite tricky, and so checking that condition on a few different models has so far made up the rest of my thesis, and going through some of the simpler calculations will be

what the next couple of talks I'll be doing will be about. But yeah, I'm going to wrap up there, and if anyone had any questions about that stuff, feel free to ask.

Yeah, thanks a lot. Questions?

Oh, go ahead.

No, I was just going to say that, as far as we can tell, maybe there's some idea for how Remark 7.6 might be proven obliquely mentioned in some paper, perhaps, but after exhaustively looking through the literature, I think Spencer has... well, maybe there are still a few Watanabe papers you haven't read?

There's still a few, I think. I got to around maybe 2001 or 2002 before I swapped over to just writing a draft of my thesis, so there are still quite a few papers I could, and should, still get back to, but certainly going back further in time.

Yes, 2001, I see. Watanabe, as far as I understand, had a couple of papers from 1999 but not a lot before that in relation to singular learning theory. He was doing some work related to neural networks, but I don't think it was as theoretical, so there's probably only a handful of papers, maybe a couple, before 2001.
Yeah, so I started around the start and saw quite a few of those papers. It was interesting seeing how over time he developed the whole singular learning theory thing. The issue I found with Watanabe's papers is that he publishes lots of very similar papers, and occasionally there's a little bit that's different in each one, so it's always tricky scouring through them trying to find whether something's in there.

So you said an exhaustive search through the literature, but you're just talking about Watanabe's literature. What about the more general literature in algebraic geometry?

So I haven't actually looked at any algebraic geometry papers. I have looked at some papers by people who often collaborate with Watanabe. For example, Aoyagi has a few papers with Watanabe on calculating LCTs, and she does a similar thing to Watanabe, where she often just says "here's an equivalent polynomial" and references this quite vague previous lemma by Watanabe to justify it. So it's still quite unclear.

Oh, okay, I didn't know about that. So it is actually attributed to Watanabe rather than just to folk knowledge in algebraic geometry, is that right?

Yes, yeah.
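For reference, the bound that such an "equivalent polynomial" relies on, as derived earlier in the talk, can be sketched as follows. This is a reconstruction under assumptions: the symbols $K(w)$ for the KL divergence and the constants $c$ and $C$ are notation introduced here for illustration, and the fourth condition is taken to mean that $\sum_j a_{jk}(w)^2$ is bounded uniformly in $w$ for each $k$.

```latex
% Sketch reconstructed from the talk; K(w), c and C are assumed notation.
% f_j are the polynomial coefficient functions, f_0, ..., f_J the finite
% generating set, and a_{jk} the polynomials from the Hilbert basis theorem.
\begin{align*}
  f_j(w) &= \sum_{k=0}^{J} a_{jk}(w)\, f_k(w)
    && \text{(Hilbert basis theorem)} \\
  f_j(w)^2 &\le \Big(\sum_{k=0}^{J} a_{jk}(w)^2\Big)\Big(\sum_{k=0}^{J} f_k(w)^2\Big)
    && \text{(Cauchy--Schwarz)} \\
  \sum_{j=0}^{\infty} f_j(w)^2
    &\le \Big(\sum_{j=0}^{\infty}\sum_{k=0}^{J} a_{jk}(w)^2\Big)\sum_{k=0}^{J} f_k(w)^2
     \le C \sum_{k=0}^{J} f_k(w)^2
    && \text{(fourth condition)} \\
  K(w) &\le c \sum_{j=0}^{\infty} f_j(w)^2 \le c\,C \sum_{k=0}^{J} f_k(w)^2
    && \text{(bounded map } \tau\text{)}
\end{align*}
```

Only the upper bound is sketched here; the talk does not spell out the matching lower bound needed for full equivalence.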

But so far we haven't understood what was meant in the things that have been referenced there.

Yeah. From the point of view of algebraic geometry, it's a bit of an unusual combination of topics, right? One isn't usually constructing bounds on things, first of all. Secondly, there's the role of these function spaces. Of course there's a lot of analysis in algebraic geometry, and there are some related ideas in analytic geometry that Edmund's been looking at, but they also don't seem to fill this gap, or at least we haven't figured out how. Maybe, Spencer, I could just ask you: why focus on this somewhat technical remark in Watanabe? Maybe it's something that's not proven, but why care about it?

Yeah, so I guess it's significant, at least in my view, because if you actually want to go about calculating an RLCT of a model, it seems like you really do need to try and replace the KL divergence by a polynomial, and this remark seems to be a pretty general way of doing that. I think that's the main reason I think it's significant. I guess also, if you can do that or something like that, it does let you bring in a whole lot more tools from algebraic geometry to study problems.

Well, I would say at a conceptual level, without this remark it's actually hard to justify the title of his book, right? Which is Algebraic Geometry and Statistical Learning Theory, not analytic geometry. So without this reduction by the Hilbert basis theorem to a polynomial, it's hard to say there's a fundamental connection between learning and algebraic geometry per se.

Right, so it's quite important. Does it affect the soundness of the rest of the theory, or is it just a glimmer of hope that it might be easier than the fully general case to calculate these RLCTs?

So as far as I can
tell, I don't think it affects the soundness of the theory, but it is definitely important if you want to do any actual calculations.

Yeah. So would I be right to say something like: if this remark turned out to be false, or turned out not to apply for a specific model, then that wouldn't affect what geometry has to say about the model, but it would affect how practical it was to calculate the RLCT?

I think that's correct. Dan, did you have any opinion on that?

Yeah, I think in some sense you could say it's a