WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed. The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.

Synopsis


Statistical inference approximates an unknown true distribution by learning from a dataset. Point estimation chooses the best parameter according to some measure, such as the likelihood function. The Fisher information matrix is used to approximate the distribution of the estimator: for regular models the maximum likelihood estimator is unbiased, with asymptotic variance 1/n times the inverse Fisher information, so the Fisher information acts as the asymptotic precision. KL divergence is a measure of the separation between two probability distributions, and the empirical log likelihood ratio, the average over the data of the log ratio between the true distribution and the model, has the KL divergence as its expectation. The realizability condition states that some parameter produces a model equal to the true distribution.

Short Summary


Statistical inference involves a true distribution, which generates a dataset, and a learning process, which produces an approximation to the true distribution. Point estimation chooses the best parameter according to a measure such as the likelihood function; the resulting estimator is a random variable because it depends on the random training samples. In the coin-flip example the estimator distribution is a binomial distribution, which is well approximated by a normal distribution when the number of trials is large. However, if the true coin is extremely biased, the sample size needs to be much larger before the normal approximation becomes sharp and accurate.
The Fisher information matrix is used to approximate the distribution of the maximum likelihood estimator when that distribution cannot be computed directly. It is the covariance matrix, under the model at a given parameter, of the vector of partial derivatives of the log likelihood. Asymptotic normality is a property of maximum likelihood estimators of regular models. An example is the coin toss model, whose probability of heads is the parameter w; carrying out the computation gives the Fisher information of that model.
The MLE is an unbiased estimator of the true parameter, with asymptotic variance equal to 1/n multiplied by the inverse of the Fisher information, so the Fisher information plays the role of an asymptotic precision. A model is regular if it is non-degenerate (identifiable) and its Fisher information matrix is everywhere invertible. An example of a singular model arises from modelling the coin flipper behind the door as secretly choosing between two coins, even though the same sequence of heads and tails is observed. The resulting two-parameter model has a Fisher information matrix whose entries carry factors of (w - 1/2), so it becomes degenerate when w equals one half.
KL divergence measures how far apart two probability distributions are. It is always non-negative and is zero exactly when the distributions are equal. It is related to the Fisher information matrix: the Fisher information at the true parameter is the second derivative (Hessian) of the KL divergence there, which, when non-degenerate, defines a Hessian metric; the KL divergence itself is not a metric, being asymmetric and failing the triangle inequality. The empirical log likelihood ratio is the average over the data of the log ratio between the true distribution and the model, and its expectation is the KL divergence. The realizability condition states that there is a parameter whose model equals the true distribution; the set of all such parameters is the zero set of the KL divergence.

Long Summary


A coin-flipping person produced a sequence of observations, and the task is to estimate the probability of landing heads when the coin is thrown again. Statistical inference involves a true distribution, which generates a dataset, and a learning process, which produces an approximation to the true distribution. The learning process here defines a model, a family of probability distributions indexed by a parameter, together with a likelihood function of that parameter built from the data. The maximum likelihood estimate is found by maximizing the likelihood function, and the model at that estimate is taken as the approximation.
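As a worked instance of this maximisation for the coin model (not part of the original summary), with h heads observed among n flips:

```latex
\[
  L(w) = \prod_{i=1}^{n} p(x_i \mid w) = w^{h}(1 - w)^{\,n-h},
  \qquad
  \frac{d}{dw}\log L(w) = \frac{h}{w} - \frac{n - h}{1 - w} = 0
  \;\Longrightarrow\;
  \hat{w}_{\mathrm{MLE}} = \frac{h}{n}.
\]
```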
Point estimation chooses the best parameter according to some measure, such as the likelihood function; the estimator is a random variable because the training samples produced by the true distribution are random. In the coin example the estimator distribution is a binomial distribution, which is approximated well by the normal distribution when the number of trials is large. However, if the true probability is extremely biased, the normal approximation is poor unless the sample size is large: the sample size needs to be big enough that the normal distribution becomes sharp and no longer spills into impossible parameter values.
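The following sketch, not part of the seminar, makes that comparison concrete; the helper mle_distributions and the particular values of w and n are illustrative choices:

```python
# Sketch: compare the binomial sampling distribution of the MLE
# w_hat = (#heads)/n with its normal approximation N(w, w(1-w)/n),
# for a fair coin and for a heavily biased coin.
import numpy as np
from scipy import stats

def mle_distributions(w_true, n):
    """Exact pmf of w_hat on the grid {0, 1/n, ..., 1} and its normal approximation."""
    k = np.arange(n + 1)
    binom_pmf = stats.binom.pmf(k, n, w_true)          # exact distribution of n * w_hat
    normal_pdf = stats.norm.pdf(k / n, loc=w_true,
                                scale=np.sqrt(w_true * (1 - w_true) / n))
    return k / n, binom_pmf, normal_pdf / n            # rescale density to the 1/n grid

for w_true, n in [(0.5, 50), (0.01, 10), (0.01, 1000)]:
    grid, exact, approx = mle_distributions(w_true, n)
    max_err = np.max(np.abs(exact - approx))
    print(f"w={w_true}, n={n}: max |binomial - normal| = {max_err:.4f}")

# The approximation is already good for the fair coin at moderate n, but for
# w = 0.01 it only becomes reasonable once n is large enough that the normal
# density no longer spills over the impossible region w_hat < 0.
```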
The Fisher information matrix is used to approximate the distribution of the maximum likelihood estimator when that distribution cannot be computed directly. It is the covariance matrix, taken over data sampled from the model at a given parameter, of the gradient of the log likelihood. Pathological points in parameter space, even when the true distribution does not sit exactly on them, make the estimator distribution harder to approximate by a normal distribution. Asymptotic normality is a property of maximum likelihood estimators of regular models.
The Fisher information matrix is the covariance, under the model at w, of the vector of partial derivatives of the log likelihood; its (i, j) entry is an integral against the distribution at w (or a sum, for discrete random variables). In one dimension it reduces to a scalar function, the variance of the derivative of the log likelihood, and when the log likelihood is sufficiently nice it equals the negative expected second derivative, that is, the expected curvature of the log likelihood. As an example, in the coin toss model the probability of heads is the parameter w and the probability of tails is 1 - w; carrying out the computation gives the Fisher information of this model.
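Written out as a worked example (the computation the speaker alludes to), the Fisher information of the coin (Bernoulli) model is:

```latex
\[
  \log p(x \mid w) = x \log w + (1 - x)\log(1 - w), \qquad x \in \{0, 1\},
\]
\[
  I(w) = \mathbb{E}_{p(x \mid w)}\!\left[\left(\frac{\partial}{\partial w}\log p(x \mid w)\right)^{2}\right]
       = w\left(\frac{1}{w}\right)^{2} + (1 - w)\left(\frac{-1}{1 - w}\right)^{2}
       = \frac{1}{w(1 - w)},
\]
which diverges as $w \to 0$ or $w \to 1$ and is undefined at the endpoints.
```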
The MLE is an unbiased estimator of the true parameter, and its asymptotic variance is 1/n multiplied by the inverse of the Fisher information; since the inverse of a variance is called the precision, the Fisher information can be read as the asymptotic precision of the estimator. When the regularity conditions are satisfied, the inverse Fisher information matrix is the covariance of the limiting normal distribution. For the coin model the Fisher information is a scalar function with asymptotes at the endpoints w = 0 and w = 1, where it is not defined.
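A small simulation sketch, not from the seminar, illustrating the asymptotic-normality statement for the coin model; the true parameter w0 = 0.3, the sample size, and the number of trials are arbitrary illustrative choices:

```python
# Sketch: check empirically that sqrt(n) * (w_hat - w0) has approximately
# variance I(w0)^{-1} = w0 * (1 - w0) for the coin model.
import numpy as np

rng = np.random.default_rng(0)
w0, n, trials = 0.3, 2000, 20000

# Each row is a dataset of n coin flips; w_hat is the MLE (fraction of heads).
flips = rng.random((trials, n)) < w0
w_hat = flips.mean(axis=1)

scaled_error = np.sqrt(n) * (w_hat - w0)
print("empirical variance :", scaled_error.var())
print("1 / Fisher info    :", w0 * (1 - w0))   # should be close for large n
```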
A model is regular if it satisfies two conditions: non-degeneracy, also called identifiability, which requires that different parameters produce different distributions of x; and invertibility of the Fisher information matrix everywhere, meaning that all its eigenvalues are strictly greater than zero (positive definiteness). A model is strictly singular if it violates either of these conditions, and "singular models" is an umbrella term covering both regular and strictly singular models. An example of a singular model comes from modelling the person tossing coins on the other side of the door as choosing between two coins, even though exactly the same sequence of heads and tails is observed.
A two-parameter model describes a situation in which the person randomly chooses one of two coins from a box, one a fair coin and one a possibly biased coin. The probability of landing heads equals the probability s of choosing the fair coin multiplied by one half, plus the probability of choosing the other coin multiplied by that coin's probability w of landing heads. The Fisher information matrix of this model has entries carrying factors of (w - 1/2); when w equals one half those entries and the off-diagonal components vanish, leaving only the diagonal entry for w, so the matrix is degenerate along that line. The example is taken from Shaowei Lin's thesis; the speaker notes that his answer differs from the one given there and asks for the calculation to be checked.
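Below is a sketch of one way to compute this matrix symbolically, via the chain rule from the Bernoulli Fisher information; it is a re-derivation under the parametrisation above, not the speaker's own blackboard calculation:

```python
# Sketch: Fisher information of the two-coin model with parameters (s, w), where
#   p(heads | s, w) = s * 1/2 + (1 - s) * w.
# The observable is Bernoulli(theta) with theta = s/2 + (1 - s) * w, so by the
# chain rule  I(s, w) = (grad theta)(grad theta)^T / (theta * (1 - theta)).
import sympy as sp

s, w = sp.symbols("s w")
theta = s / 2 + (1 - s) * w

grad = sp.Matrix([sp.diff(theta, s), sp.diff(theta, w)])   # [1/2 - w, 1 - s]
fisher = (grad * grad.T / (theta * (1 - theta))).applyfunc(sp.simplify)

sp.pprint(fisher)
# At w = 1/2 the (s, s) entry (w - 1/2)^2 / (theta * (1 - theta)) and both
# off-diagonal entries vanish, leaving only the (w, w) entry 4 * (1 - s)^2,
# so the matrix is degenerate along the whole line w = 1/2.
sp.pprint(fisher.subs(w, sp.Rational(1, 2)).applyfunc(sp.simplify))
```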
The KL (Kullback-Leibler) divergence is an important quantity related to the Fisher information. It can be read as the excess surprise incurred when events generated by one distribution are modelled by another, and is denoted as the KL divergence of distribution P from distribution Q, with the expectation taken over Q. It is always non-negative and is zero exactly when the two distributions are equal. It is related to the Fisher information matrix: the Fisher information at the true parameter equals the second derivative of the KL divergence, viewed as a function of the parameter, at the true parameter.
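For the coin model this relationship can be written out explicitly; the following worked example (not in the original summary) assumes realizability, q(x) = p(x | w0):

```latex
\[
  K(w) = \mathrm{KL}\bigl(q \,\|\, p(\cdot \mid w)\bigr)
       = w_0 \log\frac{w_0}{w} + (1 - w_0)\log\frac{1 - w_0}{1 - w},
\]
\[
  K(w_0) = 0, \qquad K'(w_0) = 0, \qquad
  K''(w_0) = \frac{w_0}{w_0^{2}} + \frac{1 - w_0}{(1 - w_0)^{2}}
           = \frac{1}{w_0(1 - w_0)} = I(w_0).
\]
```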
KL divergence can be thought of as a kind of distance between two distributions. The realizability assumption states that there is a parameter w0 whose model equals the true distribution. The Fisher information equals the curvature of the KL divergence function K(w) at the true parameter. The KL divergence itself is not a metric, since the divergence from w1 to w2 differs from the divergence from w2 to w1 and the triangle inequality fails; however, when the Fisher information is non-degenerate it defines a Hessian (Riemannian) metric, as in information geometry.
The function K(w) is the KL divergence between the true distribution q and the model at w. The empirical log likelihood ratio K_n(w) is the average over the dataset of the log ratio between q and the model at w, and its expectation (and, by the law of large numbers, its limit) is K(w). Under realizability there is at least one parameter w0 whose model equals the true distribution q; the set W0 of all such parameters coincides with the zero set of K, by the properties of the KL divergence.
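A short sketch, not from the talk, comparing the empirical log likelihood ratio with its expectation for the coin model; the particular numbers are illustrative:

```python
# Sketch: empirical log likelihood ratio K_n(w) = (1/n) * sum_i log(q(x_i) / p(x_i | w))
# versus its expectation K(w) = KL(q || p(.|w)), for the Bernoulli model.
import numpy as np

def log_bernoulli(x, w):
    return x * np.log(w) + (1 - x) * np.log(1 - w)

rng = np.random.default_rng(1)
w0, n = 0.3, 10000                     # true parameter and sample size
x = (rng.random(n) < w0).astype(float)

for w in [0.1, 0.3, 0.5, 0.9]:
    K_n = np.mean(log_bernoulli(x, w0) - log_bernoulli(x, w))        # empirical ratio
    K = w0 * np.log(w0 / w) + (1 - w0) * np.log((1 - w0) / (1 - w))  # exact KL
    print(f"w={w}: K_n={K_n:.4f}  K={K:.4f}")

# By the law of large numbers K_n(w) -> K(w); both vanish at the true parameter w = w0.
```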
Information geometry and singular learning theory share a basic setup but have different focuses: information geometry studies the effects of curvature, while singular learning theory concentrates on the places where the Fisher information metric degenerates. The singularities of a model are the non-smooth points of the zero set of the KL divergence, and the geometry near these singularities can determine learning behaviour. Watanabe's work is aware of information geometry; the author thanks Professor Shun-ichi Amari in the preface of his book.
Information geometry and resolution of singularities are related topics, but no direct connection beyond the shared definitions was identified in the discussion. Standard information geometry assumes a non-degenerate metric, so its tools may not apply at exactly the points where the metric degenerates; whether information geometry could be carried out on a resolved space was raised as an open question. There is a supplementary session with Liam and Edmund, where Edmund is also available to answer further questions about singular learning theory.

Raw Transcript


um so this is slt single learning theory um seminar session and the main session two um today uh it will be mainly about um the following things that you see in the first part there so my goal is to motivate and introduce facial information chao divergence and give the definition to what singular models are and hopefully every time we we just introduce bayesian statistics um but probably not and and throughout we will hint at the role of geometry in learning so throughout please feel free to uh comment um in the comment box or something to type out for your comments or just interrupt me okay with that i think i'll just move immediately to the second boat okay so i think i'll pick up from um the supplementary session last time so if you recall we were talking about um a coin flipping person on the outside of the door and he gave us a bunch of observations a bunch of coin flips and our task is to estimate the probability of landing heads um when he throw that coin again um well we see that statistical influence involves the following object so we have what we call the true distribution or sometimes known as the generating process it generates a bunch of observations which collectively will call the data set and then through some learning process something called um that we shall call p-hat is produced which is supposed to be an approximation to the truth to the true distribution and we saw that one way to have such a learning process is to first define a model which is a probability a model is a probability distribution a family of probability distribution parameterized by a parameter living in some parameter space w and together with the data set and the model we can define something called the likelihood function which is a function of the parameter it tells us for each parameter what is the likelihood of observing the data we observe and here we have used the iid which is identically an independently distributed sample so meaning each sample comes we're assuming come from the same distribution and they are independent hence the factorization so the learning process with this model would be something akin to finding the maximum likelihood estimate uh which is just at max so the maximum w the parameter the parameter w that maximizes the likelihood function in which case the mle approximation to to to solve this uh learning problem to solve the statistical inference problem would be just a step p hat the approximation that we want to be the model at the mle and
setting setting the approximating distribution to be exactly a model at a point to choose a parameter is called in general uh point estimation because we are choosing a point in the um parameter space the best point of some sort according to some measure and the measure in this case is just the likelihood function okay let's move to the next board i'm hoping the ops is following okay okay we also see that oops this point estimation mle or otherwise um which we call the estimator because it did estimate the row is its row is to estimate the parameters is a random variable because it depends on the random training samples produced by the true distribution which we call the end this thing is random and um w hat estimator depends on it hence it is itself random and therefore it has a distribution which is usually called the estimated distribution so in the coin flip case we saw that the estimator distribution was um for something like i don't know why that line got in getting um is actually a binomial distribution and by the normal final theorem that says that when the number of binomial trials becomes larger and larger the binomial distribution actually gets approximated by the normal distribution very well however if you actually do the sampling and do this experiments a lot of a lot of times and for different true value meaning for example if the coin is extremely biased versus if it is not very biased let me draw that out so um in the completely fair coin case you will be um you you might perhaps generate say um 50 50 50 observations and the normal approximation is already a really good approximation even at even near the tail end however if you use the same uh if if the coin were a very biased coin for example its true probability of landing head is 0.01 then your data set if you only have 10 flips most of it will probably be um tails in which case um your your binomial distribution will be very very skewed and the normal distribution it will still be a fairly good approximation near the mean near the peak but then if you don't have enough sample the normal distribution will have very very low variance and hence it will extend into like the impossible region which means the normal approximation is a really bad estimate when n is not not very large you need as you go in this direction you need larger sample size to get um good normal approx summation um so you need the sample size to be big enough that the normal distribution becomes that sharp and even worse near the end point here
for example not for example it's precisely the zero and one endpoint so if your coin is literally both heads of both tails then there is no no normal approximation possible so those are the pathological points that that should concern concerners in singular learning theory and we see that even when the true distribution is not pathological like it is not at the pathological points the pathological points kind of reach out to its neighborhood and influences um how hard uh how difficult and uh the distribution can be estimated by a normal approximation so that's kind of a hint at pathological points or as we will see later on which we shall call singularity um influencing uh the behave the learning behavior on things um some comments on the on this special case of the quantity models that this situation with the mle being um asymptotically oops i'm symmetrically normal so this property of being able to be approximated by a normal distribution as the sample size gets larger and larger is called asymptotic normality um this is not uh this is not a cause coincidence this is um this this actually holds this whole this property holds for mle so maximum likelihood estimator of regular model or something called regular models but this is where we shall define something called official information matrix to see what regular models are um okay let's move to the next board is there any question at this stage the outcome is very skewed just come back to your left a little bit all right come back to the left yeah okay uh that's too much too fun just stop okay that that look that look good okay cool okay cool um so even in just slightly more complicated situation like slightly more complicated than the bernoulli variable that we're dealing with in the coin toast case we won't have the luxury of computing the estimated distribution which turned out to be the binomial distribution in the quantities case directly which means to find asymptotic approximation to find the asymptotic distribution of the estimator we need some some other machinery so this is where let us define the fischer information matrix the facial information matrix of a given model is the expression we denote i the function of the parameter it's equal to i'm going to write out the the idea first is equal to the covariance matrix where the covariance is taken taken over as if the data is sampled from the model sample from the distribution at the model parameter prescribed by the input um and it's a convenience
so interrupt could you just move closer to the wall just stand closer to it close that okay where's the oh over here yeah you see if you just stand in front of it people will be able to hear your best okay cool this is better yeah i think so okay cool thanks okay so it's a covariance of using the distribution at w of the following vector so it's a vector of partial derivatives of the log likelihood so let's complete the definition first before i give the the intuition so in component form um the expression so this is a matrix the comp the i j component is given by the integral of the sum if you are dealing with discrete random variables so that's this distribution at w so in one dimension just to give us some intuition about this thing uh in one dimension this reduced to a scalar function so it's a one by one matrix so it's a scalar function which is equal to this expression so this is saying so assume that the data comes from the distribution there and we ask what is the variance because this is the diagonal component of covariance matrix is the variance of what is the variance of this um gradient the gradient of the log likelihood so recall that we use this expression to get mle the condition for at least a necessary composition for uh for a parameter to be the maximum likelihood is that this is zero to the local maxima is that this is zero um we use this to get to derive an expression for the mle but perhaps even more even better for intuition would be if the log likelihood is nice it's sufficiently nice um then we actually have a better expression um for this which is equal to the negative of the second derivatives so this is not the square of the derivative anymore this is the second derivative of the log likelihood so what is this expression this expression is the expected um expected value of the second derivative so perhaps a better way of saying that this is um again assuming that the data is coming from the true parameter what is the expected curvature of the log like um let's move to the first boat again and i'll give a few examples of the official information okay let's give the example using our using the model that we have been discussing throughout which is the coin source model so recall that the probability of heads at the model parametrized by w is just given by w where w is just thought of as the probability of the coin landing heads and which forces the probability of landing tails to be one minus w and that if you go through the computation you get
the facial information matrix in this case is equal to this expression if you plot the graph so that's the w axis this is the um official information which is a scalar function in this case there's an asymptote near the endpoint it's quite flat there that's the function right and notice that it is not defined it's not defined at the end point and also notice something quite important so last week we derived that the estimator converges in distribution to the normal distribution with um mean equal to the true parameter so that's the unbiased statement that the mle is an unbiased estimator of whatever the true value is and the variance for the variance was computed last week using the variance of the binomial distribution and it's equal to this expression so that's the variance and we notice that this expression is actually equal to one over n multiplied by the inverse of the fischer information so there's a kind of a name for the inverse of something being the variance which is called the precision so the facial information can be integrated as a precision of the estimator but to be more precise it is the asymptotic precision so this is actually not a coincidence again because in regular model mle is asymptotically efficient so this property it's saying that as an estimator to estimate w node actually let me write this as um oops that's too much anyway let me uh go to the next bowl and write this as a as a theorem okay so theorem so under some regularity condition which is a little too complex to state in its full generality but you should think of um what are the various different um assumptions that we made when we calculate our maximum likelihood estimate as we did last week for example either the logarithmical function should at least be differentiable under some regularly condition we have the following convergence so this is for any model this is not the coin toss model anymore so this is just any maximum likelihood estimate for some model of some truth which we also assume realizability meaning the true distribution the generating process is actually part of the family of model under those condition this random variable converges in distribution to the normal distribution with mean zero and the picture information matrix in this case um as the precision matrix which means the inverse of the matrix so that's the inverse is the covariance of the normal distribution this is multi-multivariate normal so even under those regular even if some of the regularity conditions are satisfied
we still need something for this statement to even make sense which is that the inverse exists we need this to be true and this is perhaps the more important condition in what we meant by a regular model so definition a model is regular if it satisfies two conditions one is the non-degeneracy condition that dan was alluding to last week and it is phrased as it is the identifiability of the model if it is identifiable this means that if you have two different model parameters you should have different model different parameters parameterize different models so um so not equal so not equal as distribution of x so it could be equal um out uh outside of the zero set for example sorry um i meant it's not equal as distribution so at the con the condition is just for um a set of outside of the set of measure 0 is fine if it's assumed continuous then this is just not equal as function as function of x the second condition is the condition that we were just talking about which is that the facial information matrix of this model is everywhere invertible this is actually not how watanabe stated in single leveling theory but this is equivalent to the condition that the fission information matrix is everywhere positive definite so if you recall that the matrix itself is um is a covariance matrix which means that it is it has real entries and it is symmetric um and because of that we have real eigenvalues it is diagonalizable so and also you can prove that because it's a covariance matrix it is definitely positive semi-definite which means that all its eigenvalues are greater than or equal to zero it's non-negative um but we do need uh for a model to be regular you need it to be definite meanings meaning that all the eigenvalues are strictly greater than zero which is for um for positive sense of in the family of positive seven definite matrix that those are precisely the invertible uh matrices and here we can define a model we can define what singular models are a model is called strictly singular if it is not regular so if it violates one of those two conditions and singular models is just an umbrella term that encompass both the regular model and strictly singular models any question at this stage let's move to a new board okay let's give an example of a singular model we um we saw an example last time where we could have a more com complex model of the person tossing coins on the other side of the door even though we observed the exactly the same sequence of head and tail
tells we could assume that he is doing something nefarious and he's actually choosing one one of two coins from a box at random one of which is a spare coin and one of which is the bias coin and our model in that case uh become a two parameter models so um we have two parameters where s is the uniform the probability of choosing the fair coin which means that the probability of landing heads is well first the probability of choosing the fair coin and then multiplied by a half which is the probability of landing heads in a fair coin plus the probability of choosing the other coin which have probability of landing heads um measured by w and conversely not conversely on the other hand the probability of lending tails is just 1 minus the expression or you can just work out the logic again it's this expression and note that this is not identifiable meaning it has some degeneracy in particular if um the supposedly biased coin in the box is also it actually have probability of landing heads of w equal to half meaning it is itself a fair coin then this model collapsed down to just a fair coin model um this is just saying that if w is a half then the probability of landing heads is the same as the probability of landing tails which is equal to a half for every value of the parameter s in the in its own parameter space so this is the degenerate condition so you we have um two parameters one is w one is s um this line of that equal to half all parameterize the same model and i spend some time computing the facial information matrix in this case so if we call this expression a and this expression b because i don't want to write them out a lot this facial information matrix is actually given by w minus half square one over a and because it's symmetric i won't write this out it's just the same as the off-diagonal component so that's the facial information matrix and you observe that on that line of degeneracy when w is equal to half this so that term becomes zero um and then when w equal to half the expression a and b are actually the same thing so this is zero so the only non-zero term is this term um if anyone have time please do check my calculation and see that this is correct because that someone um so this example is taken from um shall we len thesis and he actually got a different answer from them than i did so i haven't checked mine he got aid here but someone please to check let's move to the second part so this is an example of a degenerate phishing information matrix
at some parameter moving to the next board any question about this sofa all right i think i can move too far again oops come back okay stop stop okay that's that's fine i guess if there's no question um this is perhaps a good time to introduce another very important quantity which um which i shall explain what what its connection to the visual information is which is called the coolback liver lip blur divergence aka the kl divergence so it's between two distribution um p and q but it's it's not it's not a symmetric function between the two distributions so we usually phrase it as the the kl divergence of a distribution p of x from another distribution q of x denoted as such is equal to the expected and expectation is taken over q the first distribution of the following expression if anyone is familiar with information theory recall that the log of a probability of an event is sometimes known as the surprise of the event um for example if i tell you that um the sun will rise in the east tomorrow morning that is not a very surprising event and therefore the information content from someone telling you that the sun has indeed rise in the east um is a very low value piece of observation so the information is uh low in that case um and the log of that is um um uh it's very in that case the surprise is very low um so this is kind of the difference in surprise um with observation taken from q so this is like saying if you have um if the true if the true event generator is q but your belief is that you are modeling it with p what is your excess surprise um that's kind of just the intuition of it and it's equal to the expression of if you write out is equal to this expression and some important properties is that it is always positive non-negative and it is zero if and only if um q is equal to p um well almost everywhere um well it means that they are the same as distributions um i think throughout we will assume that our distributions are at least continuous which means this is just equality of function of x um any question about this okay uh if not well i'll mention its connection to the facial information matrix so recall that the fission information let me just um write it here again if w naught is the true parameter um and again assuming that the log likelihood function is nice it's sufficiently nice then this reduced to the second derivative i'm going to use the one variable case um for now um so it is that and um sorry those are w0 and if the true distribution is actually realizable and
is equal to um the model at w0 then this is q of x and the difference between if um and assuming also that you can swap um expectation and derivative that's um that's that apology about the various assumption but you can prove this in some nice situation with dominated convergence this is not too strong a condition so if you can sort that then you can see that this expression is actually equal to the second derivative of the approved expression of this expression at the true parameter so this is the relationship between fusion information and the care divergence so this is saying that facial information is equal to the curvature of uh whoops i forgot to define some another thing which means that some of the things i'm saying here is not uh doesn't really make sense okay let me define it here um sorry small definition that is related to the code divergence here so in the setting of having a model p of x w we define a function called k of w which is what i'm trying to talk about below here to be equal to the kl divergence between the truth the true distribution and the model at w okay so let me complete my sentence below before i draw some pictures the facial information is equal to the curvature of this function and this uh rather important function at the true parenthesis okay let's move back to the first board to see some pictures and finally see some geometry hints geometry so if we think of our parameter space w as some blob well it's a parameter space so embedded in some d-dimensional euclidean space each parameter w1 so sort of represents uh represents a model a guess of what the true distribution is the realizability assumption just says that there is some there is a parameter called w naught somewhere whose which represents whose model that it represents is equal to the true distribution and the function um the qubit linear divergence of the kl divergence between two distributions can be made into some sort of quote unquote distance so distance between uh two distribution so it can be descended into a distance um of the parameters however the chao divergence is actually not a metric um in the sense of metric space it is not symmetric as we said so the distance from w1 to w2 is not the same as w2 to w1 and it doesn't satisfy the triangle inequality however there is some sense in which you can make it into an actual distance by uh making it a hessian metric which is a romanian metric um but that's only if um the hessian which is which as we discussed before is equal to
the fischer information is non-degenerate um so that only occurs in regular models so the reference here should be um sort of uh information geometry um which assume regularity and therefore can define a metric and then the actual distance between the parameters is given by um the length of the short shortest path using that remaining metric but for others um uh we would just think of it as quote unquote distance um and the the function k um is just the quote-unquote distance between um the true true parameters and other parameter okay i should also um introduce while we are at it for a given model oops um that is modeling observation so that is generated from a true distribution q of x we define the empirical log likelihood well not quite log likelihood because we have that already is the log logical ratio which is um the ratio between so this is the likelihood ratio between q the truth and the model at w and we are taking the average of that over the data set so this is called the log likelihood empirical likelihood ratio it's it's a random variable because it depends again on the data set and if you take expectation over the data set so expectation over the queue well you recover the kl divergence the k function um which sometimes is called the expected log language ratio due to this reason um and so by the law of large number this expression on the right converges in distribution um 2 2 to the righthand expression which means that this expression in red is approximated really well by the better expression on the on on the left when n is really large and just to point out that k n is equal to if you write this out in terms of two sums is equal to well negative one well the second term is just the log the average log loss the log likelihood is in here and it's negative and taken taken an average of this average log loss loss and the first expression is the empirical estimate to the entropy of the traditional distribution okay let me draw another picture here so it could occur that we have this realizability condition but the non-identifiability possibility is can be represented pictorially as having an entire sub well it's a variety of things of parameters that represent exactly q so this set which we will denote as capital w 0 is equal to the set of w such that the model at w all equal to the true distribution which because of the properties of kr divergence this is equal to the zero set okay and this this set of parameters could have something like that occurring
which would be a singularity which this is an example of a singularity so some other example singularity for for playing curves for curve that i can draw on the black box here would be something like the cuffs singularity and this is um sort of a crossing um or you could have something doing some tangential um business so and what i would say for now and would be made rigorous and make clear is that the geometry near singularity let me be a bit hyperbolic and say that it determines learning behavior um okay so uh that's the hint of um the role of geometry in statistical learning that i'm going to give this lecture i realized that there is a lot of technical definition to get through and this is um this is all we can get up to for this lecture and i think we will see the consequences of this and maybe set up valuations bayesian statistics the bayesian point of view instead of the maximum likelihood estimator there is there is tons of other different ways you can construct the approximating distribution p hat from the data set and from models we'll see the effects of singularity of this set in the future so from now on when we talk about singularity of a model we meant singularity that occurs at the zero set of the kl divergence of the model uh we mean this w zeros at so any points any non-smooth points in this set any question that's the that's all i am going to say for this talk is there any question yeah i have a maybe not a question but maybe a broad comment right so information geometry doesn't really appear uh all that much in uh guatemala watanabe's work right um is the is there a place for this stuff like with with the you know reminding metric on that on that manifold or is it oh it kind of a little bit separate from the singularities that they're talking about um so i was just discussing this with dan and um i actually noticed uh from from like the fifth read-through of the first chapter that in the preface of watanabe book the author actually thanked professor shinichi amari which is which wrote a definitive textbook on information geometry in in his book so data definitely knows each other um and i i think at the moment the the concern of these two different fields even though they share sort of the basic setup is quite different whereas in slt we focus on where the interfacial information metric goes wrong and in information geometry we uh they focus it that they focus on um the effects of curvature and everything like that the i don't actually know much about
information geometry and um uh the language seems to be the same but um i don't know of any direct connection to information geometry beyond the shared definitions okay cool it could be a very important thing to explore though like is there any implication of singularity and resolution of singularity um after blocking slt to uh information geometry um uh i couldn't comment on that maybe my very naive thought that if there's more structure on the space like you know there's like a degenerate metric right um you know the can we use it maybe not because it kind of flies off right that's exactly where those points where the metric is degenerate or singular or whatever and maybe tools from reminding geometry it's just not equipped to handle it and you have to resolve it right yeah yeah okay yeah that makes sense i think dan mentioned something about if it is minimally singular um i forgot import sense like when it is just degenerate in a few directions and uh skew maybe okay minimally singular i think uh reference dance deep learning in singular paper appendix i i wouldn't try to recall the definition here then information geometry actually have something to say about things that it sort of still applies to fixable um but i think you made a really good point which is like um what if we could do information geometry on the resolve space um maybe i don't really know the implication that may be able to do it you're going to have trouble extending that metric over the exceptional locus though yeah yeah you know there's certainly not a unique way of doing it right uh even if you can uh anyway yeah thanks yeah yeah no worries um any other question um i think for some of the um singular learning theory people here this is very much a revision i'm sorry if that's a bit boring is there a supplementary session today yeah i think there is um liam sure is all right sorry i'm supplementary i didn't actually set up if people want to i can i can start writing out some basic specific stuff while we're on the tea break actually i think i will do that if people go off i just write it on the board or something cool sounds good enough now okay cool cheers all right so i believe this is t break time okay um okay so um i guess stop the recording and i will just start writing some patients stuff of um well you are welcome to ask me any other question about similarity theory or ask liam about it thanks for the talk edmund it's very great thank you um i guess i can from the okay thanks eaton