WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed.
The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except
for errors at the level of individual words during transcription.
This video explores a toy model that compresses an n-dimensional feature space into an m-dimensional latent space and reconstructs it. Experiments reveal how the sparsity parameter s affects the trained model: higher sparsity allows more features to be encoded, but the encoded features can interfere with each other. Results show that at low sparsity the model picks out the two most important directions in the dataset (importance decreasing with feature number), and the discussion notes that the true distribution of the inputs can change during training.
This paper explores a simplified toy model trained on a toy dataset. The task is to compress and decompress an n-dimensional feature space into an m-dimensional latent space, and varying a sparsity parameter reveals that the model chooses different strategies. The dataset is generated by setting each component of an n-dimensional vector to zero with probability s, with the remaining components drawn from a uniform distribution between 0 and 1; this parameter s controls the sparsity of the dataset and is varied in the experiments. The paper frames superposition as a mechanistic account of polysemanticity, and the discussion touches on how these results relate to the conceptualization of phase transitions in SLT.
A linear model is used to demonstrate that there is no superposition when mapping from one layer to the next, and to set a baseline. Each input feature is weighted differently in the loss by an importance parameter; the experiments use a single parameter representing the decay rate of feature importance, and phase diagrams are drawn over importance and sparsity. The importance weighting incentivizes encoding more important features at the cost of slightly less important ones. The model consists of an encoding step that projects into the hidden space with a parameter matrix, and a decoding step that maps back up; the non-linear model additionally applies a ReLU. An example shows where each one-hot feature vector is sent in the latent space.
Sam was replicating the results of the paper, which uses ReLU networks, using a swish activation function instead. He observed that increasing the sparsity allowed more features to be encoded and fitted, but with higher sparsity the encoded features had more potential to interfere. With the swish activation the results became slightly messy, with some values not quite at zero; making the sharpness parameter gamma large gives a very good approximation to ReLU. The loss function included an importance weighting, which was demonstrated graphically with a five-dimensional feature space and a two-dimensional latent space. At zero sparsity the model picks the two most important directions in the dataset, with importance decreasing with feature number.
A plot is used to categorize models according to relative importance and sparsity. The categories are Not Represented, Dedicated Dimension, and Superposition, and refer to how the second feature is represented by the model. During training, the true distribution of the inputs can change, with early layers potentially delivering unstructured activations and later layers delivering something sparser with a clearer hierarchy of importance. This varying true distribution can be relevant if the later layers learn on a different (slower) time scale from the earlier layers. The paper also has plots showing the different geometric configurations that appear as the density of the inputs changes.
Sam is an undergraduate at ANU who did a vacation scholarship at the University of Melbourne over the summer. He was interested in AI alignment and took a project course on interpretability last semester. His work focuses on feature-level interpretability, trying to figure out what specific parts of neural networks do. The toy models paper picks up a loose end left by the Distill circuits threads, in which images were optimized to maximally activate a particular neuron in a network.
This paper explores a simplified toy model trained on a toy dataset. The task is to compress and decompress an n-dimensional feature space into an m-dimensional latent space, and varying a sparsity parameter reveals that the model chooses different strategies. The paper frames superposition as a mechanistic account of polysemanticity, and the discussion touches on how these results relate to the conceptualization of phase transitions in SLT.
The data set is generated by setting each component of an n-dimensional vector to zero with probability s, with the remaining components drawn from a uniform distribution between 0 and 1. The parameter s controls the sparsity of the data set and is varied in the experiments.
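A minimal sketch of this sampling procedure, assuming independent components (the function name and dataset size are made up for illustration):

```python
import numpy as np

def generate_dataset(num_points, n, s, seed=0):
    """Sample num_points n-dimensional vectors.

    Each component is set to zero with probability s; the remaining
    components are drawn from a uniform distribution on [0, 1).
    """
    rng = np.random.default_rng(seed)
    values = rng.uniform(0.0, 1.0, size=(num_points, n))
    keep = rng.uniform(size=(num_points, n)) >= s  # keep a component with probability 1 - s
    return values * keep

# e.g. a small, fairly sparse dataset with n = 5 features
X = generate_dataset(num_points=1024, n=5, s=0.8)
```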
Real data sets are often sparse: for example, if each entry of a vector counts the occurrences of a particular word in a sentence, most entries will be zero for most sentences. A network layer can also be thought of as a vector of activations, and if the activations in a deep layer carry semantic content, they will tend to be sparse in the same way; this motivates the sparsity parameter s. The loss function additionally weights different features differently by importance. Together these are the justification for training the model on this synthetic data set.
The paper starts with an importance parameter per coordinate, but the experiments use a single parameter representing the decay rate of how important each input is. This is used to draw phase diagrams where the x-axis is the relative importance and the y-axis is the sparsity. The importance weighting incentivizes the model to encode more important features at the cost of slightly less important ones. The importance parameters are chosen by the implementers and are intended to match what is reasonable in real-world data.
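If the decay rate is called k (the notation here is taken from the discussion in the transcript below; the paper's own notation may differ), the importance weights take the form:

```latex
I_i = k^{\,i-1}, \qquad i = 1, \dots, n, \quad 0 < k \le 1
```

so the first feature has importance 1 and each subsequent feature is a factor of k less important.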
A linear model is used to show that there is no superposition and to set a baseline. Both models encode by multiplying by a parameter matrix W, projecting the input into the hidden space; the linear model decodes by mapping back up with a bias, and the non-linear model applies a ReLU on top of the same map. An example is given with n=4 and m=2, where the columns of W are visualised as vectors in the latent space, showing where each one-hot feature vector is sent.
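A sketch of the two models as described (shapes only; the weights here are random placeholders, not trained values):

```python
import numpy as np

n, m = 4, 2                       # feature and latent dimensions from the example
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))       # column i of W is where the i-th one-hot feature lands in the latent space
b = np.zeros(n)

def encode(x):
    return W @ x                  # project from R^n down to R^m

def decode_linear(h):
    return W.T @ h + b            # linear model: map back up, plus a bias

def decode_relu(h):
    return np.maximum(W.T @ h + b, 0.0)   # non-linear model: the same map followed by a ReLU

x = np.array([1.0, 0.0, 0.0, 0.0])        # a one-hot feature vector
print(encode(x))                           # the first column of W
print(decode_relu(encode(x)))              # the model's reconstruction of x
```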
The loss function is a standard squared-error reconstruction loss, except for an importance weighting. This is demonstrated graphically with a five-dimensional feature space and a two-dimensional latent space. When sparsity is set to zero and the non-linear model is trained, W1 and W2 are placed orthogonally to each other, while W3, W4 and W5 are set to zero: the model keeps only the two most important features.
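Written out, the importance-weighted loss described above is roughly the following (reconstructed from the discussion; the notation is approximate):

```latex
L \;=\; \sum_{x \in X} \; \sum_{i=1}^{n} I_i \,\bigl(x_i - x'_i\bigr)^2,
\qquad x' = \mathrm{ReLU}\!\left(W^{\top} W x + b\right)
```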
At zero sparsity the non-linear model behaves like the linear model, picking out the two most important directions in the dataset (importance decreasing with feature number). With higher sparsity, more and eventually all of the features are represented and encoded. This is because higher sparsity means it is less likely that multiple features need to be encoded at once, which is when they could interfere; with even higher sparsity still more features can be fitted, at the cost of greater potential interference if they do happen to co-occur (illustrated in the sketch below).
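A deliberately tiny example of the interference being described, with two features sharing a single latent dimension in antipodal directions (hand-picked weights, not trained ones):

```python
import numpy as np

W = np.array([[1.0, -1.0]])   # shape (1, 2): feature 1 -> +1, feature 2 -> -1 in the latent space
b = np.zeros(2)

def reconstruct(x):
    return np.maximum(W.T @ (W @ x) + b, 0.0)

print(reconstruct(np.array([1.0, 0.0])))  # feature 1 alone is recovered: [1. 0.]
print(reconstruct(np.array([0.0, 1.0])))  # feature 2 alone is recovered: [0. 1.]
print(reconstruct(np.array([1.0, 1.0])))  # both at once cancel in the latent space: [0. 0.]
```

With high sparsity the two features are rarely active at once, so this cancellation rarely costs anything and the superposed encoding becomes worthwhile.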
Sam replicated the results of the paper, which uses ReLU networks, with a swish activation function instead. He observed that as the sparsity increased, it became a worthwhile bet to take the risk of packing in additional features, and at certain points there is a phase transition between discrete strategies. With swish the replicated models were slightly messy, with some values not quite at zero; the swish sharpness parameter gamma, if made large, gives a very good approximation to ReLU.
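One common parameterisation of swish with a sharpness parameter is x times sigmoid(gamma * x), which approaches ReLU as gamma grows; the exact form used in the replication is an assumption here:

```python
import numpy as np

def swish(x, gamma):
    # x * sigmoid(gamma * x); for large gamma the sigmoid approaches a step function
    return x / (1.0 + np.exp(-gamma * x))

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-3.0, 3.0, 601)
for gamma in (1.0, 10.0, 100.0):
    print(gamma, np.max(np.abs(swish(x, gamma) - relu(x))))  # the maximum gap shrinks as gamma grows
```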
A plot of relative importance and sparsity is used to categorize models. The plot has two axes: the relative importance of the second feature (I2/I1) and the sparsity (or, equivalently, the feature density). The models are categorized into three classes, Not Represented, Dedicated Dimension, and Superposition, which refer to how the second feature is represented by the model.
A plot illustrates how a model encodes two features. White indicates that the second feature is not represented at all, blue that it gets its own dedicated dimension (with the first feature ignored), and red that the two features are combined in superposition. Depending on the balance of the two features, the optimal strategy shifts in a discontinuous way. This type of phase transition is associated with varying the true parameter or true distribution, rather than the prior, which is what is more commonly discussed in SLT.
The discussion frames the inputs as activations in a deep layer of a network. Over the course of training, the true distribution of those inputs can change, with early layers initially delivering unstructured activations and later delivering something sparser with a clearer hierarchy of importance. A varying true distribution can therefore be relevant if the later layers learn on a different (slower) time scale from the earlier layers. The paper also has plots showing the different geometric configurations that appear as the density of the inputs changes.
The paper presents a geometric picture of how different behaviors emerge in small neural networks. One key plot has on the y-axis m divided by the squared Frobenius norm of W, where the squared Frobenius norm acts as a count of the number of features represented, and on the x-axis 1/(1-S). The configuration of the eigenvalues of W here is reminiscent of how ADE singularities are set up, and it may be possible to understand the connection between those singularities and the phase transitions that govern this behavior. It is a clear example of how geometry can be used to explain the behavior of neural networks.
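The quantities in that plot, as reconstructed from the discussion (check the paper for the exact definitions), are roughly:

```latex
D^{*} \;=\; \frac{m}{\lVert W \rVert_F^{2}}
\quad \text{(dimensions per represented feature)},
\qquad \text{plotted against } \frac{1}{1 - S}
```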
all right I think we can get started so thanks Sam for uh for leading this and uh it's such short notice really so uh maybe you can start by introducing yourself briefly and uh explaining how you came to think about this and why you care okay yeah um okay um so okay I'm introducing myself yeah I'm Sam I'm an undergrad at Anu at the moment um yeah so why do I care oh how did I come to this particular piece of work so like really how did I come to be here is because um I did like the vacation scholarship thing at Uni meld um over the summer and this kind of thing was like the topic so um that's how I came to be here specifically why was I interested in this piece of work in particular is kind of um so uh before I came to UNI I was like interested in like the AI alignment stuff um and I came to UNI like in pursuit of that kind of and a course last semester which was um kind of like a bit of a project course on interpretability well it was on like that broadly and then what I ended up focusing on was like um yeah like interoperability work in the vein of like um like there's still the distilled threads basically if people have seen those um which was like feature yeah it was like um yeah it was trying to figure out like what specific parts of neural networks did like what does this neuron do or like what does this particular direction in this layer of the neuron represent um yeah and and this piece of work which I'm basically going to talk about this one um paper um this toy model is a superposition it's like kind of it it kind of picks up a little bit where that work left a loose end lying about so um yeah is that great is that an introduction thanks yeah okay yeah um okay so yeah okay maybe I can like segue so um yeah in that like previous work um where they're trying to like they have like some Network and maybe you know um just a sketch um so just the previous work distill was the name of the kind of Quasi Journal you're talking about a series of papers in that or is it there is a specific paper called distillation or something you just means the whole body of I mean the basic I mean the I guess I mean like the circuits threads sure yeah or or um yeah basically the body of work that is in distill which relates to interprebility um yeah so yeah one like loose end or like one thing that came up in that was that like hey we picked the neuron and we found an we found images which like maximized it or we optimized to find images to maximize some particular neuron within a
network uh and we got like weird things like I think their example is like they had a cat and a boss um so like why do we get cats and bosses both activating this neuron and like the kind of simple slash obvious answer is that like it's not individual neurons which make up meaningful representations it's like it's a direction within a layer um yeah but like I guess they weren't for this in that paper was like poly semanticity um which like is what kind of becomes superposition in this new paper like superposition is like the mechanistic like exploration of how paulicenticity works I guess is one way to talk about is one way like give a bit of like history to this paper um anyway uh yeah so I might actually start talking about the paper now um so uh like the overview is something like um overview so like uh yeah so this paper is like it's mostly an empirical like um exploration um exploration of like very simplified very neat um toy models trained on a toy data set um yeah okay um and like what they end up finding is uh some nice neat things like um yeah they vary this Criminal in of that data set this sparsity parameter and they find that like as they vary the sparsity parameter their model chooses like different strategies especially like yeah these won't mean anything right now but like they might come to me in something hopefully I'll explain um yeah so um I guess I'll begin by talking about the like the structure of the toy data set and then the structure of the toy model um and then the results in like what happened as you vary some parameters within the data set what happens to like the model that is that you get after training um yeah and then at the end maybe I'll try to like I don't have a great grasp of like technicalities of SLT or like the actual like um implementation or so with like a hand wavy um this is maybe an intuition that might not be misguided about like what might be going on with these in terms of the kind of conceptualization that is in some of the SLT discussion or phase transitions okay so okay describing the data set um okay um yeah so I guess like um maybe like maybe it's not a great idea um let's say um yeah so the task that we're gonna be training our model to do is to take is is basically Autumn to like be an auto encoder um to to compress and then decompress some like data set of vectors so like yeah where we'll say that like the original feature space is n dimensional and like the latent space or the hidden space will be M dimensional
um and like the task is to take yeah to take vectors from RM to encoded vectors in RM and back to RN on okay um and the specifics of what this like collection of vectors look like are important because like basically the experiments are varying this parameter which controls with the like the a property of the data set and seeing how the models change so um yeah so like the important notion is like sparsity uh yeah sparsity so for each of like the vectors in our data set uh we have this parameter mu which basically describes like the probability that any component of a of a data point is zero sorry Sam I think it's probably better to stick with the notation in the paper if that's not going to throw you off too much oh um what is the notation in the paper oh it's s but if you're working from notes and it's mu everywhere stick with mu but just for the just for the audience it's an i yeah SS is a good idea s makes a lot of sense I'm not sure why I wrote new um I think I might have written you in one of our meetings that might have been where that came from so maybe I'll I'll take the blame for that um yeah it's awesome so you said perfecta which isn't is that really true I mean it's like for you fix a sparsity and then I don't know yeah you sample uh yeah I guess maybe I'll say it like the data set is generated via yeah um yeah oh yeah so like um yeah so we have this parameter s which is between zero and one like yeah and then we generate our data set I'm going to call it Big X um so our data set like an element well yeah a point in our data set is um firstly like yeah um yeah a pointing out data set is uh an n-dimensional Vector where with probability s a particular entry is zeroed or is set to zero and then our entries which are which were not set to zero are drawn from like a uniform Distribution on zero one so um like um I guess maybe um it's a bit hard to write but I guess you can write X as a column Vector with X1 through x n in it and then you can say how the X I are generated right um have the x i generated yeah okay yeah um x to the column vector yeah so with each component um equal to zero with probability um with probability uh the s and um x i drawn from the uniform distribution um zero one otherwise oh wow yeah I think that's I think that makes sense I remember reading at least knows what I'm trying to get I'm getting thumbs up okay good good good good um okay so like um yeah so this like sparsity thing uh one like note is that like um at least in some ways it makes sense
for like lots of real data sets are like Barry's boss um so maybe like the example is like in like a text yeah in like a corpus of words in a corpus of sentences say like um if if this is not a like a concrete analogy or a concrete um anyway in like in a corpus of sentences if our features are like individual words like most sentences have zero instances of a particular word in them so maybe if like our data set is uh vectors where each entry in the vector counts the number of some particular word in that data points corresponding sentence like most of those vectors will have mostly zero entries um so I don't know this is just like yeah it kind of makes sense I think if you think about uh think about a network which is you know somewhere deep in the network layer 10 and then you think about the RN as being the vector of activations at that layer and then the map from that layer to the next layer suppose it's smaller we're assuming m is less than n um so we're sort of thinking about some synthetic version of the distribution over vectors in the tenth layer these RN these X's uh generated by passing some other input distribution maybe it's images through the network up to that point and so yeah if you if you accept that uh most like at that point in the network the activations kind of have semantic content then they will tend to have that kind of sparsity that you're describing and yeah I didn't I didn't get this motivational point from the introduction of the paper the first time I read it but so that's the kind of I kind of half believe it and I half believe that they just like this is the reverse engineering that gives you interesting stuff so this is the distribution they chose but I think it does seem fairly well motivated yeah um so this is yeah so this is sparsity um yes sorry uh same while I'm interrupting you can you move your character a little bit to the side so we can see the board yeah sorry um what was it um yeah okay so so our data set we have this parameter perhap this sparsity parameter s not mu um we also have so yeah so this is like another parameter basically that they use um they have like this notion of importance so like I guess this actually comes into play in the loss function we'll come back to that um but like they to in the in the loss function they train this their model on this data set um in the loss function they use they weight different features like differently um and and like they're part of that justification with that is because
like in Practical applications different features are more or less useful like more or less important to reconstruct something um or for some like future tasks it's more important or less important uh I kind of think that is less Justified than this Varsity reason and like more so I kind of believe that they want to like it gives like the um the resulting model like consistent order or a consistent ordering of vectors across like different trials and different sparsity settings so importance is another thing to be described um maybe it's worth noting in the in the paper they they start with an importance parameter per coordinate but later I think in all the experiments it's just the the first parameter and then the second parameter is you know like it's one the importance for the first coordinate and then it's K and then it's K squared and then it's K cubed I think something like that red so that the parameter is actually like the Decay rate something like that of how important each input is yeah yeah so they start with independent importance weightings and they move to like basically that yeah and just for the sake of connecting this to the phase transition so that they will actually draw phase diagrams where on the x-axis is something like this number and on the y-axis is is the sparsity or something like it so yeah yeah um yeah so uh yeah so importance comes into play yeah in in yeah in like whether a network is like basically incentivized to represent like lower important features at the cost of slightly yeah like the the it determines the balance of like I want you to grade the performance of encoding more important features to encode another somewhat less important feature um uh okay is there something I'm leaving out of the data set description I don't don't think so um okay and then their model is so I mean the model there the model that I'm going to talk about is um they have [Music] oops [Music] well I'm actually glad it was him because otherwise I was going to have to quit and rejoin I thought it was my audio issue so selfishly I'm pleased that it's not my problem we also experienced a rising blood pressure to carry back to health yeah that is disappearance okay well it's reconnecting I just want to ask what's the importance it wasn't clear to me in the paper that did the implementers just choose it or was it like determined by the model oh it's just a choice um I mean you could say they're attempting to have the this distribution match what sort of is reasonable out in
the wild in a large model but like what they think a feature should be or what makes sense to them well I mean yeah I guess Sam will maybe say a bit more about that when he writes down the loss function so yeah let's revisit that question welcome back Sam sorry for the instability oh it's probably me it's probably mine um yeah okay um yep the model yeah so they give um they give a linear model which is like to basically show that you don't get superposition I think when when you do I know you don't get Superstition if your model is linear but I think the reason they give one is to set some kind of Baseline I guess uh but uh yeah okay so um okay yeah so the first yeah the linear model is um it's like the the encoding part of it is just multiplication by some uh parameter Matrix um and like the decoding part is I think oh yeah there's a bias um yeah yeah um yeah so yeah basically um this W is like the projection from uh and to uh m um and like Columns of this w like is is the direction in the hidden space which corresponds to like uh yeah the basis direction of the feature space like a feature if that yeah just to be clear on the jargon so the RN is the space of features in their language and each independent direction is thought of as being our feature not only that but those directions are the basis in RN right yeah they're the elementary bases so each feature is like a one hot yeah yeah yeah yeah yeah um so like yeah um yeah um I think it's it's useful maybe now to like talk about like to like draw like um so this well this is like the linear model uh and then like the the non-linear model is basically the same thing except we use a railu at the very end so um yeah it's just uh uh encoding is the same as projection by this Matrix and then decoding is relu of the same um if we had so as an example like if we had um say like n equals uh yeah n equals four um and M equals two like this um is like the hidden the hidden space the Layton's face or yeah um and we can like plot the columns of w or was that right yeah we can plot The Columns of w in this space so every column of w is like a vector in RM is that right yes yes that's right um so like we might get something like this um where like yeah so what I'm saying is like um for for like for a latent space of like two or three like we could just visualize the white Matrix as yeah the directions in the latent space corresponding to um well The Columns of the parameter Matrix or like where a one hot gets sent basically yeah
um yeah um okay um yeah I'm always talking about things I'm supposed to speak about yeah yeah okay so yeah the loss function I think that weight loss function I should move out of the way the loss function uh so um yeah so this is this is their expression so some over the data set um that it's drained on over the features of each element of each data point of this importance importance per feature of like the squared squared difference of input and output which I will write like no no I like that um yeah so if we call x dash like the decoded encoded X um then yeah so um yeah it's basically it's like it's a pretty basic regular looking plus function except for this importance weighting um yeah which we have chatted a little bit about already uh yeah okay um yeah okay so I guess I like I'm gonna draw like their nice diagrams which like display the behavior that we get as we vary sparsity um so yeah so yeah so with with m equal to five no with n equal to five so yeah with a five dimensional feature space of our original data set and like a two-dimensional latent space um and then like as opacity which like sort of interrupts them but I might had to give a bit of motivation for this importance Factor so if we go back to the example of RN being the tenth layer of some confident or something um if you integrate over all the inputs those aren't the x's in this story right there they're somewhere before and the X's are the activations of the network on seeing those inputs then if many of the images in your data set have cats then at this point in the network this neuron should be active and somehow like you'll get a contribution to this loss every time you have a cat in an image because it somehow goes through this part and sort of accumulates a little bit of error for each one of those images so there's like an outer integral of our actual inputs and that sort of concentrates to give many instances of this Square I guess that's probably what they have in mind thank you um okay um yeah so uh yeah I'm I'm gonna draw like their nice diagrams um yeah from close to the beginning of the paper uh so when we when they had a sparsity of zero and they train that um the non-linear model from before um basically uh I don't have a good way to do this do I think it's something that looks like this which is um so they have like W1 is some is somewhere uh and I I think it doesn't matter um and W2 is somewhere else orthogonal to it um and then W3 4 and 5 are just sent to zero so like this is like
uh sparsity zero with busy zero a non-linear model does like the same thing that you would expect a living model to do which is just like pick the two most important um directions in a data set yeah it's like I mean this eye um yeah I mean I don't know if I said that beforehand but like the importance is like yeah it's a it gives an ordering yeah it like Anyway the importances are chosen or the features are chosen in an arrangement where like the importance decreases with like the feature number um so yeah like I'm not sure anyway this the non-linear model behaves just like a linear model when we have zero sparsity um [Music] yeah which is something um for like a reasonably or like a I guess a middle length sparsity um we got different Behavior oops so far through 0.8 uh we get this thing which two three four and W5 um yep I'm not sure um no I'll draw the last one which is maybe the most interesting one or I guess there are I guess this one is interesting um for an even higher sparsity like we get all of the features being represented um or being encoded in some way it looks like a fifth I'm not sure um yeah so and like these are all supposed to be the same angle um uh yeah so what is going on here um yeah like I guess the intuition is something like um with with greater sparsity you're more likely to not have to include encode different things at the same time so like for instance like in in this one you know we say we want to encode the vector like one um zero one zero zero uh is that too many no no um like this vector when you when you project it down via W just becomes zero because um like W1 and W3 are like directly opposing each other so then when you try to leave this space you just get like relu of the basis of the bias um I think uh yeah yeah WT plus b Ray Lou is just freely um so yeah like basically that basically that idea that like highest velocity means less likely to have to be to need to encode multiple things at once which might be interfering with each other and then this one like we have even higher sparsity and um yeah so we have things like do we if we had like hmm yeah I'm not sure I think the story I mean I feel like I feel this one I would say you're basically the same thing again about this one except that uh it can basically it can like fit even more things which have more potential to interfere with each other if they did happen to um coincide in the same individual data point yeah the story to be told is about the the place in between where it switches so
clearly you're paying some price for not being able to reconstruct W5 in the middle diagram but as you increase the sparsity it's sort of it's worth taking the risk of in of representing W5 because it's so rare that you'll have to recover a linear combination um and yeah even if you do it's it's very rare that you're going to actually have to represent the two opposing things that kind of cancel each other out in some sense so you might as well actually spend uh you know whatever the parameters to represent the uh the fifth thing um yeah so yes I guess like now we could I could say something about phase transitions which is like yeah um so like I mean yeah so yeah basically like I mean these are like this these are like discrete strategies like you like yeah like you were just saying like at some point it becomes a worthwhile bet to take the risk of hacking anything in um and like yeah yeah so um I mean we never get like well actually this is like maybe slightly not worthy um this is all with railu networks um with like a swish activation function instead um even with like a reasonably High gamma I might not write this down um like you I'll draw a little picture you kind of do you get like some amount of like non-discreteness I'm not really sure what's going on like you get um well let me find the picture um what do you mean by discreetness uh I guess I mean like we have like we have a thing which has four and the fifth one exactly at zero and and then we have a thing which has five and we have no like in between between them like we never get a model which is like which looks like them like starting to make space and this one's starting to leave oh I see I don't know it's just like I'm pretty sure that that's like obviously not a good strategy but like as like this is like where the hey this is this must be like some kind of phase transition going on I don't know anyway this is what like we don't get this with Ray Lewis but with some like trying to replicate this we do get slightly messy things like um got one that looks like this which is like yeah something that's like not quite a zero and a couple of other just to fill in some context for those who didn't supervise your vacation scholarship project uh so Sam was replicating the results in this paper and uh tried it with swish non-linearity which is an analytic approximation to really and there's a parameter there which is the gamma he mentioned earlier which if you make it large uh swish with that parameter is a very good approximation
to really use a large enough it will look like really but we were sort of curious how big it had to be before [Music] um um yeah okay so yeah um oh okay I'll first talk about like their phase transitions um and then try to talk about like I mean the same thing but I'll draw the diagram um so yep um yeah I'll just I'll just copy again so so yeah they have this plot um which is yeah um it has okay so on the x is like importance or relative importance importance so okay I'll be better um it's R2 to R1 I think all right yeah uh yeah yeah yeah um so yep and then so we have like we really only need like one importance waiting which is which is the importance of the second one because like yeah there's only two features so the only thing that really changes like cattle loss will behave is the relative importance of those two features so um yeah so over here on this side like um we have the like the second feature is only yeah it's it's a tenth of whoa it's a tenth oops there is an undo if you find that useful oh thank you um yes uh importance um yeah so I guess like on this side is the second feature is less important than the first in the middle is that they're equally important and on the right is that it's the second feature is more important um so yep and then the the y-axis is the sparsity oh yeah I guess I call it the future density uh I can just do something now it's varsity so yeah so um yeah so yeah they start with a sparsity of nope starting with positive zero zero quick at 0.5 9 um uh so um and yeah so so like yeah so this is like yeah two dimensions of hey we're going to change the the sparsity parameter in the data set that we're training on and we're going to change the weighting of the features the relative importance of the features that we're training on to get like a particular point in this plot um and then they look at like what kind of model resulted from the training on that data set with that loss and like they categorize the models which can come out of this basically um so like like the models we can kind of plot them like uh if we're going R2 or to R1 um maybe we can't we can't put them in the same way you can just give the names for the different kinds that they have I think they're pretty self-explanatory okay um Okay so not represented not represent is this what you're talking about these not represented yeah that's right yeah um dedicated dimensions and super possession this yeah so yeah so these are yeah these are in reference to the second feature
so like yeah so models yeah I'm gonna color in this whole plot um and like the spaces where it's white it means that the second feature was not represented At All by the model so the model just encoded the first feature um blue is that it gets its own Dimension and the first feature is completely ignored and red means that superposition happened um so okay and it looks like this um it's like yeah this section uh ignores the second feature this section ignores the first feature and all of this the stuff with high sparsity um puts them in supervisions uh and like okay yeah so that's like a description of the behavior across a variety of different settings um and it kind of says yeah if if you have one feature that's super like it's much much more important than the other then you'll get to position only with greater sparsity because it's like just not relevant to deal with that other one and same thing over here um whereas like if they're equally if they're more equally balanced you're more likely to get to position with still you know less varsity um the interesting thing through phase transitions is like we have these boundaries um like across where like so yeah so there yeah basically we have these boundaries um where where the like the optimal strategy shifts in a discrete way or like yeah in a discontinuous way or yeah um um yes yes okay um that might be all I say about that I think and then I'm not sure what else to say cool yeah I think that's great let me just check the time yeah we've got about 13 minutes left I'm in this it's a fairly long paper we can talk about many other aspects of it but maybe this is a good place to stop for this time um thanks Sam and uh yeah let's have some discussion or questions and uh I think we could certainly revisit this maybe somebody else could volunteer to speak about some of the remaining parts mm-hmm questions maybe I'll make a comment while you're collecting your thoughts um maybe it's not so much a comment as a a brag uh well so it's interesting to note here that the phase transition is associated to varying the true parameter or true distribution right so in SLT lingo and so the uh we're tuning two parameters that determine the distribution which is being matched by the model this is not the usual I mean it's to the extent there is a discussion of phase transitions in in SLT it's in a couple of watanabe's papers and his book and it's usually to do with varying the prior um this is the kind of phase transition that Liam was studying in his master's
thesis a similar kind of setup well not similar setup but the the true distribution was also the thing varying there and it's worth noting why I mean this looks weird from a statistical point of view perhaps because well the true distribution is true man it doesn't vary but if you keep in mind this analogy that I was making well that we were discussing earlier and which is put forward in the paper that we're thinking about these inputs as being activations in some deep layer of the network right so there it's not strange at all for the true distribution to change if you think about that happening over the course of training right so uh think about it as early on in training like the early parts of the network are doing something stupid and they deliver up to you the long-suffering neuron in layer 10 or 11. something or other and you're trying to you know fit to that and then later on in training maybe they're delivering you something rather relatively different maybe it's more sparse and it has a high clearer hierarchy of importance or something so a varying true distribution could be relevant if you think about it as being somehow separate out the time scale so the the later layers are maybe learning somehow much slower than the earlier layers or something like that so there's maybe scenarios in which a very true distribution is relevant if you're thinking about small parts of a model at least that's how I'm thinking about it right now yeah comments other questions for Sam uh baby not for Sam but uh do you see how this would sort of link to SLT in any way um just yeah like Liam's the assisted yeah I mean I think it's kind of one can formulate this in SLT in a straightforward way and then it's uh uh apart from the radio of course but that's kind of the point of trying out the swish thing um so model of that which I don't think is a big difficulty this is straightforward to formulating SLT and even you can kind of see how the explanation for these phase transitions should go this is speculative but um maybe Sam could you put up the they have these graphs of um like uh what is it representation Perfection or something uh yeah yeah what page this is on the different like shapes and colors as they are densities passing changed yeah uh I can I'm not sure where there's I can draw like I can draw one of the plots I made sure yeah yeah head in mind the the plot where on the vertical axis there's they call it m divided by the frobenius norm um like Dimensions per feature and then
on the x-axis is foreign divided by one minus s yeah yeah yeah I've got that up in front of me right now I guess it looks something like this I think yep yeah right so this this vertical axis which is the the frobenius norm squared and in their telling is kind of a count of how many features are actually represented um which seems reasonable um and then the x-axis as you go from left to right it means more spas hmm yeah so it's um I don't pretend to completely understand it but it's clearly the configuration of the eigenvalues of w is relevant here and that's this is actually very close to how you set up a mathematical study of um Ade singularities and so I I think it should be possible to understand the connection between those singularities and the the kinds of singularities that are governing these phase transitions at least that would be my guess I think it's probably possible with some work to to understand rather deeply what's going on here in the framework of SLT might be too optimistic but I think the questions comments hands put up for speaking about the rest of the paper no more thumbs up everybody's quiet Kenneth are you awake have you discovered your predicament yet oh hey I can use them I can use the blocks now yeah everybody can they're available on mobile yeah I think this paper is really important uh and interesting uh for several reasons one it's just cool to see these kind of geometric forms appear as I said it seems uh pretty likely that this is actually a very standard appearance of these kinds of configurations to do with um things like the Makai correspondence and configurations of eigenvectors so maybe it's not astonishing but I think it's a very beautiful toy model for understanding the relationship between the emergence of different behaviors in small neural networks and geometry and it's uh yeah doing it it was really cool yeah yeah sorry and not obvious at all I mean it seems simple the model they came up with but uh yeah I wouldn't have come up with that so I'm very grateful there they um found such a clear test case I I'm happy to tentatively maybe say I can give presenting on this since I awesome but uh what I heard on about the other confidence I haven't actually finished the paper yet I have confidence yeah all right yeah anybody else want to comment on the paper as I said I'm sorry go ahead Russell no no this is really not worth worth it but I I did I I took um the call to action to heart and I've managed somehow to find the time to read the whole thing