WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed. The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.

Synopsis


This video explains how networks learn features of the world and how those features can be understood. It discusses the use of dimensionality and sparsity to measure the geometry of the network, and how principal component analysis can be used to represent high-dimensional data. It also introduces the Thomson problem and the Johnson-Lindenstrauss lemma, and shows how varying the sparsity of features influences the pentagon-digon phase change. The video ends by showing how features approach finite dimensionalities over training and arrange themselves into an octahedron.

Short Summary


Network learning is the idea that features of the world can be understood through individual neurons. A matrix acts on the input, and a loss function is used to check whether the network is learning the features. The authors use a metric called dimensionality to measure the geometry of the network: a ratio of the number of hidden dimensions to the number of features the model has learned. Experiments show that the dimensions per feature tend to sit around the one-half mark for a while and otherwise change quickly. The sparsity takes discrete values between roughly 1% and 90%. The graph also shows that one feature can be embedded into one dimension.
The paper embeds 30 features into 30 dimensions when the sparsity is 0%. As the sparsity increases, more features than there are hidden dimensions can be embedded. For each feature a per-feature dimensionality is computed and plotted. In the accompanying pictures, dots that are connected by a line have non-zero dot products, with closely connected dots having high dot products and more distant ones lower dot products. The graph shows that as the sparsity increases, more features are embedded, and each feature's dimensionality is either zero (not embedded) or one of a small set of preferred values rather than varying continuously.
Principal component analysis is a technique used to represent high-dimensional data in a two-dimensional space, apparently by measuring the radius from zero and plotting the angle. Different levels of sparsity produce different configurations, such as tetrahedra, digons, triangles and bagel-like rings. The paper also draws a 3-by-3 matrix as "columns of W" and a 4-by-4 matrix as a square. The lines between the features represent the dot products between the corresponding vectors. This gives a visual way to represent the learned geometry.
The Thomson problem is a famous unsolved problem in physics and mathematics: place repelling electrons on a sphere so that the potential energy is minimized. Exact solutions are known only for up to 12 points, with computer-aided searches producing solutions for only a few hundred points. The paper connects this problem to its reconstruction problem, so that there are two minimization problems side by side, although it provides little mathematical evidence for the connection beyond the similarity of the shapes. The Johnson-Lindenstrauss lemma is also introduced, giving a bound on how many features can feasibly be embedded into a given number of dimensions.
The Johnson-Lindenstrauss lemma states that a set of points in a high-dimensional space can be projected into a lower-dimensional space while preserving pairwise distances up to a factor of one plus or minus epsilon. A linear map can be used for the projection, and for a random choice of projection the probability that the ratio of the projected distance to the original distance for a given pair falls outside that range is bounded by 2/n^2, where n is the number of points, so the chance of failure for any given pair shrinks as n grows.
In this section, the model is trained with five features and two hidden dimensions. The importance of every feature is one, and instead the sparsity of one feature is varied while the others are held fixed, a setting known as non-uniform superposition. The lemma is invoked only as a general point: if some error is allowed, more nearly orthogonal vectors can be packed than the dimension count suggests, which is illustrated with a 1D line, a 2D circle and a 3D sphere. As the dimension increases, there are many more directions nearly orthogonal to a given vector.
Researchers studied the effect of varying the sparsity of features on the pentagon-digon phase change. When the sparsity of a single feature was varied, the solution switched from the digon regime to the pentagon regime once that sparsity crossed a certain value. Pairs of features were also studied, linked so that they are zero or non-zero together, with the embedding dimension set to two. A picture is drawn to indicate correlated features, and training of a model is discussed. Two graphs illustrate how features approach finite dimensionalities over training and arrange themselves into an octahedron. As the features approach more stable solutions, there are drastic drops in the loss.

Long Summary


Network learning is the idea that features of the world can be understood through individual neurons. In this paper, features are represented as vectors in an n-dimensional space. A matrix W maps the input of the network down to an m-dimensional hidden layer, and its transpose (followed by a bias and a ReLU) maps it back up to produce the output. A loss function comparing the input to its reconstruction, weighted by a per-feature importance, is used to check whether the network is learning the features. The experiments also use sparsity as a parameter: each coordinate of a feature vector is equal to zero with a given probability, and is otherwise drawn from a uniform distribution between 0 and 1.
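A minimal sketch of this setup in PyTorch follows; the names, sizes and structure here are illustrative choices rather than the paper's actual notebook code.

    import torch

    n, m = 400, 30                     # number of features, hidden dimensions
    W = torch.nn.Parameter(torch.randn(m, n) * 0.1)
    b = torch.nn.Parameter(torch.zeros(n))

    def forward(x):
        # reconstruct the input: x' = ReLU(W^T W x + b)
        return torch.relu(x @ W.T @ W + b)

    def sample_batch(batch_size, sparsity):
        # each coordinate is zero with probability `sparsity`,
        # otherwise drawn uniformly from [0, 1]
        x = torch.rand(batch_size, n)
        mask = torch.rand(batch_size, n) < sparsity
        return x.masked_fill(mask, 0.0)

    def loss(x, x_prime, importance):
        # importance-weighted mean squared reconstruction error
        return (importance * (x - x_prime) ** 2).sum(dim=-1).mean()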
The authors use a metric called dimensionality to measure the geometry of the network: the ratio of the number of hidden dimensions m to the squared Frobenius norm of W. The parameter matrix W is m by n, where n is the number of features, and its squared Frobenius norm decomposes as a sum of the squared norms of its columns. The norm of column W_i is roughly one if the model has learned feature i and roughly zero if it has not, so the denominator roughly counts the number of learned features.
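A sketch of that metric under the same assumptions as the previous snippet (the function name is my own):

    def dimensions_per_feature(W):
        # m divided by the squared Frobenius norm of W; since each learned
        # column has norm close to 1, the denominator roughly counts how
        # many features the model has learned
        m = W.shape[0]
        return m / (W ** 2).sum()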
Researchers plotted this quantity as a function of sparsity, for n features embedded into m dimensions where m is much smaller than n. The graph shows that the dimensions per feature tend to sit around the one-half mark for quite a while and otherwise change quickly between "sticky points"; even the authors suspect this is partly a quirk of the model they chose. A second graph shows the sparsity taking 30 discrete values between roughly 1% and 90%. The top of that graph shows the dimensionality of each feature, with the pink line at the start corresponding to one feature embedded into one dimension.
The paper embeds 30 features into 30 dimensions when the sparsity is 0%, so every feature gets its own dedicated dimension. As the sparsity increases, more features are embedded than there are hidden dimensions (roughly 29 features into 20 dimensions in the notebook's first non-trivial picture). The features are shown as coloured dots in a low-dimensional projection. If two dots are connected by a line, their dot product is non-zero, with closely connected dots having high dot products and more distant ones lower dot products; if they are not connected, their dot product is zero.
As the sparsity increases, more and more features can be embedded into the fixed number of hidden dimensions. For each feature a per-feature dimensionality is computed: the squared norm of its embedding vector divided by the sum of its squared dot products with all of the embedding vectors. The resulting graph shows that at each sparsity level a feature is either not embedded at all (dimensionality zero) or embedded at one of a small set of preferred dimensionalities, rather than varying continuously.
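A sketch of that per-feature dimensionality, continuing the earlier PyTorch assumptions (the small constant guards against dividing by zero for unlearned features):

    def feature_dimensionality(W):
        # ||W_i||^2 divided by the sum over j of (W_hat_i . W_j)^2,
        # where W_hat_i is the unit vector along column W_i
        norms_sq = (W ** 2).sum(dim=0)                 # ||W_i||^2, shape (n,)
        W_hat = W / (norms_sq.sqrt() + 1e-12)          # unit-normalised columns
        overlaps = (W_hat.T @ W) ** 2                  # squared pairwise dot products
        return norms_sq / overlaps.sum(dim=1)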
Because the inputs lie between 0 and 1 and the output passes through a ReLU, the model is willing to put two features on the same axis of a local two-dimensional subspace. Such a pair forms a digon: the two features are antipodal and have a strong dot product. If one of the features is non-zero, the other is very likely to be zero, so the model effectively gets orthogonality on the cheap. The digon solution dominates when the sparsity is between roughly 25% and 75%, the range in which the model is comfortable taking the risk that the other end of the digon pair will be zero.
Principal component analysis appears to be used to give a visual representation of the learned geometry. Shapes such as tetrahedra, digons, triangles and bagel-like rings appear at different levels of sparsity. At one point, a pentagon of five pentagons is the only recognisable structure, with a mess of points in the middle; it is unclear whether that central mess has any interesting geometric structure, and the paper does not explain exactly how these pictures are produced.
The text discusses a technique for representing high-dimensional matrices in 2D, apparently by measuring the radius from zero and plotting the angle. The shapes on the far left of the figure do seem to be genuinely spread out, which a pure projection artefact would not explain. Later in the paper, the authors show the matrix W^T W for the case n equals three and draw it as "columns of W"; a 4-by-4 matrix is similarly drawn as a square. The lines between the features are believed to represent the dot products between the corresponding vectors. This provides a visual way to represent high-dimensional matrices in 2D.
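Since the authors' exact plotting method is unclear, the following is only one plausible reconstruction, not the paper's code: project the columns of W onto their top two principal directions and keep the full dot-product matrix for drawing the connecting lines.

    def project_columns_2d(W):
        # rows of `cols` are the feature embedding vectors W_i
        cols = W.T                                     # shape (n, m)
        centred = cols - cols.mean(dim=0)
        U, S, Vh = torch.linalg.svd(centred, full_matrices=False)
        coords = centred @ Vh[:2].T                    # 2D coordinates for plotting
        dots = cols @ cols.T                           # pairwise dot products -> line weights
        return coords, dots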
The Thomson problem is a famous problem in physics and maths: place repelling electrons on a sphere so that the potential energy is minimized. For two points, the solution is to place them at the north and south poles, or any rotation of that configuration. For three points, an equilateral triangle around the sphere works. Exact solutions are known only for up to 12 points, and computer-aided searches have produced solutions only up to around 400 points. In the toy model, increasing sparsity first produces a bagel-like ring in the middle together with a few nearly orthogonal shapes that are not connected, before the configuration quickly changes into a pentagon of pentagons.
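For concreteness, here is a small numerical-search sketch of the Thomson problem as stated above: gradient descent on a Coulomb-style energy. This only finds approximate configurations and is an illustration, not the paper's method.

    def thomson_energy(points):
        # potential energy of unit charges on the sphere: sum over pairs of 1/r
        diffs = points[:, None, :] - points[None, :, :]
        dists = diffs.norm(dim=-1)
        i, j = torch.triu_indices(len(points), len(points), offset=1)
        return (1.0 / dists[i, j]).sum()

    def search_thomson(n_points, steps=2000, lr=0.01):
        u = torch.randn(n_points, 3, requires_grad=True)
        opt = torch.optim.Adam([u], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            pts = u / u.norm(dim=-1, keepdim=True)     # constrain points to the unit sphere
            thomson_energy(pts).backward()
            opt.step()
        return (u / u.norm(dim=-1, keepdim=True)).detach()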
The speaker discussed the Thomson problem, a famous unsolved maths problem. The paper connects it to the reconstruction problem, so that there are two minimization problems side by side, but it provides little mathematical evidence that the two are connected beyond the observation that the shapes of the solutions look similar. The speaker also introduced the Johnson-Lindenstrauss lemma, which gives a bound on how many features can feasibly be embedded into a given number of dimensions.
The Johnson-Lindenstrauss lemma states that a set of points in a high-dimensional space can be mapped into a lower-dimensional space while preserving pairwise distances up to a factor of one plus or minus epsilon. For a random projection, the probability that a given pair of points is distorted beyond this factor is of order one over n squared, where n is the number of points. The map used for the embedding is at least Lipschitz and can even be taken to be an orthogonal projection; in particular, a linear map suffices.
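A small empirical check of the lemma, assuming a random Gaussian projection (a standard choice in the probabilistic proof, though the lemma itself only asserts the existence of some suitable map):

    import numpy as np

    def jl_distortion_count(points, k, eps, seed=0):
        # project onto k dimensions with a scaled Gaussian matrix and count
        # how many pairwise distances are distorted by more than a factor of eps
        rng = np.random.default_rng(seed)
        d = points.shape[1]
        P = rng.normal(size=(d, k)) / np.sqrt(k)       # scaling so E||xP||^2 = ||x||^2
        proj = points @ P
        bad = 0
        for a in range(len(points)):
            for b in range(a + 1, len(points)):
                orig = np.linalg.norm(points[a] - points[b])
                new = np.linalg.norm(proj[a] - proj[b])
                if not (1 - eps) * orig <= new <= (1 + eps) * orig:
                    bad += 1
        return bad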
In the proof, the n points are projected onto a random k-dimensional subspace. A bound is assumed which states that if beta is less than one, then the probability that a quantity L is at most beta times k/d times its unprojected counterpart is bounded above by an explicit exponential term. Applying this bound, the 1 - epsilon side of the inequality is proved by using the assumed lower bound on k, multiplying both sides by a negative number (flipping the inequality), simplifying, and then exponentiating; the result is bounded above by 1/n^2.
A random linear map is then used to project the set of points into the lower-dimensional space. For a given pair of points, the probability that the ratio of the projected distance to the original distance falls outside the range from 1 - epsilon to 1 + epsilon is bounded by 2/n^2, so this failure probability shrinks as the number of points n grows.
Taking a union bound over all n choose 2 pairs of points shows that with positive probability every pairwise distance is preserved within the desired range, so a suitable map exists; for any single pair the success probability 1 - 2/n^2 is close to one when n is large. The geometric intuition is illustrated with a 1D line, a 2D circle and a 3D sphere: the set of directions nearly orthogonal to a given vector is empty on the line, a small pair of arcs on the circle, and a whole equator on the sphere, so higher dimensions offer many more nearly orthogonal directions to choose from.
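That "equator" picture can be checked numerically; this sketch estimates the fraction of random unit-vector pairs that are nearly orthogonal in a given dimension (the threshold eps is an arbitrary illustrative choice):

    import numpy as np

    def near_orthogonal_fraction(dim, eps=0.1, samples=100_000, seed=0):
        # fraction of random unit-vector pairs whose dot product is within eps of zero
        rng = np.random.default_rng(seed)
        u = rng.normal(size=(samples, dim))
        v = rng.normal(size=(samples, dim))
        u /= np.linalg.norm(u, axis=1, keepdims=True)
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        dots = np.einsum('ij,ij->i', u, v)
        return np.mean(np.abs(dots) < eps)

    # e.g. near_orthogonal_fraction(2) is small, while near_orthogonal_fraction(1000) is close to 1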
In this section, the model is trained with five features and two hidden dimensions. The importance of every feature is one, and instead the sparsity of a single feature is varied while the others are held fixed; the paper calls this non-uniform superposition. The Johnson-Lindenstrauss lemma enters only as a general point, not through its specific constants: if some error in orthogonality is allowed, far more vectors can be packed into a space than the number of dimensions suggests, especially in the high-dimensional spaces these models work in.
Researchers studied the effect of varying sparsity on the pentagon-digon phase change. When the sparsity of a single feature was varied, the solution switched from the digon regime to the pentagon regime once that feature's sparsity crossed a certain value. They then varied the sparsity of pairs of features, linked so that the pair is zero or non-zero together, with the embedding dimension set to two in order to visualise the results.
A picture is drawn to indicate correlated features, with X1 and X2 correlated and X3 and X4 correlated. The model prefers to embed correlated features orthogonally to each other, so X1 is orthogonal to X2 and X3 is orthogonal to X4, since features that occur together need to be distinguishable. Anti-correlated features are also discussed: if one is non-zero then the other must be zero, and the model prefers to place such a pair in the same plane, since they never occur together. Training of a model is then discussed, with each colour representing a specific feature and the x-axis showing the length of training.
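A sketch of how such correlated and anti-correlated pairs could be sampled, under my reading of the transcript; the exact sampling scheme for the anti-correlated case is an assumption, not something stated in the paper.

    import numpy as np

    def sample_correlated_pair(batch, sparsity, seed=0):
        # both coordinates are zero together with probability `sparsity`,
        # otherwise both are drawn independently from U[0, 1]
        rng = np.random.default_rng(seed)
        active = rng.random(batch) >= sparsity
        return rng.random((batch, 2)) * active[:, None]

    def sample_anticorrelated_pair(batch, sparsity, seed=0):
        # at most one of the two coordinates is non-zero in any sample
        rng = np.random.default_rng(seed)
        active = rng.random(batch) >= sparsity
        which = rng.integers(0, 2, size=batch)         # which coordinate fires
        x = np.zeros((batch, 2))
        x[np.arange(batch), which] = rng.random(batch) * active
        return x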
Two graphs are used to illustrate how features settle into discrete dimensionalities. In the first, each feature's dimensionality approaches a finite value over training, with some features going to zero (not learned) and others to one half (the digon solution). In the second, correlated features first move away from each other and then spread out until they form an octahedron; the top row appears to be a top-down view of the same configuration shown in the bottom row. As the features approach these more stable solutions, there are drastic drops in the loss.
The discussion closes with the loss curves, where circles mark regions of rapid decrease, presumably regions of high slope. It is tempting to call these drops phase transitions, and the transitions between the qualitatively different solutions in the first two rows of the figure make that reading more plausible, but on the basis of the training-steps-versus-loss graph alone the case is hard to make; corroborating evidence from other kinds of graphs would be needed.

Raw Transcript


all right let's do this uh I'm kind of assuming most people have either watched part one or read the paper but knowing me if I was attending this seminar I might not have done that so I'll do a quick brutal introduction so that you can hopefully have some ideas as to what's going on Okay so where considering a idea in that kind of network learn certain features about the world where features are discernible objects that we humans can understand for example can an image classifier uh have a neuron that will fire when it sees a cat and vice versa if they see the cat neuron file will we be able to realize that the neuron has witnessed a cat in its image classification so this is more or less the idea behind this paper is trying to align neurons firing with discernible features thank you the interesting idea they had was to consider features as vectors in a dimensional space so let X be a feature which will be an r-dimensional Factor sorry an R to the N dimensional vector then what they do is they try to see if they can in a sense have this input of the network and then sort of act on it with this w so now this here will be W Times X inside this layer where we have M neurons here and N neurons in the input where m is much less than n so I think in the experiments n is roughly 400 and M is roughly 30. and then they act on this oops with a another Matrix we transpose so then what we've got here is W transpose WX and then they act with a value to get the final output so the output of the network is value W transpose WX and I think and they had a bias in here sorry and then the idea is if we call this x Prime is the loss function that they use to check if the network is in a sense of learning these features is they are taking the loss of the input Vector X and the output factor x Prime which is the value Network so it's also worth mentioning that their experiments use sparsity as a parameter of these networks so we're going to say um each x i which is a feature will be drawn from a uniform distribution between zero and one and it will and each Theta inside the vector will be equal to zero with probability so oh dear sorry the lights are turning off from the 700 room in hang on one sec they're evicting you all right I'm actually hiding the design department so I feel like if someone discovers me talking about Maps still get upset anyway it's the true Design This statement's a bit it's a bit confusing right I mean you've given two sorry so you mean that somehow with probability SI at zero and
if it's not zero then you sample it from a uniform distribution right yes sorry that's exactly what it is yeah yeah so we're next time not zero it will be taken by this random distribution between zero and one and then the last function that they use in these experiments to assess whether the network has in a sense learned these features is given by this which is a mean squared errors loss and this is something they call the importance and this idea comes from say if you have a image classifier that's trying to diagnose an illness for example it might consider features about their illness to be very important whereas features such as has the person who's got freckles on their hands to be quite unimportant and so this can vary with each feature in most of the experiments though they decided to be one so I think for most of the stuff I'm going to be talking about today we can just take the importance to be one and then we also will sum over the feet the uh different distributions of X so that we can have make some experiments to look at okay that's kind of like the brief intro to what Sam talked about now last time sorry is there any questions before move on so the main section that I want to talk about today is the geometry of superposition section and actually dang can I get the image zero on the board next to where I wrote geometry I think you might just gotten back foreign on the next board please okay um so one metric that the authors used to look at the geometry of these network is something that's called dimensionality and so I write that down first sorry man turns out to be extremely hard to see you might click on it if you want to see it better oh yeah okay so from the previous board we have W here is the parameters in the Network and F is the provinious norm and this is the number of uh nodes in the middle or the dimension so sorry I probably should say w is going to be a m by n Matrix and you can actually decompose the convenience Norm into being a sum of The Columns of w so we'll have n columns in this n by n Matrix and so what this dimensionality is telling us is it's kind of a ratio over the features kind of like the dimensions per feature since we have dimensions of the embedded embedding space here kind of impossible to tell who will win and he will lose but it's possible sorry I must have been on a digital device yeah so it's worth noting as well I have roughly that wi will be approximately equal to one if a model has learned feature I and it'll be roughly equal to zero if
the model has not learned it so in a sense this Dominator is kind of like the number of features that it has learned and sorry that should be an N up there because this is the sum of the features n ope so we're trying to embed n features into M Dimensions where m is much lower than n okay so what they did was they plotted this more or less as a function of the sparsity on the right hand side so if you're going to tap on that now to have a better look they do some rescaling they use a log scale and they instead of having uh s they have one on one minus s after playing around with it a bit is because it makes it much more easy to look at the graph it's quite exponential otherwise um but the thing that's fascinating is that the dimensions per feature likes to sit around this one half mark for quite a while it's more clear in the next graph um and apart from that it sort of will change quite quickly so they call this a bit of a phase change between these different uh sticky points as they call them I'm not sure if I'd I find this picture I'm being a bit unconvincing that it's a true phase change but it's still interesting especially this one half Dimension seems to stick around for quite a while in the next in um the dimensionality graph but even as the authors admit they believe that this is much more like a Quirk of the model they've picked and I think I agree with them because I'll get to it in a second once we talk about the next picture so this picture I stand up for a very long time uh this is my crazy one yeah multiple different covers and things so I'm gonna talk it through I also read through the notebook that produced this graph uh so I think I have some understanding of what's going on except for these like shape diagrams I've done at the top I'll give my interpretation of what they mean but I can't find any definitive way the authors created these graphs all right so let's start with the x-axis so you might notice that there's clumps of uh points so for example the bottom left here I'm going to Circle it I think everyone should probably click on the image for this section sorry so you can see what I'm doing so there is 30 discrete intervals where they change the sparsity so the sparsity takes discrete values uh between I think it's 30 values between roughly one and roughly 90 percent so it goes from like not as fast at all to very sparse and then the dimensionality at the top we can see let's look at the pink line for a second uh you can embed one feature in one
dimension that's just the point and that's what happens for the first sticky point because this leads to that being very low dot products between the features because if you have uh two features like many features in a low dimensional space the dot products will be quite large typically which is what we see happening over here so to be clear let's look at the pictures at the top now the ones up here uh wait before you do on the left hand side I'm confused so one feature in one dimension but we have way more features than dimensions so but that's just what we're doing right it's sort of like picks for each feature yep there's some some overlaps but that's fine that's what's happening there is that right more or less yeah so I think so the features was 400 and the dimensions was it was 30 in the paper and then it was 20 in The Notebook so I'm gonna say it's 30. um so in the first in this first section here they just embed 30 features in 30 dimensions so this is just like you know perfect one every feature gets its own dedicated Dimension there's no inference and that's great uh this is in the very dense region where this Varsity is you know zero s equals zero down here so extremely dense H3 to get its own dimension then as we move along here we'll see the first picture they've picked here is when the scarcity is roughly about 25 to 30 percent and now we actually managed to get oh sorry let's do so there's 20 dimensions and there's roughly 29 features there is 29 features in this one so now I've managed to fit in nine more features than there are dimensions features is varying what I thought I mean I thought it was so we're embedding we'll fix many more features are you saying that like the number of features we succeed in embedding or you mean the number of features yes the number of features we have is 400 when the smart city is zero we embed 20 features that's the bank line at the very start when we increase the sparsity a bit we managed to embed nine more features but this resolves this results instead of all the features being their own point in space we have these shapes now what these what these shapes mean so pick a color dot if it is not connected to another color dot that means that that dot products between those two vectors is zero if it is connected to another colored dot then that is it's a DOT product proportional to its dot product so features that are these dots that are very close together so in these like Diagon ones these have very high dot products and then as you
can see over here when we embed more features they're more spread out and the dot products are lower so we managed to embed 29 features in this first picture and then as it goes on it gets to roughly 60 features here and then 80 and then I think I stopped counting this one it's roughly about 100 in there and so as the sparsity increases we manage to embed more and more features than we have dimensions and at each it seems the funny thing is though is that instead of continuously changing like the predicted dimensionality graph so it's clear that for each Varsity interval uh the dimensionality falls into either being zero down here or it picks a very finite uh dimensionality and the one that seems to like the best the Diagon zero down where at the bottom of the graph sorry the dimensionality is zero down here and it's one up here and as you can see the features are either embedded at the dimensionality that is along here or they're not embedded at all and then they're just left at the bottom oh yeah I'm confused so there's there's an so this number that you're putting along the top like 29 60 and so on that's the number of features that it's learning right can we can we not just can we not say number of features because that should mean 400 like the number of uh like reconstructed features or something yeah okay okay but then on the right hand side this ratio is uh number of dimensions but what is that what what are the what are the numerator and denominator here I mean it's the denominator the number of reconstructed features or is it must be is that right oh my God I'm so sorry I forgot to explain the ratio here is different to the previous one but it's completely my bad okay yep that actually makes sense uh yeah so this is still dimensionality except now it's slightly different than [Applause] the previous one all right so before so his ratio was the total number of Dimensions full stop I mean the features divided by the number sort of successfully in that well oh yeah okay number of embedded features I guess yeah no no so m is m is not the number of features m is the number of hidden Dimensions yeah yeah Amazon Dimensions so I'm so sorry about that everyone that was quite that's quite embarrassing so what's happening now is that the feature dimensionality is this new formula I've written here right so if we embed a dimension remember it's roughly equal to one and then the denominator now is how much that's made up of other features so we've picked the dimension I
pick the feature x i that will be Associated that will have the direction in the n-dimensional space wo and then we can take the dot product with its you know uh basis vectors wi with every other feature and then we can see how much that feature is made up of other features in other directions yeah yeah exactly it's going to be a picture would help you so say if you've got this feature W2 up here then this would have been made from potentially like other features like this one this one percentage as you project it onto each axis okay so back to the image so sorry Daniel what was your question again uh yeah I think I'm happy now okay so now the reason why I think I mean this is conjecture but the reason I think that this one half Diagon pair is bless you it's quite popular is these if you look if you look back at the value model across here we can see that it can't actually be negative because the X i's are taken from zero and one and the value is also taken between zero and one since there's no like constants on the outside and so the authors think uh numbers perhaps the model's not scared in a sense to put two features in the same axes in the Subspace so all those digons for example will look say this is some local Subspace uh two inside R to the N which is after the 30. then this Diagon pair we can see if we look at this graph again that the two features that are in this pair will be a completely opposite each other so they have very strong dot products you know almost overlapping but they're antipodal and I think this is because the model seems to think that it's quite safe to take the risk that it won't have to make X1 it won't have to make one end of the Diagon if it has to make the other so as the as that's why we sum over the x's and the loss functions so as the model like does this many times and it thinks okay if I have spot probability s probability of this Varsity yeah sorry if x i equals zero with the probability being this Varsity then if we look over at this graph over here you know the uh pick a Diagon pair there's like 30 Diagon pairs in that in this yellow ring if one of these antipodal pairs is zero well if one of them is non-zero there's a very good chance that the other one probably is zero and so the model feels quite comfortable putting these things into Diagon pairs because one of the benefits of doing this is that it gets a lot of orthogonality so you can see that when the sparsity is between 25 down here and up to 75 over here this whole region it likes to
use this Diagon pair which is quite a long time comparatively to the other shapes it takes and my guess is that because you get a lot of nice orthogonality in this case most of like every second pairs or 12 at all and when they're not they are in the same Dimension which is not good but the tears you've got to represent both features simultaneously I think because the value output is always positive is quite low I could be wrong about that that's just my interpretation so when they show these um shapes over on the right they're showing so it's not I mean to the kind of configurations that appear but at any given sparsity many of them might appear red um so it's what does it actually mean to color it in like that's the predominant so they color it in if it's the predominant one here so it seems at this Varsity range it likes to mostly be you know these nice tetrahedrons uh at this point here we have you know some some diagons some triangles and then for this big region it's mostly just diagons until you know we get to over here or we start getting this has the most shapes up here and then as the dimensionality trends low to these like crazy weird Bagel shapes uh this is gonna try to learn a lot of Dimension sorry a lot of embedded features and it's because it's because now the sparsity is so high it's pretty happy to let most features that are embedded have a non-zero dot product with other ones so and that's why you get this Bagel shape so we take the one that you've labeled 80 above which is where the orange Pentagon line meets the the Curve so if I look at that picture I guess I see pentagons those are the orbiting things there's like a pentagon of pentagons um yeah but that then there's like this enormous mess in the middle so am I correct in interpreting this as like the only recognizable shapes are pentagons but there's only five of them and then there's a holy mass of stuff in the middle but that has no geometric like nothing interesting to say about it or is that secretly a whole bunch of pentagons in there or something I mean it was just a mess five pentagons then I don't um I also had this question when reading this and I I scoured the notebook and the comments and everything I could find I haven't watched a few YouTube videos to see how are they making these pictures like what is their method uh and I could find nothing maybe I'm I did the most racist I can if anyone's found it please let me know my interpretation though is they're doing some kind of like principal
component analysis or this technique where you can represent High dimensional matrices into all vectors sorry into like 2D space so my guess is that maybe it's just measuring the radius from zero and it's just like plotting the angle away and so maybe that big mess in the middle is they're not actually that like those those features that are embedded there don't actually have super high dot products they just look like that because they've been projected but then that doesn't explain the nice pictures in the far left because they seem to be further away are you saying you think that so if I could see in 20 dimensional space then I I'd see this beautiful configuration of little pentagons I think yeah that's what I I'd say sorry sorry sorry but each I thought so each point in this represents a vector right but it isn't actually the vector yeah and then the lines are just the dot product represent the dot products between each of the vectors so this doesn't need to be a projection right for that to make that plot this is just saying that there's some sort of you know 25 kind of special features and then a whole Clump which are really strongly overlapping with each other right yeah it most certainly is a map or some kind of principled projection from 20 down to two so it's likely that the ones that are in I mean the pentagons they're not sort of distributed far away from each other right so things that are close together in this picture probably are relatively similar vectors According to some kind of so the fact they all Clump at the middle it's probably not completely meaningless yeah I think they also have this picture they draw I'll make it here for a second where later in the paper they showed uh wtw as a matrix and this is the case where n equals three and sorry and there's and it's still large and equals three now and then this Matrix looks like one negative a half negative a half and this is the message they have and then they Express this as like columns this is like a 2d shape like this and they call this shape Columns of w and then they do the same thing with like a 4D Matrix and draw a square and to me it's not super clearly how they're doing this but I assume it's the same method that they're using in this dimensionality graph um I think you're right they're wrong that it's like the the dot products are basically what's the the line between the the features are um it is strange to me though that like it is instrumental oh sorry no sorry I wasn't saying anything
I heard my advice um yeah and so it seems like if it has a sparsity if it was getting more and more fine then the model would create this like huge Bagel of as many dimensions as many features as it could in bed all with probably quite non-zeroed up products yeah but that doesn't really explain the I mean that that makes sense but it's why have a pentagon of pentagons that requires a slightly more complicated explanation yeah I think the the change from the Diagon era or the Diagon solution section in the middle to this one here where we have you know the The Bagel in the middle and then a few quite nicely orthogonal shapes that aren't connected and then it seems to quite quickly but change into the shape you're talking about here and even though the importance is all constant here so it should embed each feature with the same importance so there's no reason why it should prefer these colored shapes more than the ones that are in the bagel all right um yeah is there anything more we want to say about this picture so now what the authors do which I will talk about next is they connect this to a famous problem in physics and maths called the Thompson problem so I will present that and then we will discover that this thing is actually very hard to solve unfortunately so let me present the problem and then I'll connect it to the the graph what's going on here so now we're gonna let uh potion use x because it's going to be confusing let's let UI be a point on this on the sphere and we're going to let W be the potential energy so that's going to be w is equal to the sum over these points on the sphere so typically this is thought of as these points on the Zero electrons and you've got some sphere and you're trying to find the way that you can place all these repelling electrons on the sphere such that the energy is minimal so you want to spread them out across this sphere so that they they yeah minimize this function w so to be clear foreign some pretty clear solutions for very few points so if you only have two points clearly just putting them on the Northern South Pole or any like rotation or affection symmetry like that will work and then if you have three points you can put them around you know some kind of like equilateral triangle around the sphere but this problem gets very hard very quickly uh I think there's only been Exact Solutions there's only been Exact Solutions found for this problem up to n equals 12. and computer aided search algorithms that I think I only got up to 400 Maybe
um roughly that that's from my research of how many have been solved so this problem is quite hard it's actually one of the like most famous one of like the 18 famous math problems that are unsolved I think you win a bunch of money if you solve it uh so that's a bit of a shame considering that if you look at the dimensionality graph it seems that these features are in some sense configuring themselves into these nice shapes that actually look like these Thompson problem Solutions [Applause] is there anything beyond that comment you just made in the paper I mean obviously it's a natural thing to guess like I suppose but is there any reason to think that the I mean we have two minimization problems right we're trying to minimize the error the Reconstruction Era and that's that's one loss and here's another loss but I don't it's not obvious to me that they really give much evidence mathematically that those two are connected is that right yes uh they don't give much evidence mathematically except to say that that look the shapes are the same um so I'm not super not super sure how relevant solving this one this is my impression from yeah and I didn't read it as carefully as you but I mean they see these things like square anti-prisms and I feel like that is also something that comes up in the Thompson problem and therefore they feel like the connection is is quite strong but that it feels like about as far as it goes unless there's some part I haven't read in the appendix that makes this connection no I eventually attend this as well and I don't think they okay it's gonna be on saying the shapes look the same which is yeah so I think that's maybe not a bad thing for us considering that sort of thing this problem is somewhat very hard well solving the Thompson problem may be very hard connecting this reconstruction problem to the Thompson problem is almost certainly easier than that although that's not saying very much and that's fine it's I'm not criticizing the the authors but I just wanted to make sure that I wasn't missing something here yeah no no I agree [Applause] um another thing that uh I thought would be worth mentioning since it's quite relevant to this is this Lemma which has put more precisely exactly how many say some kind of like upper bound on how many features could you feasibly embed into and dimensions and this is called the Johnson London Strasse summer come on okay so we're going to let K be a positive integer and n and we'll let Epsilon be some number
between zero and one then if K is greater than this number here then any set of points v and end points and R to the D there exists a map from out of the day to out of the k where K is less than d such that we can bound the length of the two two vectors like so okay so what does it say this is basically saying that given a Vector in a set of event pick a random set of random Vector in a high dimensional space then we can project it into this lower dimensional space with it preserving its length up to a scale factor of one plus or minus Epsilon thank you and we can do this if it was like a random deck to say these people at random we could do this with probability uh of order one over N squared so if you're in a very very high dimensional space say n is very large um sorry uh-huh sorry you will fail to do this with probability of of one N squared so if n is very very large there's a very very good chance that you could do this um and this is what was the point of writing U minus V here oh F doesn't happen right okay so then of course not linear in which case the way you said it doesn't doesn't really correct right it is significant but it's so it's preserving distances with some yeah it is in the proof they in the proof that they do use a linear value for the F like the one the match right there exists is linear so what do you mean sorry oh okay but if it's linear then there's no point writing U minus V right because F of U minus F of V is f of U minus V so you might as well just replace this by like a single vector yeah that's a good point okay yeah sure I could have just written that okay but you're sure that the statement is there exists a linear f uh it just says the statement the same is Alfred say this exists a map and then in the proof I read they chose one that appeared to be linear to me okay I think if you had a vector well it matters very much what that kind of map is otherwise it's kind of either total logical or the map used for the embedding is at least lip shits and can even be taken to be an orthogonal projection ah there exists a linear map or the statement on Wikipedia says linear so sure okay sorry but then okay also the statement on Wikipedia has the way you wrote it with an F U minus FV which is kind of weird but all right yeah fun I think it's just it might just be the way it's presented because the profile men of uh also used minus V well maybe that's where Wikipedia to go from um I don't know if uh I can give the proof If you think that would be a good just
fat time or I could sure yeah move on yeah all right uh I'm going to probably just prove one of the inequalities since the other ones pretty much the same except modular just the uh mono plus the minus Epsilon okay so we're going to assume that uh K is less than D otherwise it's probably obviously true um then I'm gonna take a random key to mention all Subspace best and we'll let VR do the projection CI Prime beta projection onto sorry be the projection of the point into s okay and then just gonna assign some variables to make the rest of this a bit more concise we're going to let L be this quantity and near be this quantity thank you okay so this proof will assume a bound that I can either prove at the end or depending on how much time we have so the band we're going to use is going to call a is if beta is less than one then the probability that L is less than or equal to later on K Over d will be bounded above by this I'm not following so V is R to the D is that right uh yes and where do the VI's well these VI is a point in V where V is uh I said a set of end points in order to do okay I see so all right a projection uh say the projection well okay so it would yeah okay all right I got you so this is bounded by this quantity and I will prove this bound at the end okay uh I guess we've got an Xbox uh yeah that's a good question no I was agreeing okay so applying that band that wheel takes now we can ban this above by this quantity so then just simplifying a little bit and here I'm using the fact that since Epsilon is between 0 and 1 this logarithm will be bounded above by this quantity here which is the first two terms of the logarithm and I'm just canceling that and we'll spend that up we get this okay but we're assuming that K is greater than a certain value in the Lemma so I'm doing some simplifying so then K Epsilon squared of four is greater than 2 log n plus K Epsilon Square on six and then multiplying both sides by the negatives we get this okay now applying this to the exponential I think you want to change the direction of the inequality don't you I definitely do yes that's why I did that let's see that's good that's good oops okay uh then we can just put his credential this you lost the B between I did thank you this will be between this will be small this will be between zero and one so this will be bounded above by just this term okay um okay uh so in a similar in a so this is the one side of the proof for the one minus Epsilon of the inequality
and then can I just run through what you've done here and see if I've understood this correctly so mu is the L is the distance between the projected pair from this set this sample that you took the meter was the distance before you projected and then we've shown that as n gets very large with very high probability there's an inequality is that right exactly cool um so similarly we can also show that so sorry just a maybe write out again so we have this and then on the other hand by a very basically the same calculation and a similar assumption we'll also have this okay now we're going to let this FB right now I'm confused why shouldn't it be one plus Epsilon in the upper bound and one minus Epsilon in the lower bound uh so what we're going to show here is we're gonna we're trying to establishing the section where this won't work if that makes sense so we want to oh yeah yeah okay yeah that makes sense yep okay so now we're going to set that linear map today at least earlier so this is a scale up in the front all right when I first saw this Lemma I thought oh there's some clever linear map that does it but that's really not the point it's like no basically you find any linear map like you pick any projection and it works I guess this is scale factor but who cares yeah so now I'll just substitute once again in my mu and my l I just defined this to make this shorter to write every time okay then I can stop shoot in my f [Music] ifth oh sorry I can't do this on here sorry I moved the DMK to the other side and then I bought it inside the square root and then I'm going to call this thing my f it's only replaces with that I didn't actually just sub it off there all right okay [Music] see I think it's quite important for us to work for it to be linear as you can see here okay so let me get this is bound like so I remember this thing that we cruised earlier is bounded by one on N squared so a copy and paste feature on this one [Music] [Music] okay so what this is saying is that the probability that uh this quantity inside the probability sign here is not in the range of that's right up Okay so in the worst case scenario the probability that this quantity star which is the the division with the two Norms is not inside the range that we want is at most 200 N squared Okay so if we were to then say foreign they lie outside the desired range is going to be at most into these two points where the probability that they're outside the desired range is now 2 on N squared and then n choose two can be expressed
as n minus one and on two so this is the this one minus one on n this can be thought of as like the lower bound on the desired probability of what we want so this is like the worst case scenario is that the probability that you pick two points and they lie in the rounds you want is going to be this number here so if n is very large there's you say if n is like a thousand then there's a very good chance that the 2.0 picks will be inside the side range because it would be like very close to one so therefore we have one minus Epsilon is in that range with probability at least that and that's pretty much the proof uh modular proving that band would I assumed um yeah I think we can skip that yeah agreed uh Hamilton time yeah it's worth seeing um so we're going back to the toy models yeah I wanted to yeah yeah I want to talk a little bit about the non-uniform super Collision case oh I think also maybe I'll show a big picture that I I find helpful um so like say you've got a one-dimensional line of like a 1D vector and you say what's the probability that I could pick another Vector in that space that is orthogonal to this vector well here it's zero the question doesn't really make sense but if instead in a 2d sphere a 2d Circle if I had a vector say going from zero to the top then the probability picked to a vector that is say this is distance Epsilon there's orthogonals without is going to be inside this range here and then similarly on the sphere that's gonna if the vector that you want to be orthogonal 2 is this one well then you've got this whole equator of choices to pick one that is almost so you can see and then I'll leave the 4D cases of exercise for you guys later to visualize but you can see here that um each time the dimension grows you get much many many more choices to choose factors that are available so that's why this Lemma was uh is of importance because the model in the toy models paper once or for once vectors to be a functional because these vectors are features and we don't if we look at if we like want to look at a model and be sure that this feature is there and not just some crazy combination of other features well if they're orthogonal but it's quite clear that the uh the feature is the one that is present okay let's maybe get the next picture up down this is the sorry not not I'm not using the number system B grade on number two three three yeah that's like the Pentagon and the Diagon again I think the link between this Lemma and what they're saying is it's
not meant to be taken two literally right so for one thing I mean our projection is definitely not a linear map but somehow presumably we're doing better than the the Lemma would let us do is that right I mean like somehow they're doing better well the motivation for the lemmer is to say that if you allow some error then you can pack more vectors than you think but it's not like the precise like uh you know whatever figure it was if I open up the Lemma again um dilemma says oh so you're saying the model the models are lying for a much larger error and just packing way more stuff into it well I mean there's this particular lower boundary K has to be greater than or equal to four on some polynomial and Epsilon red yep and so it's not like these particular values in the Lemma really have any bearing to what we're seeing the model do right it's just a general Point yeah if you allow some kind of non-zero dot product you can fit more than you think or something like that yeah exactly especially in Supply dimensional spaces like the ones that this model is trained on okay so once again we're gonna look at this graph a fair bit so now we have five features with two dimensions and once again the importance of all these is going to be one except when we choose to vary the this past the importance of one of them so this section here this picture talks about non-uniform superstition which is the case where not all features have equal importance so I think it's interesting here is that once again as a function of sparsity as you vary the sparsity of certainly interested in bin but now I'm confused so I thought all the experiments we were talking about had this kind of Decay schedule for the importance where there was one direction that had importance one and then the next one was like 0.9 and the next one was 0.9 squared no no the last graph sorry was All Uniform important and the weird shapes one that all had important one I see okay thanks yeah this is my name they it's they jump around a little bit and this is when once again they remove the importance of all of them so actually so actually in this case as well sorry this is uniform importance because every yeah sorry this is non-uniform position that's the difference not the importance yeah the importance is still one but somehow not all the features are equal because they've not instead of like before they were varying the sparsity of all of them simultaneously not this one yeah yeah but now they're just doing one right
and I think the idea of why they do this is they can show that uh they want to show that uh uniform supervision is in some sense a perturbation theory of the uniform case okay um so we can see here on this Pentagon Diagon phase change at the bottom we can see that the the top left graph stays as the digon regime for certain values of sparsity of this fifth one that they vary in the middle and then as soon as it crosses over this certain range of the certain value of sparsity then it switches to this Pentagon regime and then as it gets more and more sparse these arrows show how the Pentagon changes as it goes up the Lost curve um they don't really go much Beyond this one graph for the non-uniform case they just sort of use it as a yardstick to say hey look it seems as though we can think of non-uniform sparsity as perturbing the sparsity of one so they fix the sparsity of all others and then just change the opacity of one feature and see how that affects the um the graphs and then they do a similar experiment next where they will vary the sparsity of pairs of features or what they call correlated features so I don't know if I don't think I sent a picture for that one but in a similar vein I do have one more picture is the other one oh there's a couple more uh oh no sorry there's other ones yeah um yeah the correlated feature section is instead of varying viscosity to say one point they fixed the spot they vary this Varsity of pairs with points um and these pairs of points will now be zero with will now be non-zero with the same probability so maybe I'll write this down foreign spaces then the model prefers to so here I'm going to say we have four I don't understand here so they're non-zero with probability P equals s i what does this mean so yeah so there are other there are other both zero probability I saw right or the neither zero so they can be thought of as like being linked in there okay so when you're when you're sampling I mean the definition of the distribution of vectors is for all the other one all the other coordinates you do what we said before but for these two you use you sample according to the probability s and then if if it so with probability s you're supposed to actually sample both of them independently from a uniform distribution X1 and X2 yep uh otherwise they're just both zero is that right that's exactly right yeah okay thank you so in that case so let's have severe four features and the embedding Dimension is going to be two just so I
can draw a picture um and I'm going to indicate correlated features perhaps just with color so here X1 and X2 are correlated and X3 and X4 are correlated so then when the model tries to embed these dimensions it's prefers have correlated features be orthogonal which will make sense since you want features that occur at the same time to be orthogonal because you want to be able to discern which features which so this is one of the results of the model is that it embedded these features like so where so X1 and X2 are orthogonal and take three and X4 so the model can because since someone was trying to tell that these features always occurred together or they didn't it seem to likely put them in orthogonal spaces similarly there's anti-correlated features which as you might guess is if one is non-zero then the other is zero so say if these are anti-correlated if the X3 is drawn from the uniform distribution and not zero then X4 must be zero and vice versa and similarly for these ones if X1 is drawn from the uniform distribution between zero and one the next two is zero in that case it's quite similar but it would like to put them in the same plane since it knows that these features never seem to occur together which once again is makes sense you know it uh once you get the ones that uh in nice pairs um okay actually let's just let's just uh maybe post the next two photos and other family talk over the last five minutes all right so what do you want to say about these correlated features I mean these are definitions but what's the what's the point of introducing these oh um mainly that the model can also learn to do the same thing if you vary the sparsity of one feature and a few various opacity of multiple features but it doesn't break anything it still doesn't because it may be unsurprising okay uh yeah let's move to the next set of boards and I'll post the pictures there that's all right yep just one next one I'm not sure yeah so these are ones that will indicate what's happening over the training of a model uh to my knowledge these aren't probably stupid relevant for SLT just because I think we don't really talk much about like gradient descent or Tyler models are trained in the Bayesian context um oh yeah thanks but if you look at the left hand side for a second this every individual color here is a specific feature and on the bottom of the x-axis for both our the text of the training so how how long it's been training for and the top one shows the future
dimensionality like in the first graph all the funny pictures and all the different diagons and trigons and tetahedons and the bottom one says the last so looking at software for a second we see that each feature will sort of approach a finite dimensionality over training so this is just the model learning to put them into the feature into the dimensionalities it wants to have them in so as you can see in this case they all either go to zero not being learned or into a half so this is going to be example of the Diagon solution where we have pairs of points in local local R2 subspaces the thing that's interesting to me though is that it seems to the loss seems to change quite drastically once it like moves around these features so it seems to realize that if I put this feature into this dimensionality then all of a sudden you know it I get a I get a big I get a big drop in the loss I get rewarded quite well um which I thought was interesting considering that the importance is supposed to be uniform in this context and so I don't see how putting that feature up there there's also been such a good uniform boss it could just be a certain amount of features at this stage um and the graph next to it is also quite similar it's also showing features that are sort of this is showing the uh the shapes case so the left hand side just shows the dimensionality and the right hand side is now we've got uh some correlated features so here the blue ones are correlated and the green ones are so the blue ones will occur together and the green ones will occur together and you can see initially that if you look at the B section they seem to be moving away from each other together into like as far away as possible and then once they hit sort of the roof of their capability they start to spread out in this 2D plane until they arrange themselves into this uh octahedron shape so there's two and each time these just represent different in Hidden Dimensions right so the top row has m throws it n equals two and the bottom is three is that the only difference here between the two graphs sorry or the the two rows of of the right hand board here so oh yeah yeah um I think the no the top one just like the top down view it's the same it's the same picture oh I see okay yeah although maybe it actually would be the same if you embedded into any posterior that's interesting um but yeah it seems that as it approaches more and more like stable Solutions you see like this like very drastic drops in
the loss like for example here and then a bit at the end when they're spread out and they get a lot when they move down um then you might know better than I but I don't know if LLC would have anything interesting to say about training data or what happens every training I mean I guess it's just like it what happens is what happens it gets it learns to the stuff better and it finds a better arrangement for the features well either over this slide just is nonsense or it's occurs in a higher percentage of training runs in which case it's nothing to do with any particular trajectory it's more to do with the structure of the Lost surface right so yeah if it's to do with the structure of the Lost surface then well presumably these circles you're drawing represent regions of very high slope um potentially uh but yeah what that has I mean it's tempting to look at this picture and you know think about it as being something like a phase transition but uh yeah and it's maybe not crazy to think that but uh the authors seem to call it that or they seem confident that is what's happening here um yeah I think uh I think it's appropriate language potentially but I do worry a bit that people are jumping to use the thermodynamic language in like a sort of colloquial fashion um which might dilute its utility if it's sort of comes I mean if phase transition comes to me in any time a loss plot decreases quickly then well that's uh actually not very helpful but more convincing is I suppose that it the difference between the second and the third column uh and the first and the second column okay those those would represent different kinds of solutions so they would probably be you know the transition between them does seem like it's likely to be a phase transition and if there is a phase transition you might see it in the training steps versus lost graph and you might see it in other kinds of graphs as well but uh I don't think just looking at this graph on its own I mean if I just saw this graph I would be quite skeptical emperor that it's evidence of a phase transition but combined with the first two rows the thing above it I I think it's probably legitimate although one I think has to argue a bit harder to really make this case no I would agree okay uh since there's only a few minutes left I think I might just stop there um yeah thanks a lot better any questions please once again this is much more like what I got out of reading this paper um maybe others could obtain further knowledge that I couldn't