WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed.
The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except
for errors at the level of individual words during transcription.
In this video, Dan explains the differences between Recurrent Neural Networks (RNNs) and Transformers. Transformers are better at learning context as they can map contexts to sub-distributions within the data. He also discusses in-context learning, gradient descent, parameter space and contexts, and how they relate to each other. To investigate in-context learning in a simpler setting, he considers a one-layer tanh network with repeated weights that can handle arbitrarily long inputs, or a recurrent neural network run for two recurrences. Finally, he explains how the effective weights W Prime determined by a context can be used to understand the inner workings of Transformers.
In-context learning is a Transformer capability in which a sequence of tokens provided as a prompt configures the Transformer's predictions. The model uses query, key and value weights to output a probability distribution over the next token, and the cross entropy of this distribution against the true next token gives the loss. Currying is a concept from programming languages in which one input of a multi-argument function is fixed, producing a function of the remaining variable; providing a context to the Transformer is a form of currying which produces a new model. The original weights W correspond to a very complicated singularity, and the question is whether there exists a W Prime such that the curried model is again a function of the same class, i.e. a Transformer with weights W Prime.
The paper discusses gradient descent, parameter space and contexts, and how they relate to each other. It also looks at how contexts might be mapped to directions in parameter space, and how a small update to the weights can drastically change the output, as in the "treacherous turn" scenario. It further explores in-context learning, for example a language model that follows the instruction "repeat after me" and thereby effectively becomes the identity function. This is analysed using a Transformer with attention, where information is propagated from the left (context) entities to the entities on the right, and from each entity to itself, over multiple rounds. The difference between TW computing Y and TW applied to the context followed by Y lies in the attention paid to the context entities, which over multiple rounds can change the entities in the right-hand box.
The speaker discusses how to calculate the attention update for a given entity, suggesting that terms with similar value vectors should be grouped together and the log-sum-exp approximation applied. If there is a clear winner among the dot products, the update looks like the largest term, which could come from either a frozen or an unfrozen entity. The speaker then suggests a different approach to understanding the learning theory, involving computing the gradient update of a Transformer with respect to a particular weight, and proposes looking for the simplest example of a model that demonstrates in-context learning. Suggestions include a one-layer tanh network with repeated weights to handle arbitrarily long inputs, or a recurrent neural network run for two recurrences.
The speaker discusses the differences between Recurrent Neural Networks (RNNs) and Transformers. Transformers are better at learning context as they can map contexts to sub-distributions within the data: a prompt is mapped to a conditional distribution which is smaller than the full joint distribution over all possible sequences of tokens. All of these effective weights live in the same weight space, which is the novel feature of Transformers. Dan discusses the Transformer model, how prompting relates to fine-tuning, and questions whether this can be considered learning. The function mapping a context to W Prime can be used to understand the inner workings of Transformers: contexts effectively move the model to different parts of parameter space, and studying those parts may reveal what the model is capable of.
In-context learning is an emergent capability of Transformer models; it is not well understood why it emerges or what exactly it is. The architecture of a Transformer consists of an embedding layer, multiple Transformer blocks, an unembedding layer, and a softmax applied to the final vector. Layer normalization and the feed-forward layer are ignored for this discussion, and the focus is on attention with a single head. In-context learning is when an initial sequence of tokens is used as a prompt to configure the Transformer to be the kind of model that can answer the question that follows. An earlier example was recalled to explain the concept.
The Transformer model takes a sequence of tokens and outputs a prediction of the next token. It consists of three sets of attention weights - query weights, key weights and value weights - which may vary between layers. A key observation is that a single fixed set of weights determines a whole family of functions, one for each context length, each of which outputs a probability distribution over the next token. This is an important perspective for in-context learning. The same observation applies to any model that processes sequences, such as an LSTM.
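A minimal, self-contained NumPy sketch of the kind of single-head, attention-only decoder described here (layer normalization, the feed-forward layer and multiple heads are omitted, as in the talk; all weight matrices, sizes and the scaling are illustrative assumptions, not the model from the seminar):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 10, 16
W_E = rng.normal(size=(vocab, d)) * 0.1   # embedding weights
W_Q = rng.normal(size=(d, d)) * 0.1       # query weights
W_K = rng.normal(size=(d, d)) * 0.1       # key weights
W_V = rng.normal(size=(d, d)) * 0.1       # value weights
W_U = rng.normal(size=(d, vocab)) * 0.1   # unembedding weights

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def next_token_probs(tokens):
    """Single-layer, single-head attention-only 'Transformer'.

    The same fixed weights handle a sequence of any length n, so one
    parameter point determines a whole family of next-token predictors.
    """
    E = W_E[tokens]                                      # (n, d) entity vectors
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    scores = Q @ K.T / np.sqrt(d)
    mask = np.tril(np.ones((len(tokens), len(tokens))))  # causal: attend left only
    scores = np.where(mask == 1, scores, -np.inf)
    E = E + softmax(scores) @ V                          # residual update of entities
    return softmax(E[-1] @ W_U)                          # distribution over next token

print(next_token_probs([1, 4, 2]))          # length 3
print(next_token_probs([1, 4, 2, 7, 0]))    # length 5, same weights
```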
In-context learning can be modelled mathematically as follows. The Transformer takes an input sequence of L tokens and outputs a distribution over possible next tokens; the cross entropy of this distribution against the true next token gives the loss. When a worked example is available, the same Transformer, with the same weights, can be applied to the example's L+1 tokens followed by the new question, giving a new distribution whose cross entropy may be lower than the first. This is the sense in which context can improve prediction accuracy.
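In symbols (a reconstruction of the quantities described above, using the seminar's notation T_W for the Transformer with weights W):

```latex
% Loss of the bare question: cross entropy of the true next token against the prediction
\ell_{\text{no ctx}} \;=\; -\log \, T_W(y_1,\dots,y_L)\,[\,y_{L+1}\,]

% Loss when the worked example x_1,\dots,x_{L+1} is prepended as a context
\ell_{\text{ctx}} \;=\; -\log \, T_W(x_1,\dots,x_{L+1},\,y_1,\dots,y_L)\,[\,y_{L+1}\,]

% "In-context learning" refers to the possibility that \ell_{\text{ctx}} < \ell_{\text{no ctx}}.
```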
The speaker poses two questions. First, is there some similarity between in-context learning and learning by gradient descent - that is, is the model obtained by prepending the context similar to the model obtained by taking a gradient step? They note that providing a prompt may not always help, and that the gradient step involves a learning rate with no counterpart on the in-context side, so at best there may be some particular learning rate for which the claim is quasi-true. The speaker then poses a slightly different question to discuss.
Currying is a concept from programming languages in which one input of a multi-argument function is fixed, producing a function of the remaining variable. Providing a context is a form of currying which produces a new model. Question two asks whether there exists a W Prime such that currying gives another function of the same class, i.e. a Transformer with weights W Prime. If true, this means that when feeding large amounts of text to a Transformer, each sub-distribution in the data would correspond to some W Prime. W Prime is a function of the context, and the original weights of the Transformer can then be thought of as something like a Universal Turing Machine that routes to these W Primes. The original weights W correspond to a very complicated "mother" singularity.
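A minimal sketch of the currying operation in Python (the toy_model here is an illustrative placeholder, not the Transformer from the talk):

```python
def curry(model, context):
    """Fix a context prefix, producing a new one-argument model of the same shape."""
    def curried(question):
        return model(context + question)
    return curried

# Toy placeholder model: "predicts" the most frequent token in its input.
def toy_model(tokens):
    return max(set(tokens), key=tokens.count)

# Question two asks whether the curried model equals the same class of model
# run with some other weights W Prime.
t_w_prime = curry(toy_model, ["calculus:", "d/dx", "x^2", "=", "2x"])
print(t_w_prime(["d/dx", "x^3", "="]))   # the curried model takes just the question
```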
The papers in question claim that a single step of gradient descent looks like what you get by prepending a context, but the speaker is skeptical: the mathematics holds only under many assumptions, and it is hard to see how to get from those approximations to a more general argument. The speculative picture offered instead is that parameter space is dotted with child singularities, with the mother singularity W redirecting queries to them. The speaker is not convinced by the evidence presented in the papers and believes more general evidence is needed.
The transcript discusses the "treacherous turn" example, in which a particular prefix makes the effective model TW Prime behave very differently from TW. The relevant notion of similarity for question two is that the two models compute on their representations in a similar way, not just that their input-output behaviour matches. When contexts are drawn from the same distribution as the question, one might expect W Prime to be close to W, so that contexts are effectively mapped to directions in parameter space; in that picture there may be many nearby singularities, and the context selects which one to move to with a small update.
In AI safety there is the idea that a system might detect when it is in training and behave differently once deployed. The analogue in language is that a particular token can radically change how the model responds to future tokens. In-context learning is probably strictly more general than a local gradient step: some contexts (such as switching from philosophy to calculus) do not look like local steps at all, while others (one calculus problem following another) may behave like local steps, and the latter is the case the papers address.
The transcript turns to analysing in-context learning with a concrete picture: for example, a language model that follows the instruction "repeat after me" effectively becomes the identity function. In the Transformer, attention propagates information from the left (context) entities to every entity to their right, and from each entity to itself, over multiple rounds. The update rule for an entity is written out, and the attention sum is split into contributions from the context entities on the left and the question entities on the right; it is the attention from left to right that matters for in-context learning, as reconstructed below.
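Reconstructing the update rule discussed in the seminar (single head, single layer; the exact normalization is suppressed):

```latex
e_i' \;=\; e_i \;+\; \sum_j \operatorname{softmax}_j\!\big(q_i \cdot k_j\big)\, v_j ,
\qquad q_i = W_Q\, e_i, \quad k_j = W_K\, e_j, \quad v_j = W_V\, e_j .

% Splitting the attention sum into context ("frozen") and question ("normal") entities:
\Delta e_i \;\propto\;
  \underbrace{\sum_{j\ \text{frozen}} e^{\,q_i \cdot k_j}\, v_j}_{\text{purple lines: context} \to \text{question}}
  \;+\; \sum_{j\ \text{normal}} e^{\,q_i \cdot k_j}\, v_j .

% The first sum is exactly what distinguishes T_W(y) from T_W(x_{1:L+1}, y).
```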
The difference between TW computing Y and TW applied to the context followed by Y lies in the attention. The entities are split into two groups, frozen and unfrozen (normal): the context entities are assumed to be frozen past a certain layer of computation. The extra term in the curried model is the attention that the frozen context entities contribute to the normal entities - the purple lines in the diagram. This attention acts over multiple rounds and can change the entities in the right-hand box, and it is precisely this contribution that needs to be understood.
The transcript discusses how attention behaves generically. In sufficiently high dimensions, two randomly sampled vectors are orthogonal with probability one, so unless something has been arranged, a query vector and a key vector will have a dot product of approximately zero; this is the generic case. In that special case the context is irrelevant and TW equals TW Prime. The more interesting situation assumed next is that for each frozen entity there is an unfrozen index whose value vector is close to it. Note also that the entity e_i is updated in each layer, so the attention can differ from layer to layer and the context can be taken into account.
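A quick numerical illustration of the near-orthogonality claim (the dimension here is an arbitrary choice for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                  # a "sufficiently high" dimension
q = rng.normal(size=d)                   # a generic query vector
k = rng.normal(size=d)                   # a generic key vector
cos = q @ k / (np.linalg.norm(q) * np.linalg.norm(k))
print(cos)   # typically of order 1/sqrt(d), i.e. close to 0: generic vectors are ~orthogonal
```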
In this part of the transcript, the speaker calculates the attention update for a given entity: terms with similar value vectors are grouped together, and the log-sum-exp of the resulting exponents is taken. Log-sum-exp is a commonly used smooth approximation to the max function in machine learning, with an error term of order log n. If there is a clear winner among the dot products, the attention update looks like the largest term, which could come from either a frozen or an unfrozen entity.
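A small numerical check of the log-sum-exp-as-max approximation mentioned here (the sample size and scale are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(scale=10.0, size=50)          # 50 attention scores with a wide spread
lse = np.log(np.sum(np.exp(x)))
# max(x) <= logsumexp(x) <= max(x) + log(n): the error term is of order log n,
# so when one score clearly dominates, the sum behaves like its largest term.
print(x.max(), lse, x.max() + np.log(len(x)))
```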
The speaker describes a calculation which reduces the attention update over both frozen and unfrozen entities to an update over only the non-frozen entities, but with modified key weights. This calculation is not meant to prove that the outputs of the two Transformers are similar, but to suggest that the statistical learning theory of such large models deserves serious attention: these models may be learning many sub-models, each with different effects, and this coordination of multiple singularities by a central routing singularity may require a different approach to statistical learning theory.
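A schematic version of that reduction, under the seminar's stated assumptions (each frozen entity t has a normal entity a_t with a nearby value vector, and a clear winner exists among the competing dot products):

```latex
\Delta e_i \;\propto\;
  \sum_{j\ \text{normal}} e^{\,q_i \cdot k_j}\, v_j
  \;+\; \sum_{t\ \text{frozen}} e^{\,q_i \cdot k_t}\, v_t
\;\approx\; \sum_{j\ \text{normal}} \Big( e^{\,q_i \cdot k_j} + \sum_{t:\,a_t = j} e^{\,q_i \cdot k_t} \Big)\, v_j
  \qquad (\text{using } v_t \approx v_{a_t})
\;\approx\; \sum_{j\ \text{normal}} e^{\,q_i \cdot k_j^{*}}\, v_j ,
% where, by the log-sum-exp / max approximation, q_i . k_j^* is the largest of the
% competing dot products. The last expression is read as attention over the normal
% entities alone, but with the key weights W_K replaced by modified key weights W_K'.
```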
The speaker mentions a different line of investigation: computing the gradient update of a Transformer with respect to a particular weight and comparing it to the change produced by in-context learning. A participant then proposes that any multi-input model can be viewed through currying as routing between sub-models, and asks whether the Transformer is really the simplest model in which to study this. The conclusion is that simpler models may work, and that even a single attention layer on its own would be relatively simple.
The discussion turns to looking for the simplest example of a model that demonstrates in-context learning. Drawing an explicit diagram for a small model would help clarify the concepts and avoid type errors. Suggestions include a one-layer tanh network with repeated (tied) weights so that it can handle arbitrarily long inputs, or a recurrent neural network run for two recurrences; the proposal is to cook up a very simple model of this kind to investigate in-context learning, as sketched below.
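A minimal sketch of the kind of weight-tied toy model being proposed (the specific tanh recurrence, sizes and initialization here are illustrative assumptions, not an architecture from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 5
W_in  = rng.normal(size=(d, vocab))   # shared input/embedding weights
W_rec = rng.normal(size=(d, d))       # shared recurrent weights (reused at every position)
W_out = rng.normal(size=(vocab, d))   # shared unembedding weights

def toy_next_token(tokens):
    """Weight-tied tanh recurrence: the same weights handle any input length."""
    h = np.zeros(d)
    for t in tokens:
        x = np.zeros(vocab)
        x[t] = 1.0
        h = np.tanh(W_in @ x + W_rec @ h)
    logits = W_out @ h
    p = np.exp(logits - logits.max())
    return p / p.sum()                 # distribution over the next token

print(toy_next_token([0, 3, 1]))        # works for length 3 ...
print(toy_next_token([0, 3, 1, 4, 2]))  # ... and length 5, with the same weights
```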
The speaker discusses the differences between Recurrent Neural Networks (RNNs) and Transformers and shares the intuition that what a Transformer learns to do, past some point in training, is map contexts to sub-distributions within the data - that is, to singularities or regions of parameter space. A participant notes that the Transformer at a given point in weight space defines a gigantic joint distribution over all possible sequences of tokens, so a prompt picks out a conditional distribution which is smaller than the full joint. The interesting novelty is that all of these effective weights live in the same weight space.
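In symbols, the conditional distribution picked out by a prompt factorises autoregressively (a reconstruction in the seminar's notation; this equation is not written out in the talk):

```latex
p_W\big(y_1,\dots,y_m \,\big|\, x_1,\dots,x_L\big)
  \;=\; \prod_{i=1}^{m} T_W\big(x_1,\dots,x_L,\,y_1,\dots,y_{i-1}\big)\big[\,y_i\,\big] ,
% a distribution over continuations of the prompt, which is "smaller" than the full
% joint distribution over all length-(L+m) sequences defined by the same weights W.
```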
The task being performed is to map a context to something like a conditional distribution, although the true distribution is not known, so one is not literally conditioning. On one interpretation, the Transformer itself is the learning algorithm: it takes in training data (the context) and produces a predictor. Learning does not necessarily have to mean gradient descent; it could mean moving to a different point in W. In the gray book, by contrast, learning is Bayesian and does not mean moving to a different W: one takes examples, produces a posterior, and averages over the posterior.
The transcript discusses the learning process for a hierarchical distribution, where past some point in training the Transformer's configuration is more about routing than about solving any individual problem. It asks how the overall prior and posterior relate to the sub-models that learn the parameters for each component of the hierarchical distribution. The discussion also returns to currying as a perspective: a function of two arguments can be viewed as taking both at once, or as a function that takes one argument and returns a function of the remaining argument.
Dan discusses the Transformer model and whether this form of information acquisition can be considered learning. Over the course of processing its input, the model acquires information (for example, that "he" refers to "Matt") and represents it in the entities; this is a form of information transfer similar in spirit to gradient descent, where information is accumulated from the data into the weights, just in a different way. He then poses the question of what one would do with the function W Prime of the context if it were handed to us.
The function mapping a context to W Prime could be used to understand the inner workings of Transformers. By breaking the learning problem down into sub-distributions, one could analyse the corresponding parts of parameter space individually and gain a better understanding of the model's properties. It is also worth noting that Transformers can condition themselves by reading the tokens they predict, which could allow them to move to different parts of parameter space. Studying W itself directly is likely to be futile, as it is a very complicated singularity.
all right thanks for waiting okay so uh is it time yeah so I'm going to start this by A Brief Review of what in context learning is and then we'll do a little bit of mathematics and try and talk about what it means for understanding in context learning in the situation of singular learning theory and statistical learning theory in general before we get started some reasons to care about in context learning very briefly so this is one of the emergent capabilities of Transformer models it what it's what underline underlies some of their capabilities [Music] we don't have a good understanding of why it emerges or what exactly it is just one second [Music] and we would like to understand how it seems to be that Transformer models are capable of many different tasks within one model and the role of in-context learning and selecting tasks by providing context is the key to that so if you'll follow me around to this other set of boards so I'll just remind you of the architecture of Transformers briefly so please follow me okay so here I'll just recall from the form had a paper uh the decoder only form of the Transformer so it takes in entities uh processes entities it's input tokens so sequences of tokens those are mapped through I think the diagram is over here these are mapped through an embedding layer which changes those tokens into vectors entities those entities are then processed over multiple Transformer blocks then there's an unembedding layer a linear transformation on each of those entity vectors so each vertex in this diagram on the right hand board is stands for an entity vector and then there's a soft Max applied to that final Vector which produces a distribution of all possible tokens then you make a prediction about the next token so in the Transformer model there are many different ingredients there's layer normalization there's multi-headed attention and there's a feed forward layer shown here at the bottom on the right hand board I'm going to ignore the role of that layer normalization and the feed forward layer and focus on attention and only a single head for the purposes of today okay uh I think this should be on one of these boards something with a context yeah right so this was some time ago we discussed what in context learning is and we had this example so basically if you provide some initial sequence of tokens and then your question that initial sequence of tokens is called The Prompt or the context and that's what configures the Transformer to be the
kind of model that can answer the question given there in green okay so now I think I'm going to begin the story [Music] of a few moment yeah okay so we'll move over to a fresh set of boards please follow me again [Music] okay I'm going to set up some notation [Applause] so let t w denote the Transformer model where the tension weights w so that the notation from last time this consists of three sets of Weights query weights that transform entities into query vectors key weights value weights and these of course may vary between different layers so this is indexed over all the layers in the Transformer and this is as a family of functions where TN takes a sequence of tokens length in and Returns the prediction of the next two [Music] so as a distribution over tokens so you might recall from last time that in the definition of the loss function that you'll sign and find in foghata or in many implementations you're given a sequence of tokens and then there's a contribution to the loss from predicting the next token from every initial segment of those tokens so if you get a token a sequence of tokens of length 20 you'll have a contribution to the loss for predicting token 4 from tokens one two three token 5 from tokens one two three four and so on uh it's equivalent to just consider the case where you're predicting the next token I mean you're not taking a contribution from every initial segment so I'm just going to take the function which Returns the probability distribution of a token n plus one given in tokens now it's important and fundamental to this conversation that a given fixed set of Weights W determines this infinite family of functions sorry not infinite there's a context length so say 4 000 tokens for the current GPT models so for n through one through four thousand and for theoretical purposes Infinity let's say you get a Transformer model with the same set of weights okay so the unlike usually when we're talking about probabilistic sorry statistical models determined by parameters points in a parameter space for each parameter there's a single model for Transformers that's not the case and that's as we'll see somehow one way of thinking about what in context learning is about okay uh sorry that's a perspective right um you're just you're just highlighting this perspective I could equally say that for any any functions um or any model that outputs a sequence I can lstm or something like that yeah it would also be true of an lstm that's right um so it's not unique to Transformers
but it's uh maybe a an aspect of this kind of sequence model that hasn't been sufficiently well thought about from the point of view of theory at least as far as I know and the the phenomena of in context learning is maybe an invitation to think very hard about this because there's a kind of phenomenon of carrying right where you can append some initials segment and then get a new model of the same kind and this um yeah that's of course exactly what in context learning is about yeah yeah cool I just I just wanted to clarify that I'm thinking of it as a perspective just like a way of mathematically modeling what's going on where you basically do the carrying operation to change what you're thinking of as the function or the arguments or whatever that's right um just that I said this out loud but maybe the notation will come up again so the contribution to um of a single sequence I don't know why I switched to L's um the contribution of a single sequence to the loss yeah in particular that should be an l so just a reminder on what the loss function of a transformer is I guess so apply the Transformer to that input um and then you're going to take the cross entropy so this is the cross entropy of the sort of one hot Vector with a one in position XL plus one which is the true next token with the output of the Transformer and that's just negative logarithm the value of the distribution in position XL plus one okay so that's uh at least in Direction just the gradient of this entry we're not going to do much with that but um yeah that's that okay so now I want to talk about the role of of context so suppose the tokens were examples of a task you think about the first L tokens as being the question and the final token as being the answer so in context learning refers to the possibility that on a second question having the information of the first L plus one tokens makes you a better predictor foreign suppose you're given a second example question rather which is why underline so that's a sequence of L tokens then you have two things you can do right you can take the same Transformer so you can take your Transformer and you can feed it that question then it gives you a distribution of a possible tokens yl plus one or you can take with the same weights a Transformer which takes in the sequence of the given example plus your question and then returns some other distribution over possible tokens while plus one now it may be that that cross entropy is actually lower than the cross entropy
you get out of this distribution and you might say that's to do with the example being useful to to learn to predict better for the next question Okay so uh that's one way you can improve your predictor is if that's true to provide the examples of course another way to improve the function given this information is to take a gradient step that is change W to W plus Delta w and then use W plus Delta w to make your prediction and the the papers that sort of stimulated this particular seminar we're not going to talk really about them directly uh because I think they're a bit too flawed to present without further thought in in those particular directions that are being explored but the the question they're raising is can these two things be similar think bluntly the answer is no but it's an interesting question at least it leads to interesting questions so the question is these two ways of making use of the example one is to stick it in the context and by carrying get a new function and the other is to actually take a gradient step and thereby get a new function maybe those two functions are similar and if they were then you would say that in context learning is a form of gradient descent so let me state it as a question formally so I'm going to state two questions first informally is there some similarity between in context learning and learning by gradient descent now of course maybe providing the prompt just doesn't help you at all it could be completely irrelevant to the task of predicting the next token on y for instance it may even degrade the performance right so it's not like it's not like in general you expect any particular relationship between these two things right but if these two sample if these two sequences are drawn from the same distribution for example then you might ask this question okay so here's the first form of the question which is the way I just stated it so uh [Music] questions is this function similar to this function is the formulation clear to everybody okay now well there's a few reasons why you might be immediately suspicious right I mean the gradient step involves a learning rate but there's no learning rate over here so uh certainly for some values of the learning rate it's absolutely false maybe there's some magic that you know value of the learning rate for which it's quasi-true uh who knows but you can see there's various difficulties in even making sense of this question precisely so let me pose and then discuss a slightly different question which I
think is actually more interesting and has some chance of being true okay so as we already um just commented providing a context as a form of currying that's not an analogy that's strictly true so currying as we have a function with multiple arguments let's say two and you fix one of the inputs and thereby get a function of one variable so that process is called carrying in the programming language literature or um it's also a phenomena in just Cartesian closed categories right okay so you can see that this function on the left here has produced a new function by providing some input so providing a context is a form of currying which produces a new model the one on the left hand side and the previous line can ask if this model is a Transformer this is a more general question of course if question one were true the answer would be yes to question two question two says does there exist W Prime such that when you Curry you get another function of the same class okay so question question one positive answer would imply a positive answer to question two but question two is is much more General that's clear and if that is true uh how does w Prime relate to w let me just sketch uh why question two if there's a positive answer why that's quite profound so what it means is that when you're feeding large amounts of text that is made up of sub distributions to something like a Transformer I'll say more about this in a moment but I want to outline the high level idea each sub-distribution will correspond to perhaps a w Prime right and training the model on that sub distribution will be doing something like training or exploring the posterior for it for that sub distribution in a different part of parameter space the model itself right the the overall weights of the original model like the the Transformer the w uh W Prime is something that is determined by the data right maybe there's more than one example here but W Prime is a function of the context so this is the basis for perhaps a picture of and maybe I'll put this on the next board speculative but uh okay so here's speculative picture which you might call Transformer as UTM not quite Universal but uh so here's the idea so so think of the original weights so the full Transformer as the UTM and from the point of view of um of singular learning theory so this this corresponds to some very complicated Singularity so here's here's the mother Singularity w but what actually happens when you make a prediction well there's some text and then Along
Comes something like this and if question two is correct then once you've seen this you're actually no longer really a w anymore effectively what you've done is move to another part of the parameter space and you're generating predictions from it's the same parameter space right you're generating predictions from a model that behaves as though it has weights W Prime now if you see that context every time you see the next example why then you could imagine that gradient descent if that's where the way you want to think about it or the Bayesian posterior near that W Prime the effective Bayesian posterior for predictions of that sub distribution that's conditioned by this sequence of x's effective Bayesian posterior or the effective behavior of gradient descent is actually characterized by the sub-problem and by this Singularity so it suggests a picture perhaps in which parameter space is just dotted all over with these W Primes and mother Singularity W is just kind of redirecting queries to these child singularities that live all over the place okay so that's the kind of picture that I have no idea that's true right but that's a speculation which maybe gives you some idea of why it's so important to understand what in context learning is is about okay so uh as I said I don't I don't think question one is very plausible maybe I'll just briefly say something about the papers which inspired this uh seminar the the references are in the Discord and I'll circulate some notes for this later I won't read them out um so they make claims that uh in some under all sorts of approximations that a single step of gradient descent looks like what you get by prepending this context so they would say the answer to question one is yes I just don't find this plausible I mean the math is is correct but under many many assumptions and I'm not sure I really buy the experiments although maybe they're interesting here's an example that just maybe gets you to understand why you should be scared skeptical of question one uh well um okay so certainly it's false if the Y's and the X's come from very different distributions right and it's just hard to understand I don't see a way of getting from the approximations they're making to a more General argument and so okay maybe I maybe I shouldn't say I think it's false but I'm not convinced by the evidence put forward so far we can talk about it later but I I don't really want to dwell too much on those papers I want to give you some idea of evidence for question
two um okay foreign when you have a context so what we're going to do is just repeat the diagram from earlier but add in more nodes sorry then yeah can I ask a question just to check if I understand question two sure um my my question is uh so the the formulation to us as written on the board which seems to be like trivially yes because well if the original Transformer can learn a language model and we are drawing y from a more restricted sad then clearly there exists a w Prime that you know learns that learns that um so I'm guessing this is more than existence like this this sounds like sounds to me just like this is universal approximation yeah yeah that's a good point kind of thing yeah the way I wrote it down I mean I'm mixing up the role of the algorithm with the function so what I mean more like is that the uh are the algorithms similar in the sense that the way they compute on the representations involved similar not just the input output Behavior that's right so I you could make some argument based on your approximation that there exists some W Prime so that such that the overall functions are similar um perhaps right because W Prime is only some subset of the weights so what I'm saying is that you only get to change the attention weights first of all and uh you could ask for at each stage of the computation perhaps the entities are close to the corresponding entities rather than just the final predictions are close does that make sense so effectively restricted changes from WWE Prime like double Prime is not just a function of a context that's a function of a context and w as well um oh yeah I guess that's not quite doubly it depends right okay okay yeah any other questions is that similar to is that similar to requiring W Prime to be close to W so that it's some kind of local step even if it's not a gradient step yeah it might be I'm not sure I want to even commit to that [Music] so this is a bit of a silly example but um the the idea of the treacherous turn is that when you see this prefix uh TW Prime is suddenly very different to TW right so you could uh you could imagine contexts I mean in the case where they're drawn from the same distribution maybe you'd expect you know W Prime to be close to W and then the direction that relates them then it's more like you map context to directions right and maybe that's then imagine that in this picture perhaps there are many nearby singularities and you kind of uh kind of selecting which one to to move to with some small update
so yeah um I don't know the answer to that why does in deployment as a prompt necessarily moving a very different place in parameters yeah sorry oh sorry um white space yeah yeah there's a bit of an inside joke so uh in AI safety there's the idea that a system might somehow learn to detect that it's in training and be very friendly in training and then as soon as it gets out of its box start misbehaving so um the idea would be that if it sees yeah in deployment then it suddenly starts behaving very differently thanks [Music] okay uh maybe I'm gonna skip the picture one last question yeah so there are like thinking about the domain of language there are many examples where like a bit like the one you just explained where if you see a particular token that's going to kind of radically change how you respond to Future tokens uh I wonder if there are like similar examples where a particular token that you're going to make a gradient update on is going to radically change how you respond to um to Future tokens yeah this is where I I think it seems like it's very good yeah I was just going to say a team like for example um could you know gradient steps are local that I have this feeling that um it couldn't be there could be there could be some things that you could do with a with an in-context step that you can't necessarily do with it with a gradient step yeah I think absolutely so I think probably in context learning is strictly a more General than local learning um it may be that any local gradient step can be sort of mapped back into some appropriate context that I don't know but it does seem like a more General operation uh the case in which it's so there's we've just discussed kind of two different kinds of contexts right one is like my somewhat tongue-in-cheek example uh or something else that says now prepare to answer a calculus question and prior to that you were discussing philosophy um some sequences of tokens like that almost certainly are not local steps right uh so they don't they don't seem very local they change the behavior in in some significant way so maybe those are like these steps that I've drawn on the right hand side but you could then imagine that if you see okay now we're talking about calculus and I give you a calculus problem one calculus problem two well maybe those are actually kind of local steps um that are more like gradient steps uh so maybe some kinds of contexts do provide local steps and that to be fair is what those papers are talking about
so they set up a particular kind of of I mean they assume the soft Max is linear they they set up a particular kind of loss function they're assuming a linear model they set up a particular kind of data set and they show that these two things mathematically are similar so there's some evidence and also experimentally that for particular kinds of contexts there is something similar between um in context learning and actual gradient steps but I think the more General concept is is probably more interesting here's another example that I think is a little bit easier to think about if you assume that the language model is kind of like good at following instructions if you give it the token repeat after me then it turns into the identity function hmm which is very different from a language model Okay so here's my entities I get embedded and then after that there's a tension and I want to distinguish between so this here is the context right it's my new question okay and then I have attention so information is propagated from left entities towards every entity to the right and itself in multiple rounds the arrows that are of concern to us right now are the arrows that go from these kinds of entities over here and right so from this Left Hand Group into the right hand group if those weights were all zero then you would just have the context Computing on itself and your question Computing on itself and that would just be TW ENT with no context so the prospect of in context learning is to do with this overlap these purple lines line okay so what I'm going to do now is write down the equations and uh we'll do um some playing around with them and see if we can think about these purple lines okay recall the update rule for entity I is I Prime is e i plus the sum over J the soft Max query for I key for J and then there nominator v j where Qi is some Matrix times e i k i is some Matrix times e i v i is some Matrix times Z times e okay so these Eis these are the entities these are the Primitive objects which you think of the Transformer is Computing on queries keys and values are extracted linearly from those things they generate queries keys and values and then we get this form okay so let's play with this a bit so that means let me call it Delta e i which is the change in the if entity in a single step of attention so let me factor out that denominator so I'm going to split the sum I'm going to split the sum into bits that are from into our box uh so imagine this is I'm going to call these entities
Frozen for a reason I'll explain in a moment and these are unfrozen or just normal and I'm imagining my entity I is is in here somewhere right so I'm in some layer of attention and one of the normal entities so the entities that are involved in actually I mean remember that I'm going to at the end of at the end of this I'm going to get an output here and that's going to give me a distribution which I'm going to sample from to make my prediction of the next token so fundamentally I'm operating on these y eyes and EI is one of them okay now the JS J's range are all entities including both Frozen and unfrozen um and the reason why you're not seeing updates on the board no I'm I see the big Kerdi bracket open and then okay yeah I'll look into the logs later yeah thanks you can't see it either all right if you try opening the board do you see it when you open the board wait is it because Dan went back aboard I see that but yeah Dan went back to the first Board of the group and maybe um so I saw both because I was looking on the old cam oh I was looking on the old cam too sorry that was probably me uh do you see it do you see a triangle square love Hut yeah okay good uh okay so I'm going to split this sum into two sums uh one is over J Frozen and one is over J unfrozen maybe just call them normal okay and these these are the purple lines these are messages the VJs received from Frozen entities to update an unfrozen entity and those are the ones we want to understand right because that's the difference between TW and TW with a context I hope that's clear so the difference between TW Computing Y and TW with this context hitting Y is precisely in this attention and it's I mean it's over multiple rounds so once it's run a few rounds it could change the all the entities in the right hand box the normal entities and then maybe who knows what's going on but if we're just going to analyze a single iteration uh these what's happening with the normal entities is is the same as it is in TW and the purple stuff is the stuff we have to understand okay so I want to explain what I mean by Frozen so we are going to assume it's going to be a bunch of assumptions and approximations I'm not saying this is this is like some cute initiation of an investigation right so we're going to assume all entities in the Frozen box are frozen past some layer okay so maybe near the bottom of that diagram compute compute compute but then after some point let's assume that the Eis or rather ejs for J less than or
equal to l plus one are fairly stable they don't change much in particular the that means the vj's KJS for those entities don't change they're frozen okay so let me say something about some special cases before we do uh well a special case before we do a more interesting kind of calculation if you haven't thought much about attention maybe uh maybe it's worth knowing that insufficiently higher Dimensions you randomly sample two vectors and they will be orthogonal with probability one right if there are many directions in your space most vectors are orthogonal to each other so you should think of the queries and keys as being in such a sufficiently high dimensional space so unless there's some particular reason for a query vector and a key Vector to line up this dot product will be zero right that's the kind of generic case something has to be arranged and imagine the the weights are initialized randomly so that when the Transformer is initialized all these things are orthogonal to one another it actually has to tune things to get them to point in the same direction okay so if Qi dot KJ is approximately zero it can be large and negative right you can tune entities to ignore actively other entities that does happen in practice this is the generic case so if Qi dot KJ were approximately zero for Frozen J and my I is fixed this is one of the normal entities so this is approximately zero for Frozen J uh then and you assume that I mean if if there's some attention of normal entities to have eye attending to some normal entity sorry about the cow it's dinner time well then those first some ends will just drop out so you'll get approximately the sum over U not frozen so it'll just look like the attention as though the context isn't there okay which makes sense so this is I.E just what TW would do so this is the case where the context is irrelevant the entities that are in the context just the Transformer has learned that there's no particular relation between them and the entities in the question in which case you just get TW back so I guess in that case the answer to question two is yes TW is TW Prime okay so here's a more interesting situation okay main case for us still a very special case suppose that for each Frozen entity J remember these are Frozen They're not changing from layer to layer let me call it t there is an unfrozen index a t okay so we're talking about a single layer of attention right so EI is being updated in some layer suppose in that layer so in each layer there's different
attention weights suppose in that layer the value Vector for the entity EI well sorry for the uh how should a six so here's the Frozen entities Frozen all the entities are going to attempt to pass information to all right we're all we're all trying to pass information to this guy that's what the update is about now I'm supposing that for each Frozen entity so that's going to it's got some value Vector VT I'm supposing that in this particular update there's some entity over here J whose value Vector is is close to that one right so that's J is 80. save I sort of have some idea how you might weaken this hypothesis but we're going to stick with that for the moment is it is it clear what I'm assuming I'm not convincing you it's reasonable uh but is the Assumption clear okay then we can write the update by grouping the things with similar value vectors so just take the formula on the left hand board and merge the sum over Frozen JS or frozen T's now into the other sum so now it's a sum of a j normal the original coefficient plus a bunch of new stuff this is all the tea Frozen who want to be friends with Jay so a t is J okay and that's all times v j close break [Music] uh that's that's multiplied at the end there all right uh so now we've got a sum of exponentials and it's a standard trick to do at this point of a log some exponential foreign definition or what it says so take a sequence of n real numbers and take the logarithm of the sum of their exponentials it's commonly used in machine learning as an approximation a smoother approximation of the max function the error term is of the order of log n so what is log n so the values of x i are large relative to to n or rather log in this approximation is pretty good foreign okay uh so let's write I mean I just want to take that term in the brackets and rewrite it as a single exponential and the point is just that it looks like the largest one I guess so uh if I let Z be the log some exponential of all those exponents well if there is a clear winner then that will be approximately the clear winner so that's my second hypothesis that in any given update maybe that's one of the unfrozen entities right has the largest dot product and it's it's separated from the others or maybe it's a frozen entity so in the case where there's a clear winner the gradient update looks like this uh and in the case well you can do much better than this this is not entirely unrelated to some of the things that Ben was talking about in his seminar talks on renormalizing
icing models right but okay for the moment this very rough calculation gives you that um the gradient update looks like a sum over the non-frozen entities so if we reduce this update from attention that included information from both Frozen and unfrozen to an update only referring to non-frozen or normal entities but with some modified weights so this is attention over the normal entities but with modified key weights [Music] WK uh right because KJ has changed to KJ star now you actually did this calculation properly uh you would sum over all the value vectors for the normal JS you would take contributions to each one from you would cluster the Frozen entity value vectors by those that are closest to VJ and then you would incorporate I mean rather than just taking the max like I've done here you can do this in a more sophisticated way probably I haven't done but uh but you see the idea right so by reducing this sum to a sum over normal entities we've effectively replaced the key weights the original key weights by some modified key weights and we've got um so so call it WK Prime and with W Prime we didn't change the query rates we didn't evaluates although those would also change under the prescription I was just outlining uh okay at least in this particular step we had the TW Prime of Y was similar to t w of X XL plus 1 y I hope that badly answers your question Edmund so in this case what I'm saying is the attention Works similarly not just the final outputs are similar okay uh yeah any questions I'll say a little bit more but that's the kind of technical part finished I think so as I said this is not meant to convince you that question two is true what I want to say is this is very important maybe some of you should think about it and take it quite seriously because the reasons I suggested on the slide with Transformer is UTM when we're talking about large models like Transformers that contain multitudes and these multitudes being the W primes right these kind of subroutines that are there to process various different tasks we may have to change the way we think about the statistical learning theory of these systems to account for we're probably learning many things right it's like there are many effectively many sub models that are being learned in different parts of parameter space with different rlcts etc etc so this kind of coordination of multiple singularities by a central routing Singularity if that's what's going on this probably calls for a rather
different approach to understanding the learning theory so yeah I think it's um this is meant to be a invitation to walk in this direction this is by no means convincing as it stands there are no questions then uh about what I'm going to say next yeah so what Ben and I were preparing was that we actually spent some time thinking about um about actually Computing the gradient update uh so if you if you back propagate an error like I was showing on the first board through the Transformer to a particular weight then you can compute what that is and try and compare it to what you get from the change from in-context learning but I feel like that's that's going to more into the details in this line of Investigation maybe make sense for a future seminar so maybe I'll I'll stop there and open the floor for discussion maybe let's go back to this interesting slide yeah so I haven't thought about this for the first time today but uh it could be quite interesting to Think Through what it means to effectively learn sub models using a central model like this um like like what is the how do these Bayesian posterias fit together I think there was a question I have a question um is the Transformer do you think the simplest model where it's appropriate to try and study this um because let me propose something I was just thinking about um what we said at the start of that currying and you know LCM sequence models and stuff but actually I think you can kind of think of any multi-input model as you can think about carrying it so I wonder if you could also think about it in terms of routing um sub models and stuff so for example uh let's just say you have two inputs and you're learning a linear function um then the first input sort of takes you to a particular slice of that function um which is going to be another linear model um and thus determines like which which sub-region you're in and how you respond to the second input um so yeah that's that's an example I wonder what you think of it I wonder if you think it kind of is complex enough to capture the essence of this or if you think there would be anything simpler than a Transformer that would do it see no reason why they wouldn't be um yeah it would it would be difficult I mean okay if we just take one multi-headed well one attention layer on its own without all the other stuff uh that's a relatively simple model so maybe that's uh maybe you can do simpler than that um but it's it's possible even just a single attention layer uh
so that's just some linear just some Matrix multiplications and a soft Max maybe that's enough I don't know I mean the model you're proposing is simplest still I mean I think it's a good idea to look for the simplest example for sure would such a simpler model still need to it still needs to demonstrate in context learning empirically somehow uh yeah I guess maybe that would be nice um I suppose the way Theory often works is that you you find the toy model that maybe doesn't even have the particular Behavior but like you can see a theoretical signature of something that might be relevant so it certainly wouldn't be satisfactory to stop there but maybe it's a place to get started it seems unlikely very simple models are going to show some interesting form of in context learning I I don't well I'm not I'm not sure I'm not aware of not aware of a literature that's sort of suggests that this is a very common property yeah I guess my my proposal would be try and in the simple model where you can keep track of the number of sort of moving Parts just all on a page um try and find try and like draw that speculative diagram to clarify the concepts and make sure you're not like making some kind of type error or you know like at least to me that would that would help like what for example does this Singularity routing look like when I'm actually talking about a model with a handful of dimensions and I can kind of visualize it um rather than just sketching some abstract projection of like a high level high dimensional space um yeah that's the proposal yep and then you know what you're looking for when you when you go up to the Practical scale models yeah I guess you could even just take a one layer 10 H network but with some kind of repetition of weights right I mean the the point here is that the the model can handle arbitrarily long input because there's a way that the weights are used is not dependent on the index of the entity right the weights yeah that's important yeah like um uh my example of just a two-dimensional thing um what you when you when you give one input you're left with a one-dimensional thing so that seems like a problem yeah I think that's a good idea of cook up some very simple thing super simple what about like a maybe like a recurrent neural network that you are allowed to do for like two recurrences or something yeah that's that's independent of the index but in a different way so I don't know that the iteration I mean uh of course the distinction between a an
RNN and a Transformer is somehow that the Transformer is wide rather than in that way so I think that's a yeah water would be simpler I mean it seems to me just a one layer attention model I mean a single attention model would be simpler than having an RN into it yeah you're probably right I have another question but I'll let anyone else jump in I was going to ask about uh the singularity explanation you gave on the right hand side of the yeah um why is it that you think these other singularities in blue are simpler is it because that this is to do with the paper showing in contacts in context learning outperforming oh fine tuning yeah this is just completely made up so this is my intuition that I'm sharing but the intuition is that effectively what w does is it Maps contexts to or you might say sub distributions so you have all the data that the model sees and then within that data are various sub distributions like English versus Chinese text for example um some amount of samples are sufficient to figure out which sub-distribution you're in and arguably what I mean according to this intuition to know this is true but you could argue that what a Transformer is learning to do past some point in training is not actually answer the question but learn how to map contexts to sub models sub model is a bad word for it because it's in the same parameter space but I mean map contexts to singularities or like some region of parameter space right so the Transformers like the the purpose of w this original routing Singularity the UTM this analogy is to see X and produce W Prime and W Prime is the model or determines TW Prime which is good at answering questions about X something like that okay caveat import ant does anyone else think of the original the the Transformer TW iven a point in a vast Wade space w as I mean there as as a gigantically complex joint distribution of all possible sequences of tokens and therefore given a prompt your your conditioning there's a conditional that's right which is kind of smaller than the full joint which is what I what came to mind when I saw this simpler or smaller singularities that's exactly what I'm thinking yeah the interesting thing is that the weights are all it's all the same weight space right which is the sort of interesting novelty here yeah and I confess that I kind of struggle with the word learning in in context well in fact that's what this whole seminar is about right it's um all we're really doing is taking is conditioning
uh evaluating a conditional distribution well but you're learning to map contexts to I mean okay you could say it's a conditional distribution but uh you don't actually know what to condition on right you're not actually conditioning so the task being performed here is arguably to map a context which is a very I mean uh I don't know it's it's much less than you'd need to condition the distribution um and you don't know the distribution right so maybe that's the point so uh if you knew the true distribution you could condition on it or on some but uh I don't think it's but it is it does live in the trans I mean it's it's fully determined by w isn't it yeah um but it's okay so the SpaceX on which the distribution is living is okay so that's uh sequences of some length right that's the distribution so we're mapping a parameter to I mean it's like the the distribution that's given by TN rights for some fixed in and then you're saying if you if you fix some of those some of those values then that's conditioning that's true [Music] um I have an interpretation which is like um just that um if you think learning as just some algorithm that takes in data or train examples to um to another algorithm that produce prediction given by the data then I'm guessing the TW the Transformer itself is the learning algorithm um it takes in training data context and then produce another algorithm that can take Y and produce next tokens so but the Transformer is the burning machine it's no longer being learned it's wrong for me to make it's a distribution generator not a distribution implicit in the w I guess yeah that's right sorry I didn't mean oh that was too loud I echoed um I I was I just spent a moment on the psychiatrist's couch with the word learning no I think it's right I mean that's that's uh well it I guess learning doesn't necessarily have to mean gradient descent um but uh you're not saying that either right you're just saying that uh it means moving to a different point in W um at least in uh a priority to me or in the Gradebook for example I agree but then if question two is true then that is exactly what providing a context does yeah uh I think Matt uh sorry I think um I had another question I I wanted to say even sorry I uh just continuing on the on what uh Russell I think uh mentioned just now which is like even even in the gray book learning is not moving to it's kind of critically not moving to uh different W because it's patient uh the the learning process in W is
Sorry, I didn't mean... oh, that was too loud, I echoed. I just spent a moment on the psychiatrist's couch with the word learning. No, I think it's right. I mean, well, I guess learning doesn't necessarily have to mean gradient descent, but you're not saying that either, right? You're just saying that it means moving to a different point in W, at least a priori, to me, or in the grey book for example. I agree, but then if question two is true, that is exactly what providing a context does. Yeah.

I think Matt, sorry, I had something I wanted to say, just continuing on what Russell I think mentioned just now, which is that even in the grey book, learning is kind of critically not moving to a different W, because it's Bayesian. The learning process in W is: take example, produce posterior, and again, take example, produce posterior, and then average over the posterior whenever you get a new input. That's the learning process. Yeah.

One thing I'd like to think through is what this picture means for the posterior. Maybe to take a simple example, and maybe we get nowhere with this, but it might be interesting. Suppose that the context switches you, I mean, it's just a hierarchical distribution: you flip a coin, and if it's heads you roll a six-sided die, a normal one, and if it's tails you roll some biased die, okay? So if you have a hierarchical distribution like that, and you see some information that tells you which distribution it's from, I don't know what that means for the relation between the kind of overall prior and posterior concentrated around this W, so the overall model, versus, you could imagine having sub-models that are learning the parameters appropriate for each of those two components of the hierarchical distribution. You know what I mean? So under this picture, the speculative picture on the right-hand side, the learning at some point, once the Transformer has acquired this configuration, the learning that's happening near W, if you like, is more about routing than it is about actually solving any one of these given problems, right? So W Prime becomes very good at, whatever, predicting the next English word, and W Prime Prime becomes very good at predicting the next French word, and then the original W, you know, it's not so clear that it's actually solving any of those problems individually.
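To make the coin-and-dice example above concrete (the bias numbers here are made up, purely for illustration), the Bayesian "take example, produce posterior" step for that hierarchical distribution looks like this: a handful of rolls is usually enough to tell which component you are in.

```python
# A worked version of the coin-then-dice example: a hierarchical distribution where
# observing a few rolls concentrates the posterior on one component.

fair   = [1/6] * 6                                 # heads: a normal six-sided die
biased = [0.05, 0.05, 0.05, 0.05, 0.05, 0.75]      # tails: a die biased toward 6
prior  = {"fair": 0.5, "biased": 0.5}              # the coin flip

def posterior_over_component(rolls):
    """P(component | rolls) by Bayes' rule; rolls are values in 1..6."""
    like_fair   = prior["fair"]
    like_biased = prior["biased"]
    for r in rolls:
        like_fair   *= fair[r - 1]
        like_biased *= biased[r - 1]
    z = like_fair + like_biased
    return {"fair": like_fair / z, "biased": like_biased / z}

# A short "context" of rolls is enough to tell which sub-distribution we are in.
print(posterior_over_component([6, 6, 5, 6]))
```

The open question raised above is how this component-level picture relates to the overall posterior concentrated around W versus sub-models for each component.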
Matt, you had a question? Yeah, I'm not sure anymore. I too am sort of struggling with thinking of this as just conditioning, but I think it's both. I don't know if it was helpful to anyone else, but I think it's a perspective that you can take on what's going on, like a currying thing, where you can view a function of two arguments as taking them both at once, or you can view it as a function that takes an argument and returns a function that takes an argument and returns another output. So yeah, I just think that if you accept this as a perspective, then the question isn't which way to view it is right, it's more which way to view it is useful for a particular thing that you want to get out of the model. If we want to start prompting these models as a way of tuning them, and we want to have a theory about that process, then probably we should try and treat it as learning and see how it relates to our existing technology: what kind of statistical model is this, what are the parameters? And the question that Dan is asking is something like: is it the same model, just the Transformer model, as if we were training it, as if we were fine-tuning it and tuning the weights? Maybe that's true, maybe it's not; either way, there's some kind of implicit model there, and it would be useful to try and understand it, perhaps from that perspective. That's how you'd go about it, then.

I'm just wondering, because at least my understanding of this is that in-context learning is kind of fine-tuning the entities, as opposed to the W parameters, which are getting multiplied together. So it's still, I guess, indirectly updating the W parameters via the entities, and so I guess I still think of it as learning, yeah, it's still kind of changing.

Yeah, I think that's right. I mean, when you look at how the Transformer processes its input layer by layer, I think by any reasonable definition it's learning, because at the beginning it just knows, for each token, its universal embedding for that token, right? It doesn't, in the sentence, know that there's any particular relation between "he" and "Matt". But over the course of processing the tokens, and its experience in predicting the next token in the English language, it comes to acquire the information that "he" refers to "Matt", and that's represented explicitly in the entities themselves and in the directions of these query and key vectors. That information transfer is learning. It's not learning that sticks around for the next example, but it's information acquisition, maybe, whether you want to call that learning or not, I'm not sure. So the statement here is that that form of information acquisition may in some cases look like the form of transfer of information that you get by gradient descent, which is also accumulating information from the data into the weights, just in a different way.
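A minimal sketch of that per-layer information transfer, assuming toy two-dimensional embeddings, identity query/key/value weights and a made-up sentence (all illustrative, nothing trained): one round of causal attention lets the vector at "he" read from the vector at "Matt".

```python
# One round of single-head causal attention on toy embeddings, to illustrate how
# information can flow from "Matt" into the entity for "he". Nothing here is trained.
import numpy as np

tokens = ["Matt", "said", "he", "was", "late"]
X = np.random.default_rng(0).normal(size=(len(tokens), 2))  # toy "universal" embeddings

Wq = Wk = Wv = np.eye(2)                 # identity weights, purely for illustration
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(2)
mask = np.tril(np.ones((len(tokens), len(tokens))))          # causal: attend only leftwards
scores = np.where(mask == 1, scores, -np.inf)
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

X_next = attn @ V                        # each entity becomes a mixture of earlier value vectors
print(attn[tokens.index("he")])          # how much "he" reads from each earlier token, incl. "Matt"
```

In a trained model it is the learned query and key directions, rather than identity matrices, that determine which earlier entities a given token reads from.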
So suppose an angel comes down and hands you the function W Prime of context. What now? Well, I'm not sure what you're asking. What would you do with it? Like, apart from what the Transformer itself does with it, which is predict the next token, you're asking what implications a better understanding of this mapping would have, for the analysis of Transformers, for example, or for understanding the algorithms that they're executing? Is that the direction of the question? Yeah, yeah.

Well, so you could then look at W Prime and try and understand, I mean, it's just some point in parameter space, right? But if you were to formulate the sub-distribution, so the particular Y's and the map to the next token, it has its own learning problem, right? So within the original training distribution for the Transformer, take these sub-distributions; you can formulate them as independent learning problems with the same parameter space and take their KL divergence, and W Prime would be a singularity of that KL divergence, and you could then understand its properties. That would be kind of a way of decomposing, I mean, arguably, so in this storyline these W Primes are all like subroutines which are combined in some way to be executed, and if you can analyze them independently it would be like a divide-and-conquer strategy for understanding what the overall Transformer is doing. So maybe we have more chance of understanding what these components are doing, their theoretical properties, if we can... maybe they're not a hundred billion dimensional, right? Maybe there's a much lower-dimensional subspace in which these things have their meaningful degrees of freedom, or something, I don't know.

I think what your question just reminded me of is that it's quite important to note that, if question two is true, keep in mind that if you allow the Transformer to read the tokens that it predicts, so you predict the next token and then you process the sentence with that token in it, which is the usual way you interact with GPT-3 for example, by predicting tokens, then it can condition its own model, right? So under this scenario, if you're feeding many tokens into GPT-3, it can actually move itself, effectively, to different parts of the parameter space. So it's quite important then to understand where those places are that it can go, right, if you want to understand what it's doing or what it's capable of. I think the possibility of internalizing the learning process in this way seems quite significant for understanding what's going on inside the model. And from a theoretical point of view, it might be kind of futile to study W itself, right? I mean, we train the model, it's some very complicated singularity, we probably have very little chance of