WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed.
The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.
In this video, researchers discuss in-context learning in transformers: the model's ability to improve its predictions using the context it has already seen. They propose that induction heads are the main mechanism for this process and present experiments conducted on Transformers with up to 13 billion parameters. Their findings show that models with multiple attention layers are better at in-context learning and that induction heads form during a phase transition in training. They also note that removing induction heads at test time reduces the model's ability to learn in context.
In the paper "In Context Learning and Induction Heads" by Olson et al., researchers propose that induction heads are the main mechanism for in-context learning in transformers. Experiments were conducted on Transformers with up to 13 billion parameters, showing a phase transition where induction heads form. Two classes of models were presented; small models with one to six layers and twelve attention heads per layer, and large models with four to forty layers and ranging from thirteen million to thirteen billion parameters. Modifying the architecture to promote formation of induction heads brings the phase transition earlier and removing induction heads during test reduces the model's ability to do in-context learning.
Induction heads are attention heads that implement a completion algorithm: in a sequence of the form A B ... A, they look for the previous occurrence of the current token and try to influence the prediction that the next token will be the one that followed it last time (B). Researchers measure the formation of induction heads via the prefix matching score of particular attention heads, a proxy for the head's ability to detect repeating patterns in random sequences of tokens. Fuzzy versions of this are also possible, where similar tokens are used instead of the exact same ones; this shows up in tasks such as translation.
The prefix matching score (PMS) is used to measure how strongly an attention head in a Transformer behaves like an induction head. A random sequence of 25 tokens is generated (excluding the most and least common tokens), repeated four times, and a start-of-sequence token is prepended. Attention scores are then recorded at each token, and the PMS is the average attention the head assigns to the tokens that followed earlier occurrences of the current token. The start-of-sequence token matters because, when the current token has not appeared before, induction heads tend to direct their attention there; visualisations of where a particular head directs its attention make this visible. The most and least frequent tokens are likely excluded to cover the bulk of ordinary tokens and force the head to use a generic algorithm rather than special-purpose behaviour tied to very common tokens.
In-context learning refers to a model getting better at next-token prediction the further into a sequence it is. A score for this is calculated by subtracting the average loss at the 50th token from the average loss at the 500th token of 512-token sequences, so a more negative score indicates stronger in-context learning. The authors observed a sudden change in three metrics, the in-context learning score, the prefix matching score and the training loss, occurring at the same time for models with at least two layers. The orange window on the graphs indicates the same period for each metric. The middle graph shows one curve per attention head, with the induction heads jumping up significantly; its y-axis indicates the average attention paid to the matching earlier tokens.
Models with multiple attention layers can perform algorithms that look at the earlier token matching the current token and the token immediately following it, which single attention layers cannot do effectively. Plots of prefix matching scores, in-context learning scores, and training loss over training for small and large models show a sudden jump at a particular point, and the heads that most strongly exhibit induction-head behaviour tend to sit in earlier layers of the model.
Researchers have observed a sudden change in the loss of a model when trained on language. This is seen in multiple measurements, such as the derivative of the loss with respect to the token index and the trajectory of the principal components of the per-token loss. The in-context learning score for all models, regardless of size, is approximately constant after the phase transition, suggesting that the advantage of larger models over smaller ones is already evident early in the sequence and does not grow further along it. On the 500-minus-50 test, bigger models are still better at predicting the next token in absolute terms, but the difference in loss between early and late tokens is relatively constant. This implies that more sophisticated strategies do not lead to better scores on this particular in-context learning test.
When increasing the number of layers of a deep learning model, researchers have observed an improvement in its performance on in-context learning tasks. This is likely due to the formation of induction heads, which in the authors' view are more or less synonymous with the ability to do in-context learning. The plateauing of the in-context learning score is possibly related to the choice of the 50th and 500th tokens in its definition. The speaker hypothesises that the Transformer model may be following a simple pattern to predict the next token, but other types of induction algorithms may also be present.
Researchers have introduced a mechanism to one-layer attention models that allows induction heads to form: each token's key incorporates a bit of the previous token's key, so a head that attends to tokens matching the current one effectively attends to the position after the previous occurrence. It is suggested that these induction heads are responsible for in-context learning, as the model's ability to perform in-context learning is reduced when they are removed. Further research is needed to understand the heads and how they contribute to in-context learning.
Researchers studied which heads are responsible for in-context learning in small models using the prefix matching metric, finding that heads with higher scores contributed more. When looking at large models with 13 million to 13 billion parameters, they observed an induction head allocating attention over the same sentence in three different languages, effectively applying a translation before matching, which suggests it was doing more than literal induction. The head scored highly on the metric even though it was doing something more complex, and this was further supported by a synthetic pattern-matching task.
In this talk, the paper "In-context Learning and Induction Heads" by Olsson et al. is discussed. It proposes that induction heads are the main mechanism for in-context learning in transformers. During training, three events coincide in what looks like a phase transition: a sudden drop in training loss, the model gaining the ability to do in-context learning, and the formation of induction heads. Modifying the architecture to promote the formation of induction heads brings the phase transition earlier. Additionally, removing induction heads at test time reduces the model's ability to do in-context learning.
Researchers have proposed a hypothesis that induction heads are the main mechanism for in-context learning. To test this, experiments were conducted on Transformers with up to 13 billion parameters, which showed a phase transition where induction heads form. It is unclear if this hypothesis explains in-context learning in general, but the evidence is compelling.
The paper presents two classes of models: small models with one to six layers and twelve attention heads per layer, and large models with four to forty layers, ranging from thirteen million to thirteen billion parameters. The small models come in two versions, attention-only and with fully connected layers, while the large models all have fully connected layers.
An induction head is an attention head which exhibits a type of completion-algorithm behaviour. For any two tokens A and B in a sequence of the form ... A B ... A, the head attempts to influence the prediction of the next token towards B. It does this by looking for the previous occurrence of the current token and predicting, or trying to influence the prediction, that the next token will be the one that followed the current token the last time it occurred. Fuzzy versions of this are also possible, where similar tokens are used instead of the exact same ones; this shows up in tasks such as translation. The head contributes to the overall sum that produces the prediction almost entirely through the value associated with B.
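To make this completion pattern concrete, here is a minimal Python sketch of the algorithm an idealised induction head implements over a token sequence; the function name and example are illustrative and not taken from the paper.

```python
def induction_prediction(tokens):
    """Idealised induction-head completion: for the current (last) token,
    find its most recent earlier occurrence and return the token that
    followed it then. Returns None if the token has not appeared before."""
    current = tokens[-1]
    # Scan earlier positions from right to left, skipping the current one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # the token that followed it last time
    return None

# On a sequence of the form ... A B ... A, the prediction is B.
print(induction_prediction(["the", "A", "B", "went", "A"]))  # -> "B"
```

A real head only biases the model's output through its value vectors rather than deciding the prediction outright, so this is the fully literal version of the behaviour.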
Researchers measure the formation of induction heads by observing a change in the prefix matching score of particular attention heads. This score is a proxy for whether the head implements the induction algorithm: the head must detect repeating patterns in random sequences of tokens, rather than just memorising which token comes after another, so the score measures the head's ability to perform the algorithm in general.
The prefix matching score (PMS) is a quantity used to measure attention heads in Transformers. It is computed by generating a random sequence of 25 tokens (excluding the most and least common tokens), repeating it four times, and prepending a start-of-sequence token. Attention scores are then recorded at each token, and the PMS is the average attention the head assigns to the tokens that followed earlier occurrences of the current token. The start-of-sequence token matches how sequences are presented during training, and visualisations of where a particular head directs its attention show that, when the current token has not appeared before, induction heads tend to attend to the start-of-sequence token instead.
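As a rough illustration (not the authors' implementation), such a score could be computed from one head's recorded attention pattern roughly as follows; the token pool, exclusions and exact averaging in the paper may differ.

```python
import numpy as np

def prefix_matching_score(tokens, attention):
    """Simplified prefix matching score for a single attention head.

    tokens:    list of token ids, e.g. a random 25-token block repeated
               four times with a start-of-sequence token prepended.
    attention: (T, T) array where attention[q, k] is the attention the
               query at position q pays to position k (each row sums to 1).
    Returns the average attention that query positions assign to tokens
    which immediately followed an earlier occurrence of their own token.
    """
    T = len(tokens)
    per_position = []
    for q in range(T):
        # Positions k whose *preceding* token matches the current token:
        # these hold the token that followed an earlier occurrence.
        targets = [k for k in range(1, q) if tokens[k - 1] == tokens[q]]
        if targets:
            per_position.append(attention[q, targets].sum())
    return float(np.mean(per_position)) if per_position else 0.0

# Hypothetical usage with a repeated random block and a recorded attention matrix:
rng = np.random.default_rng(0)
block = rng.integers(1, 1000, size=25).tolist()
tokens = [0] + block * 4                      # 0 plays the start-of-sequence role
attn = np.tril(np.ones((len(tokens), len(tokens))))
attn /= attn.sum(axis=1, keepdims=True)       # uniform attention over the past
print(prefix_matching_score(tokens, attn))    # low score: no induction behaviour
```

An ideal induction head would score close to 1 on such repeated sequences, while a head attending uniformly over the past scores much lower.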
An attention head is parameterised by weight matrices that convert an entity's vector into a query, key and value; each head has its own learned query, key and value matrices, and the heads that operate in parallel make up an attention layer. When computing the prefix matching score, the most frequent and least frequent tokens are excluded from the random sequences. The authors do not explain why, but a plausible reason is to cover the bulk of ordinary tokens and force the head to rely on a generic algorithm rather than special-purpose behaviour tied to very common tokens.
In-context learning refers to a model getting better at next-token prediction the further into a sequence it is. A score for in-context learning is calculated by subtracting the average loss at the 50th token from the average loss at the 500th token of 512-token sequences, so a more negative score indicates stronger in-context learning. The authors claim that the arbitrary choices of the 50th and 500th tokens do not affect the score much. It is not stated explicitly what the test data is, but the speaker's reading is that the score is computed on natural language data; when randomly generated sequences are used, it is usually to compute a particular metric, such as the prefix matching score.
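A minimal sketch of this score, assuming per-token losses have already been recorded for a batch of 512-token test sequences (the function name and indexing convention are illustrative):

```python
import numpy as np

def in_context_learning_score(per_token_losses, early=50, late=500):
    """per_token_losses: (num_sequences, 512) array of next-token losses.
    Returns the mean loss at the `late` token minus the mean loss at the
    `early` token; more negative means stronger in-context learning."""
    losses = np.asarray(per_token_losses)
    # Convert 1-indexed token positions to 0-indexed array columns.
    return float(losses[:, late - 1].mean() - losses[:, early - 1].mean())
```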
The authors observed a sudden change in three metrics, the in-context learning score, the prefix matching score and the training loss, occurring at the same time for models with at least two layers. The orange window on the graphs indicates the same period for each metric. The middle graph shows one curve per attention head, with the induction heads jumping up significantly; its y-axis indicates the average attention paid to the matching earlier tokens.
Attention scores over all previous tokens in a sequence sum to one, so a prefix matching score of around 0.5 means the head assigns, on average, half of its total attention to the matching tokens. Being an induction head is not a binary concept but a continuous scale between zero and one, with induction-headedness being one part of the algorithm a head uses to compute its result. The first graph in the sequence shows the smallest model, which has a single attention layer and is just doing one attention calculation.
In models with a single attention layer it is difficult to perform algorithms that look at the earlier token matching the current token and the token immediately following it. When comparing three-layer models to two-layer models, the phase transition is noticeably softer in the former. Plots of the prefix matching scores, in-context learning scores, and training loss over training for small and large models show a sudden jump at a particular point. This jump is visible in the loss and in-context learning scores, and the heads most strongly exhibiting induction-head behaviour tend to be in earlier layers of the model.
Researchers have observed a sudden change in the loss of a model when trained on language. This is seen in multiple measurements, such as the derivative of the loss with respect to the token index and the trajectory of the principal components of the per-token loss. Interestingly, the in-context learning score for all models, regardless of size, is approximately constant after the phase transition. This suggests that the advantage of larger models over smaller ones is already evident early in the sequence and does not grow further along it.
Model size does not appear to improve the ability to learn from context according to the 500-minus-50 test. Bigger models are still better at predicting the next token, but the difference in loss between early and late tokens is relatively constant. This suggests that more sophisticated strategies, such as looking at the previous two tokens instead of one, do not lead to better scores on this in-context learning test. This is perplexing: for example, if a problem is specified early in a sequence and its solution appears late, a model that knows how to solve the problem should make a good prediction of the 500th token, and indeed the in-context learning score does keep decreasing over training.
Researchers have observed that increasing the number of layers of a model improves its performance on in-context learning tasks. This is likely due to the formation of induction heads, which in the authors' view are more or less synonymous with the emergent capability to do in-context learning. The only models that do not form induction heads are one-layer attention-only models, which are not expressive enough to do so. The plateauing of the in-context learning score might be sensitive to the choice of the 50th and 500th tokens in its definition, although the researchers played around with these numbers and found that it does not change things.
The speaker suggests that the algorithm the experiments hypothesise for the Transformer, look at the previous occurrence of the current token and predict the token that came immediately after it last time, may be too specific, and that other types of induction algorithms could be just as effective, such as matching on the second-to-last token and predicting the token that came two positions after it. The experiments have not ruled out the possibility that these other types of induction heads are also present.
Researchers also observed induction heads whose behaviour goes beyond the literal algorithm. Separately, they modified the architecture of models to make it easier for induction heads to form, adding a mechanism that lets each attention head learn how best to mix a token's key vector with the previous token's key. This mechanism moves information from one token to the next, providing the kind of mapping the induction algorithm needs.
Researchers have introduced this mechanism to one-layer attention models, allowing induction heads to form. Because a token's query and key are often similar, a token readily pays attention to matching tokens; by incorporating a bit of the previous token's key into each key, a head that matches the current token effectively attends to the position just after its previous occurrence. At test time, induction heads are removed in a somewhat complicated way to try to attribute how much of the in-context learning each head was responsible for.
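A minimal PyTorch sketch of this key-mixing idea as described in the talk, where the new key is an interpolation k'_t = alpha * k_t + (1 - alpha) * k_{t-1} with one trainable alpha per head; the module name and the sigmoid parameterisation of alpha are assumptions, and the paper's actual implementation may differ.

```python
import torch
import torch.nn as nn

class SmearedKeys(nn.Module):
    """Mix each token's key with the previous token's key, with one
    learned mixing weight per attention head."""
    def __init__(self, num_heads):
        super().__init__()
        # One trainable logit per head; sigmoid keeps alpha in (0, 1).
        self.alpha_logit = nn.Parameter(torch.zeros(num_heads))

    def forward(self, keys):
        # keys: (batch, num_heads, seq_len, head_dim)
        alpha = torch.sigmoid(self.alpha_logit).view(1, -1, 1, 1)
        # Shift keys right by one position; the first token keeps its own key.
        prev = torch.cat([keys[:, :, :1, :], keys[:, :, :-1, :]], dim=2)
        return alpha * keys + (1 - alpha) * prev
```

With keys mixed this way, a head whose query matches its own token's key effectively attends to the position just after an earlier occurrence of the current token, which is exactly the induction pattern; this is how even a one-layer attention-only model can form induction heads.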
Researchers have identified induction heads as potentially responsible for in-context learning. When these heads are removed, the model's ability to perform in-context learning is reduced. This suggests that the heads are responsible for in-context learning rather than some other mechanism. However, more research is needed to fully understand what the heads are doing and how they contribute to in-context learning.
Researchers studied which heads were responsible for in-context learning in small models using the prefix matching metric, finding that heads with higher scores accounted for more of it. However, they concluded that this evidence is only suggestive for the models with fully connected layers (the MLP case), and suggested that the algorithm may be distributed across multiple heads, which would make it harder to detect.
Researchers looked at large models with 13 million to 13 billion parameters and observed what certain induction heads were doing. They identified the induction heads using the prefix matching score and observed how they allocated attention over the same sentence written in three different languages. An induction head effectively applied a translation to the sentence before matching, showing that it was doing more than literal induction, even though the test used to detect it was the literal version; the same head therefore generalises across the two behaviours.
It is remarkable that a head identified as an induction head also scores highly on a metric for the more sophisticated behaviour, suggesting it is doing something more complex than literal induction. This is supported by a synthetic pattern-matching task where the model has to assign labels to combinations of months, colors, animals and fruits. It is unclear whether it is the part of the head that behaves like an induction head (say, the "80%") or the remaining part that is doing the something else.
A narrow definition of induction heads refers to an algorithm which matches repeated sequences of random tokens, and their prefix matching score demonstrates that they do this. However, when presented with more complex data, the same heads can perform a more general version of the algorithm. One possibility is that the layers immediately before the head map vocabulary from one language to another, like a dictionary, after which the head can perform its induction-style task without worrying about word ordering or translation.
all right so it's our pleasure to have Rowan formerly a master's student at the University of Melbourne now neural Wrangler of granular materials to tell us about induction heads thanks Ron no worries uh okay so welcome everyone um in this talk I'm going to be looking at uh I'm going to just basically going to be summarizing the paper uh in context learning and induction heads um from Olson at Al and anthropic um uh this paper discusses something called induction heads which I'll Define just in a little bit um which are a sort of I guess a proposed emergent structure in Transformers which Implement a particular algorithm um and the main claim of this paper is that these induction heads constitute the main mechanism I guess um the main mechanism for in context learning in transformance um and they break their their uh I guess argument towards yeah or the evidence supporting this claim into into six uh sub-arguments uh I'm only Really Gonna carry only going to really cover the first one in detail because I think it's the most interesting but I'll briefly run through some of the other ones as well um okay so the first one is that we'll go into more detail in each of these as we move on and talk about exactly how they're detecting these things and measuring induction heads and things like that um but basically uh the sort of most interesting point I think here is that during training um three three events coincide in what looks like some sort of phase transition of a transformer model so we have a sudden drop in training loss uh the model gains the ability to do in context learning um as measured by a particular metric and these things called induction heads uh appear to form right uh the next major thing is uh when they they mess around with the architecture of their model to promote the formation of induction heads then this uh this brings forward that phase transition earlier in training for models that are able to form induction heads and in models that are not able to form induction heads uh it it uh it gives them the ability to form induction heads and they they go through this phase transition um the other ones so uh they do a thing with a removing induction heads during test so modifying the um the way the model Works after it has been trained um and they see that this uh reduces the ability of the model to do in context learning um and then uh I guess I'll quickly go over the last four points although we won't talk about them too much um we have this sort of so they have
observed kind of induction heads doing a photo things that are induction heads doing a modified more sophisticated version of their algorithm um which I'm going to call fuzzy induction heads these will make a bit more sense when we actually talk about it um and uh the last argument I'll talk about is uh the uh reverse engineer induction heads in small models and Echo upon doing that are able to show how they contribute to in-context learning um so when they say it's they think induction heads constitute the main mechanism for inter context learning that they really are they truly making it the sweeping claim that this explains in context learning sort of in general or I mean surely they can only really have evidence about these rather toy data sets and small Transformers or are they actually data particularly point one they they did use large models so they I'll go over the architectures in detail but so for for one they did these experiments on um Transformers up to 13 billion parameters and You observe the same phase transition um the there seems to be induction heads forming but the picture's a lot more noisy than than um small models and I guess it's more like they're not claiming to necessarily answer this this hypothesis um but I guess this is the hypothesis there proposing they're supplying evidence for this hypothesis um and I guess we can be the judge of whether or not we believe it or not sort of maybe skip ahead a bit do you do you think it's more like maybe in context learning is not one thing and maybe there's rich in context learning and kind of basic in context learning and that induction heads are maybe like a sufficient or Nest maybe a necessary part of the most basic form of in context learning but maybe doesn't explain the the complete phenomenon or how are you thinking about it after looking at the paper yeah I think I guess I'm still figuring that out something along those lines could be is is is maybe is maybe it I think what I'm more thinking is so I think that evidence for um point one sorry point one is is quite compelling um that there's this point at which the model gains the ability to do in context learning and this this sort of this phase transition is observed in multiple different ways of measuring of you know of measurements you can take of the model including this this um this measurement which I'll go through in a bit um for how they're sort of claiming to identify induction heads I think like you know you'll see when I put up the
definition of induction head that it's really not super well defined um exactly what an induction head is um so I think I think maybe the claim that will induction heads do in context learn like as stated you know as we're going to Define induction heads the the ink that induction heads do the majority of in context learning is sort of obviously wrong right like they can't do translation as as we're going to Define them but then there's this sort of fuzzy Point here so I think probably what's going on is that you know you have these sort of things that are sort of somewhat induction heads or or kind of induction head head like that all form at once at the same point during training for large models and for small but you know they form at different points depending on the model but they all form at the same time um for you know large models and small models um and at the same time it gains the ability to in context learning um how good that is depends on the model and I guess the type of structures that form depend on the model um but but yeah I guess I don't know if like induction heads is the right way to sort of scale it or if it needs to be a more complicated picture cool thanks okay any other questions there's six points in the in the paper if you've if you've skimmed it the last point is that you can extrapolate from small models to large models um which yeah I don't know I don't really have anything to say about that um cool any other any questions before we move on about these these points I'll sort of go into the into the details on this um in you know in just a bit um the next thing I have in my notes is a quick run through of the model architectures uh so they break down their models into so they're all they're all GPT style um Transformers using multi-head attention um and they break down their models into two classes so we have small models um these have between one to six layers um and 12 attention heads per layer so that's like the number of attention heads in the in the multi-head attention calculation um uh each have each of these one to six layer small Transformers have two versions um attention only and um I guess uh with fully connected with fully connected layers and then we also have large models um and this is a similar story but they're bigger um they have uh they all have uh fully connected layers so they're all sort of a more normal Transformer um and they range from four layers with 13 million parameters to 40 layers with 13 billion parameters
foreign and they only do some of these experiments with so they don't do all of these experiments with all of these models um I'll I'll mention where they only do it with small models um if I remember um but I think the most interesting experiments they do do with all models and they do observe interesting results with with uh with all models um okay so now let's talk about what a induction head is and how we detect them so um I'm going to write down definition but it's a bit fuzzy so an induction head um is an attention head which exhibits um a type of a type of completion algorithm Behavior so for any tokens A and B and a sequence of the form a b so some stuff at the start a b a gap and then another a so a is the the token that this this head is looking at um it influences the next token prediction to b b right so an induction head performs an algorithm which looks for the previous occurrence of the current token and predicts or tries to influence the prediction that the next token will be the one that followed the current token the last time it occurred right um any questions about that before we keep going cool um you can consider sort of fuzzy versions of this as well so where instead um of having the exact same a and the exact same B here we have like a similar a and then in Us in the same way a similar B um and you can sort of do a fuzzy induction so the examples they use with this uh like translation tasks um or or something along those lines we'll maybe talk a little bit more about that that later um can I ask before you continue um so when you say influences the next token prediction so each head I guess does that by by adding uh to the value that's sort of received at a given uh entity but this particular head it basically just only predicts this or is it in I mean yeah how does influence work I said I'm saying influence here so they say it predicts that the next token is B okay but like to me that doesn't make sense because like the head doesn't happen my head does any prediction right like it happens way down the line so the head contributes to the overall sum by almost entirely contributing the the value B which goes in with all the other heads to actually make the final prediction but this particular head basically is just obsessed with this yeah yeah so so they break this down into sort of it like a head doing kind of two behaviors um and one is like where it's kind of focusing its attention so if it's its attention is mostly on this B and and
this is what leads into their way of Measuring Up and the second is um that it performs a sort of a copying um behavior um which they don't I looked so they mentioned this in an appendix that this is part of what's going on and it seems to form a part of the previous paper but I don't see where they're measuring any sort of copying um when detecting induction heads so really I'll mention this in I'm about to Define how they're going to measure induction heads but really what they're doing is they're looking at where this head is directing its attention um and the uh they're looking for heads that sort of focus on the token which is after um the previous occurrence of this token that's it got it um which is I guess which is not actually a perfect measure for you know something that is performing this algorithm it could be focusing its attention on it but but doing another type of computation so um so that's right yeah so it's more like oh yeah F of B like it's predicting some function of B but your point is that it's a function of B where B is this yes yeah yeah yeah and yeah and another thing so I hope I've made this clear this is like an algorithm right this isn't so the head should do this for any sequence of tokens like this it shouldn't do it for like it's not it's not memorizing that you know after um you know Michael comes Jordan right it's it's performing an algorithm even on random sequences of tokens um so that's the sort of the the main point of this um um so could you clarify about the algorithm um is is uh a and potentially B is either of those um fixed for this individual head or as a head that like looks at whatever the current token is and um then goes and finds the previous occurrence of that token so with the same head be able to apply this algorithm to Michael as well as to any other um yeah yeah so precisely the last thing you said so so the way they measure this is they Generate random sequences of of tokens um that you know exhibit some sort of repeating pattern so they can they can measure it um and induction heads need to be able to identify these these repeating patterns um so yeah it's not just about memorizing cool um okay so they measure the formation of induction heads using a a proxy called the prefix matching score so really this is what what everything is being observed in when they say in the paper they're observing induction head form really they're saying they're observing a change in the prefix matching score of particular attention heads
um so maybe this should be you know I guess this is where I'm thinking maybe an induction head isn't the right notice uh right notion going on here but this prefix matching score is an interesting quantity to measure um so given an attention head it's uh prefix that's not how you spell prefix it's prefix matsync's matching score so I guess I'll call this pmf um PMS um is computed as follows so first of all generate a random sequence of 25 tokens um they exclude the most and least common tokens from from this uh this pool of random tokens they're drawing on um repeat that four times and append start of sequence token um or as the start I suppose uh and then record the attention scores of each uh record the attention scores at each token right then the prefix matching score is the average attention assigned I guess you should write this down matching score is uh the average attention assigned to the token proceeding the current token in earlier it repeats okay [Music] um yeah this is the clearest statement of this uh score I could find in the paper I think um yeah it wasn't it wasn't sort of obviously so this was buried in um in an appendix uh and I think I suppose it is clear enough um but but yeah I guess it's a bit confusing yeah um right what's the what's the point of this starter sequence token I mean the training let's see this so all all sequences coming into these models have a startup sequence token at the start oh okay sure and they found this didn't work very well and they didn't put the startup sequence token on um which which was yeah sort of interesting and when the if you so they have some nice sort of visualizations where you can look at where you know when you're looking at a particular head sorry a particular token where some head is directing its attention and when that token hasn't appeared before so there's no opportunity to do this sort of induction head algorithm for an induction head right um the uh the induction head tends to look at the starter sequence token instead uh yeah okay that makes sense yeah it's actually funny Transformers sort of often use special tokens as like routing or like to do some kind of breaking of ties and such things so I guess that makes sense okay so you pre-prepend the start of sequence token and yeah all right uh yeah maybe it's worth saying briefly what a hit is because we haven't covered that in the previous seminars so we talked to previous seminars when we were describing Transformers in terms of a query key and value
weight matrices that convert a vector which is an entity to query keys and values and then attention is based on those and heads are just the way of parameterizing those qkv matrices so for each head you have a separate qkv Matrix that is a separate learned transformation that takes an entity and gives you a query key and value for you to do attention yeah yeah so our head is just something that does a attention calculation yeah um and then all the heads which have which work at the same time that's referred to as an attention layer I think I mentioned attention layers before okay well this is a sense yeah it's a sensible metric yep um cool okay so uh the other thing I want to talk about as well is before we move on to some results is how they're measuring in context learning um so oh yeah is it is it important when you said they they exclude um you said the most frequent and least frequent tokens is that an important thing or is it just a technicality I don't know um they mentioned they do this and they don't explain why do you understand what when you say the most like is it just the word the or I'm not sure if that's the most dug in but like is it just a single token that they remove or is it like some or something like that I read it as something more for second but I don't need more detail okay I guess it's because I a hypothesis that it made me think of is maybe like um maybe when things come up very frequently they get their own kind of special heads that are dedicated to those tokens and so maybe they're trying to um Force uh you know you can't afford to have a special head that handles um what comes after all but the most frequent or most important almost you know a small number of tokens so if they want to kind of isolate the the masses of tokens where you have to have these generic algorithms then they might make sense to tune those out I'm not sure about the least common tokens but um yeah yeah why I would expect you you do something with the most common something I wondered yeah with the most common is what you're actually doing is you're like if you're thinking in terms of the use of these tokens in language a lot of these are probably going to be things like you know sort of words yeah so it's often going to be attending to those sort of words for other reasons yeah um you know connectedness with you know some pronoun is referring to a particular person or something like that um uh but yeah but I don't know they did they didn't actually they didn't really
explain it [Music] um yeah [Music] X okay um so let's talk about measuring in context learning briefly so as we've talked about in previous Seminars the idea of in context learning is that the model should get better at next token prediction the further into the sequence it gets um so we Define um a measurement of this uh the in context learning score um is uh of a model is the average loss of um the 500th token minus the average loss of the 50th token for 512 token sequences test sequences um so note that a more negative in-context learning score should demonstrate a higher level of in-context learning um the authors claim that the arbitrary choices of 5 550th tokens don't change things much um I kind of think it's probably wouldn't be too hard to come up with something um better maybe they've got reasons for for not doing this but for example you could just average tokens the loss of tokens near the start and near the end and look at that difference or you could do something like look at the gradient of a line of best fit to the Token index versus loss graph both would seem to do similar things to me but um but not rely on you know something going weird or not not be sensitive to something going a bit weird at the 550th tokens um but that's how we're calculating it in in this uh in this one so just be clear you talked about two kinds of experiments at the beginning one were these this range of model sizes on natural language data and the other kind of synthetic sequence examples hey Europe the this score is applied to both hmm because it I don't know I mean certainly the the score here would depend I mean if you're just repeating sequences or something of course it depends on how long they are or something I feel like you want us to think about this measure mainly in the situation of the this range of models on kind of natural language to other right yeah I guess they didn't ex I come to think of it I either didn't see or they didn't explicitly mention what their sort of testing data was um but I don't didn't read it as being on on randomly generated data generally when they're using randomly generated like sequences of tokens in these models it's to compute a particular type of metric so the income the um the uh this uh what do they call it the prefix matching score um users you know they have an algorithm which involves generating a random sequence of tokens um and there was a there's another metric which I won't talk about which involves generating random sequences of
or selecting random tokens uh from from natural language um I assume if they didn't say otherwise that this should be based on on natural language um yeah yes I guess I could yeah maybe it's worth checking that but um that was my my interpretation um okay so any other comments or questions before we move on and talk about their first phase change um argument okay so let's go take a look I've got some figures here oh I timed perfectly um so let's go take a look at some of these figures here on on these three boards um so here we have uh the evolution of these various um these measurements throughout the training of um some of the small models that they have so on the first one we have the in-context learning score evolving over training where training is measured in sort of number of training tokens consumed um on the second board we have the in the prefix matching score for all of the attention heads of the model evolving over training and on the third board we have the the normal um training loss uh evolving you know as it changes over training um so and as you can see uh this is very nicely highlighted in the graphs we have this Sudden Change occurring in all all three of these metric measurements at the same time for um for models which have at least two two layers um and above uh I'll note that so for for a particular model um the orange window drawn on the graph is the same for each of these so so the window in the loss graph is the same as the window in the in context learning graph and the induction heads graph um yeah uh and I think this is yeah I guess this is sort of pretty interesting right um one thing I'll point out so to explain the middle graph a little bit more um so here we see multiple curves so we have one for each attention head um and I guess the interpretation of the authors here is that the induction heads are the ones that that that change that jump up quite High uh and that the other heads are doing something else um so uh you know not obviously not the whole entire model won't consist of only induction heads then some other stuff is going on in this model um but but we have induction heads forming during this during this phase transition um sorry let me just try and figure out what one means on the y-axis so you said the prefix matching score is the average like attention paid to an earlier token is that right and one would mean that that is the only like it's is it the entropy of that distribution or it's just the the so it's just the like the
percentage of the attention that is paid to the appropriate token yeah yeah so like attention attention scores the total attention is is sums up to one right allocated to all previous tokens yeah uh and this is the attention so I guess it's in theory it's attending to If you're looking in the last group of so the last repeat of this random sequence it's attending to if it's a genuine uh induction head it's in theory attending to three other tokens right in each of the three previous repeats yeah and it adds all of those together and and takes the average of them okay so if I see if I say 0.5 like I do for maybe two of these heads it's saying that I mean it's certainly doing that right like on average uh some maybe maybe since we don't have any particular reason to think any token is special it maybe means that at every position in that repeat of four times of the sequence half of the attention is on that the correct token from the point of view of the pattern and then it's paying attention to other things so maybe it's doing more than just uh what the induction head story says but at least that's a big part of what that hit is doing is that how you understand it yeah yeah that's definitely how I understand it um and I'll send a just after this so I just um once we've talked about these graphs I'll send uh the same graphs for the largest for the large models and you'll see for large models you get way more stuff Happening Here Right and yeah so um you know there's sort of calling something the idea of calling something an induction head or not an induction head is maybe not the right idea you know something is either like you know maybe it's like 80 an induction head or it gets a you know it's it's like a continuous scale right something can be anywhere between one and zero on its on its level of induction-headedness right so the induction is like some tool in the toolkit and each head computes I mean each head is maybe like some algorithm and induction headiness is like one part it might use to compute whatever it's trying to compute or something yeah certainly yeah um one thing I'll point out so in each of these the first graph which is boring and not colored in um is uh the the one layer the smallest model with one with just one attention layer so it's basically just doing an attention calculation for all the tokens and then trying to well it's doing the encoding and then decoding steps but it's just doing one attention calculation um so we shouldn't expect induction heads to
form in one one layer attention only models um uh this is because well if a model only has a single attention layer it can only use the actual value of the input token or some you know function of that to compute the attention score but induction heads need to perform an algorithm which look at the the token matching the current token and then the token immediately following it right so the only way for this to occur is for a previous attention layer to have encoded the fact that you know B follows a when it's looking at at a right in the in the updated embedding of of of uh of b um in the bigger models is it I mean it's a little concerning that the sharpness of the phase transition is significantly softer at least according to the second board here for the three layer versus two layer I mean if you start taking four five six seven maybe just there isn't a phase transition at all for the the bigger models is it still the case that there's a discernible actual transition yeah I'll just send this to SLT now um yeah thanks I didn't it's like a long it was more difficult to expect from the paper um so uh yeah if everyone wants to take a look at Discord I've sent a new plot um so these are for the these are analogous to my first to the first board um so the Inca in context learning score is on oh sorry no that's another second board so the prefix matching score is on the first access um one thing to note is that in the small models the graphs for the small models the uh the training the x-axis so the training time is on a linear scale and for the large models it's on a log scale so um there's been you know this this graph has been rescaled in some way um but yeah again you see you see this sudden this sudden jump and you see it with the with the loss and with the um with the with the in context learning score as well um so I I don't think it says um on that on that figure I sent so on the leftmost we have the 13 million parameter model and on the right most we have the 13 billion parameter model um and yeah things get messy but uh you still see something happen right at uh at this particular point hmm yeah a lot cool um another thing to point out so the way that's colored as well is Orange is so the the scale from Orange to purple is um earlier layers are orange and later layers are purple um so you can see that the stuff the things that are like very much likely you know that are strongly exhibit induction head properties tend to be earlier layers in the model
um which is yeah I guess interesting to to note um so maybe this is sort of higher level intuition might be something like just like con convolutional neural networks will in early layers have Edge detectors and you know texture detectors that patterns like this are kind of like the textures of language or something yeah yeah that that's that seems reasonable um Okay so what else do that there was some other interesting things to say about this um so I haven't put any figures up and I won't go into detail on these um but this sort of event this Sudden Change uh is observed in other quantities that they look at as well so it's observed in the derivative of the loss with respect to the Token index um and it's observed in this sort of funny measurement they call uh it's observed in the trajectory of the principal components of this quantity called the per token loss um I don't think her token loss gives you a good idea of how to compute that um but I guess the point is is it's not just these you know there's about five or six measurements that they're taking um which observe this phase transition at the same time um the other thing I think is really interesting and they've sort of just put it in there like right at the end and you know um is that the in context learning score for all models so for small models and large ones like toy models up to 13 billion parameters is approximately constant after the phase transition and it's also the same for all models like equal across the models um which is really weird right so the large models are still much better in absolute terms but this this observation is saying that their ability to learn uh to learn from Context is the same for all models right so um in other words the the advantages that the the the um the large models have over smaller models is occurs early on is is evident early on in the in the sequence and doesn't improve more than than the smaller models improve um I think I'm not following you so the the in context learning score you're saying that it plateaus roughly speaking after this phase transition uh Plateau is at a higher level for the bigger models than the lower models but they no no so so here so if you look at the the first the first graph here we have in context learning score finishing up at about you know just below negative 0.4 for all models for these two models and this is also true of every model they all Plateau at about just less than negative 0.4 oh so that's saying then that certainly when you improve the size of
the model it it gets better at prediction but whatever improvements are happening are not due to in context learning past this point is that what you're saying yes I guess I guess the way I would say it is making the model bigger doesn't improve its ability according to this metric doesn't appear to improve its ability to learn from Context hmm interesting okay so to be clear the bigger models are still way better like at actually doing next token for them um but you know the difference between the first and last token uh are still relatively constant which is weird so um so so it's like they they because when you when you think about sorry Matt I think you cut it out a circaded one so it's kind of saying they aren't noticing any other mechanisms emerge that are more sophisticated in as much as like they would allow you to get a better score on the 500 minus 50 test of in context learning um because you could think well what if you could do um what if you could do trigrams rather than bigrams like what if you could look at the pre the previous two tokens rather than just the previous one token um or you know any number of other strategies for um propagating patterns um depending on context so they're saying that wow I wonder if I wonder if that tests that 500 minus 50 it really matters if that is like how much that relates to what we think of as in context learning for us to read anything into this yeah I don't understand this at all I'm very perplexed so take the example of specifying arithmetic problems so the problem itself occurs within the first 50 tokens say and then the solution is later I mean maybe this you get the idea maybe it doesn't quite fit in this number of tokens but that sort of format the early part where you're specifying a problem is is basically very high entropy right you can't predict the next token of the question so you just have very high loss but presumably if you actually know how to solve the problem you can make a good prediction of the 500th token if that's where the solution is so learning can actually solve that problem does seem to actually mean that this into context learning score well that that actually makes okay so maybe I was reading this wrong uh in context learning score you would actually expect to keep decreasing right if that's the case uh with respect to what as of a training so is that right so I mean it is decreasing over training and if yeah I don't know how does this fit with the fact that you have phase transitions
where certain abilities come online and you're able to say do arithmetic that seems like it must mean this score is decreasing so it's not in the samples they use surely like if you if you tried it on those samples maybe you would see an improvement okay so this is like this flatness seems to be more about the data distribution than yeah I feel like I'm want to push back a little bit against the interpretation that bigger models don't do in context learning better or something or else seems to me like a very strong claim to make and maybe yeah typically they're not making this claim or observing this behavior in their in contact yeah okay [Music] um so that that you could say that they're sat you could maybe think that that their method of testing in context learning is saturated yeah by the biogram um predictor induction head solution well yes I mean the in context learning score doesn't doesn't refer to induction heads though yeah I think Matt means that their their data set sort of has that level of difficulty and no more so you sort of once again all right right yeah yeah okay uh yeah cool okay um cool so this third plot is they want this to say that uh something like when you increase the number of layers the reason it does better is in large part the formation of the induct induction heads or what's in their view more or less synonymous the uh emergent capability to do in context learning and this is why depth helps so is this is this the lost one the last one yeah uh no I think I think their point is to highlight that something is something big is changing in these models uh it's just to point out that it's easy to see the Divergence so that you can even just looking at the loss sort of be like aha there that's where the induction heads happened yeah and the only models except where they mess around with architecture the only the only models that don't do that don't form induction heads are one layer attention only models so they're just not expressive enough to to do this um you know they're very small um shall we move to the next set of board so we want to run on that one yeah let me just double check uh oh there was something oh um I think so Matt you mentioned that maybe this is sensitive to um the 50 and 500 so the the plateauing thing just took briefly mentioned so you mentioned maybe it's sensitive to the 5500 token definition in the in context learning score um they also mentioned this and say that they played around with those numbers and it didn't change things
um for what it's worth okay that makes sense RFA that my point was more like uh what is the what are the tokens rather than that you're measuring the 50th and the 500ths like if it's if the if the stream of text the stream of tokens is kind of not sophisticated enough that you get any advantage by doing anything more sophisticated than these induction heads that was my point yeah okay cool let's uh let's go take a look at I mean I'm actually gonna need to write my own boards let's stick with this one here and maybe we can yeah because it'll be useful to sort of maybe perhaps a bit for a bit longer um okay so now I'm going to quickly run through uh some of their other arguments um I think these are I think these um are less strong than their first argument in my opinion um so uh the first one I'll briefly mention yep sorry to interrupt um maybe now's a good time though because I I have um I have like another [Music] interpretation and I'd like to voice this so so that I can run it against um the other experiments and see if they clarify it for me but like if if uh if you give if you give me these tasks and ask me to write an algorithm for it then the one that they're kind of hypothesizing the Transformer is following with you look at the previous token and then use that to whatever came one token after that last time and you use that to predict what's going to come one token after it next time um did you say that the the patterns were kind of so simple that like they're set up so that this will work when when they are um anyway it doesn't matter I guess my my question is like well did they look for one once removed induction heads where you're looking at the second to last token and the last time the second to last token um appeared and the token that came two tokens after that and you use that tool um detect the the next one um that's like that's almost as good like it's it's only it only fails around like the edge cases where you can't use it to predict the second token in the sequence um because there's no second there's no token two before that um so but for all the other ones it works just as good so yeah it seems to me like they're searching for one specific type of induction algorithm and it could be that you know for all of their evidence you've shown so far they could have shown that these one-step induction heads appear but they haven't ruled out that other types of induction heads or other types of heads appear at the same time as well and um yeah so I'm I'm wondering if this is
like evidence that it's this and nothing else um yeah um I can I agree with that yeah I think it's uh I mean so they do they do observe induction hit so I guess it is a different way in which they could be slightly not capturing the full extent of things is they observe things um which they identify as induction heads which are doing things that are big go beyond this algorithm um which I'll talk about in just a moment as well um but yeah I think you're probably this is certainly consistent with with the experiments that they presented here um yeah so as I said at the start I think this didn't convince me that like induction and induction head is the right is like the definitive thing to be caring about um but it did convince me that that some stuff is you know that there's a particular point where it gains the ability to perform some algorithms possibly ones they can't capture um and also this coincides within context learning things like that um any other questions or comments before we discuss the other arguments which will be briefer cool um so the next one is facilitating induction heads uh so I'll briefly mention this one so they modify the architecture of of models to make it easier for it to form induction heads um so more specifically they add a mechanism which allows it to mix to learn how to best mix up key vectors in the same attention head so they you know if this is our new key at token index T at head H we have a trainable parameter alpha alpha um which interpolates between the sort of key computed in the usual way and the previous key um uh as I said Alpha is a trainable parameter here one for each head and when this is introduced the phase change in the previous um in the previous examples occurs earlier for models with more than two layers um and so what's the T here this is the entity index yeah token index okay but what's this got to do with induction heads if it's only paying attention to the previous key do you mean like the previous occurrence of that I don't know like this is just the the token immediately preceding this one right uh yeah actually that's a good point um oh no I see that point so so so say we've got a sequence uh of the form a b a right the head the head this is relevant to is the head that is looking here right and this mechanism allows us to move data about a into the encoding of B yeah I see um so I guess yeah provide this sort of map um which is yeah precisely why the fact that you can't do this you know in a one layer attention
But when you do introduce this mechanism, you do see it in one-layer attention models. I'll point out this is only for small models. Maybe I'd give it a slightly different interpretation: it's easy for a token to pay attention to itself, right? If you see an A, the query vector for A is of course maybe quite similar to the key vector back there on that earlier A, and... okay, maybe I don't understand; I think I'm still a bit confused about why this helps. Yeah, I didn't look in extreme detail at this precise mechanism, so I'll just say they claim it makes it easier for induction heads to form, and it certainly seems to according to their metrics, and it moves the phase change earlier in training. Sure, but is it t minus one? It's not t plus one? Maybe it should be t plus one; I could have copied that down wrong. Let's think. What I'm thinking is: suppose query and key vectors are just equal. They're not, but paying attention to yourself is actually something these models do. So suppose that when you see an A and you're generating attention to figure out how to compute the next thing... oh, okay, no, minus one makes sense. Because when you're generating the key for that token you incorporate a bit of the previous token's key, and that means it'll have a large dot product with anything that looks like A, assuming the query and key for A are similar. Yeah, okay, that makes sense. So it's basically the Transformer-native way of saying: assuming queries and keys have a high dot product for a single token, which they often do, this is a sneaky way of getting it to pay attention to the token after the previous occurrence. Yeah, that makes sense, as you said. So I'm going to move on from this argument unless there are any other questions or comments. Okay, so the next thing they do is remove induction heads at test time: they train the model and then basically delete various attention heads during testing, in a slightly complicated way to try to make the calculation still make some sense, and then try to attribute the amount of in-context learning that each head was responsible for. I really wish they had gone into more detail about this.
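Since the method isn't spelled out here, the following is only a guess at the simplest version of this experiment: knock heads out one at a time by zeroing their contribution, and see how much an in-context-learning score (loss late in the context minus loss early in it) degrades. The `run_model` function, its `ablate` argument, and the token positions are hypothetical names for illustration, not the paper's interface.

```python
# Hypothetical interface: run_model(tokens, ablate=None) returns per-token losses
# from the trained model, with the given (layer, head) zeroed out in the residual
# stream if `ablate` is set. Nothing here is the paper's actual code.

def icl_score(token_losses, early=50, late=500):
    """Loss late in the context minus loss early in it; more negative means the
    model benefits more from the preceding context. Positions are illustrative."""
    return token_losses[late] - token_losses[early]

def head_attribution(run_model, tokens, n_layers, n_heads):
    """Knock heads out one at a time and record how much the in-context-learning
    score moves relative to the intact model."""
    baseline = icl_score(run_model(tokens))
    effect = {}
    for layer in range(n_layers):
        for head in range(n_heads):
            knocked = icl_score(run_model(tokens, ablate=(layer, head)))
            effect[(layer, head)] = knocked - baseline
    return effect
```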
I guess I don't really understand the method here, and exactly how they're attributing in-context learning. Maybe it is just as simple as deleting an attention head, as in making its effect on what they call the residual stream, the sort of update on the current token, zero, but I was a bit confused about what was going on here. Any comments, or anything we want to talk about with that, before we keep going? Well, what was the conclusion? Does it reduce the in-context learning? They claim it does. I don't really understand it well enough: rather than just showing a plot that says here's what happens when we knock out an induction head versus a non-induction head, look, in-context learning changes by less, they show plots that give an attribution of the amount of in-context learning to each head. Yeah, I'm pretty confused. Can the model still do other things, though, things that aren't in-context learning? It's like they remove these heads, the model can't do anything, and they're saying look, it can't in-context learn now, because they took out half its brain. Well, that's sort of what I'm wondering: if you just delete enough, of course that will break it, but I don't think they would have done this and published it if that's what was happening. Yeah, I'm a bit confused by it, and I wish they had gone into more detail about this. Cool, any other comments? Matt, did you have a question? Yeah. Let's assume that they do have evidence that these induction heads they've identified are responsible for in-context learning, in the sense that when you remove them it differentially lessens the model's ability to perform in-context learning. That would update me towards thinking they found the mechanism for in-context learning, rather than just a mechanism for in-context learning. It could be that these heads are doing something else, and they haven't quite fully understood what these heads are doing, but if you remove the heads and you remove the in-context learning, then it can't be that there are some other heads out there doing in-context learning in some other way, I guess.
It could be that those specific heads are doing some other thing, and that thing is what is responsible for in-context learning, with this particular metric just coming along for the ride rather than being the key part of in-context learning. But presumably the plot they're giving shows that the heads scoring highest according to their prefix matching metric are the most responsible, that is, have the biggest effect on in-context learning when you knock them out, and that would be the result you're looking for, Matt. I think that's right, and that's my presumption based on the description of these plots; I guess they would be doing something like that. Yeah, it makes sense. Oh, one thing I'll point out is that they only did this for small models; for large models, this thing they're doing is apparently completely infeasible computationally. One other thing, actually: you can assume that this piece of your algorithm is represented by this head, but I don't think the Transformer has to be that modular. If in-context learning were actually implemented algorithmically by a bunch of small contributions from a bunch of heads, in the 0.2 parts of those heads rather than the 0.8 parts that correspond to this prefix matching score metric, it could still get by without being detected by this kind of methodology, I think. Depending on the specific way it was distributed across heads, it may or may not be detected, but I guess I'm still thinking there might be other explanations for this evidence. Cool. I just quickly looked up how strongly they state things in this section: they conclude at the end that they feel comfortable concluding that induction heads are the primary mechanism for in-context learning in small attention-only models, but they see this evidence only as suggestive for the MLP case, the case of small models with fully connected layers. Sorry, I missed the distinction: what was the MLP case and the non-MLP case? For attention-only models they conclude that it is the primary mechanism for in-context learning. Yep. Could you quickly remind me what the difference is between the attention-only and MLP architectures? Think of the architecture of, say, a GPT-style model: it has an attention layer and then a fully connected layer. Right, so attention-only has only attention, no fully connected layers. Got it, thank you. And for maybe obvious reasons, attention-only models are much easier to understand.
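A toy sketch of the two block types being contrasted here, assuming a standard GPT-style residual structure; the attention and MLP functions below are simplified stand-ins, not a real implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy model width

def attention(x):
    # stand-in for a full multi-head self-attention sub-layer (details elided):
    # here just a causal uniform average over earlier positions, for illustration
    mask = np.tril(np.ones((len(x), len(x))))
    return (mask / mask.sum(axis=1, keepdims=True)) @ x

W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

def mlp(x):
    # stand-in for the fully connected (MLP) sub-layer
    return np.maximum(x @ W1, 0.0) @ W2

def attention_only_block(x):
    # attention-only models: each block is just attention added into the residual stream
    return x + attention(x)

def gpt_style_block(x):
    # GPT-style block: attention, then the fully connected layer,
    # each added into the residual stream (layer norms omitted)
    x = x + attention(x)
    return x + mlp(x)

x = rng.normal(size=(10, d))  # 10 token positions, width d
print(attention_only_block(x).shape, gpt_style_block(x).shape)
```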
Cool, any other questions or comments about this part? Cool. So the last, well no, it's not even the last, the next thing is that induction heads do other things. (In this section, "induction" is not quite what we usually mean by induction; induction versus deduction, I guess not.) Okay, so in this section they're looking only at their large models, the 13 million to 13 billion parameter models; they don't look at the small models here. It's basically qualitative observation of what certain induction heads are doing, so I wouldn't read too much into this as hard evidence; it relies on someone spotting something interesting and making an observation. But basically what they do is identify their induction heads using this prefix matching score and then look at what those heads do on certain examples of text. So they look at one particular induction head within their 40-layer, 13-billion-parameter model, and they look at where it allocates attention on a sequence which is the same sentence in three different languages: a sentence in English, the same sentence translated to French, and the same sentence translated to, I think it's German. What it does is allocate attention in the same way you'd expect an induction head to, but with a translation component: if it's looking at a word in the French sentence, it attends to the word after the translation of that word in the English sentence, and that is what it predicts. So this is an induction head that is in some way doing something a bit more than simple induction; it's not literally reading off the sequences, it's applying some other transformation first. I think it's worth noting that the test they're using to detect the head is a problem in a sort of general class which includes that behavior, but going between those two things is already a form of generalization:
for the head that's doing that to also show up as high-scoring on that metric is already an indication that it's a general algorithm. Strictly speaking, the two aren't really that similar, except when you look from the point of view of the abstraction: getting a high score on that metric and doing that translation task are not that similar. So it's very interesting that you can detect something that, when we look at it, we can sort of understand why you might call it an induction head, but it seems quite remarkable to me that it shows up as a high-scoring head on that metric. Yep. The other thing they do is a sort of synthetic pattern-matching example: they give it sequences where the first word is either a month or a color, the second word is either an animal or a fruit, and to each of those four combinations it has to assign a label, zero, one, two or three, so it's a sort of synthetic classification task. It's doing something similar here. I recommend playing about with the visualization; it's difficult to describe exactly, but this one particular head in one particular large model, a head that also happens to be an induction head, is doing this more complex pattern-matching task as well. I wanted to mention these because I think it supports the idea that the heads they're identifying with the prefix matching metric, particularly in large models, are probably doing something more complicated than their definition of an induction head, although they do still perform that algorithm on the simple random token sequences. Any comments on that before we move on? Okay. So when you first said induction heads do other things, I thought that was a throwback to what we were discussing earlier, where a head is not 100% an induction head, it's more like 80% induction head and 20% something else. So is this different? Is it the 80% of the induction head that you think is doing something else here, or do you think it's the 20% that is doing something else here, if that makes sense? Yeah, I really don't know. I mean, can I restate the question, and maybe Matt can say whether it's a better or worse form of the question?
So in the case of the translation, that's clearly a general behavior in the same sort of category as what we narrowly defined induction heads to be, and the meaning of "induction head" is maybe a bit fluid: it might refer to a sort of general capability that's a bit like that pattern matching, but maybe only recognizable to us at an abstract level and not mechanically exactly the same. And is the question, Matt, whether it's 80% a general capability in that class and then 20% some other unrelated stuff, like looking for full stops? Is that your question? Yes, I think that's my question. I appreciate it's probably unclear what the answer is, but yeah, that's fine. Okay, I guess the way they're presenting this is: look, when the data is such that really all you can do is this induction head algorithm, as on these repeated sequences of random tokens, then these heads do that task, and that's reflected in their prefix matching score; but if you give them a more complex input, more complex data coming in, they're able to do a more general version of this algorithm. I would say that is how they present it. Okay, so maybe, because it's the same head, am I right that it's the same head? If you watch that head while feeding in the prefix matching data it does one thing, and then, without changing the weights or anything, you switch the input from the prefix matching data set to the translation data set, and now that same head does something similar in this new context? In that case it's like the 20% must be changing the behavior, or maybe it's the rest of the model around it, I'm not sure. It could be something totally different, right. Yeah, it could be, for this translation task, maybe ignoring difficulties with translation like word ordering changing and things like that, just off the top of my head, it could be that the layer immediately before is literally just a dictionary which maps between English and French or something, and then the head looks at the output of that dictionary and just does induction-heady kind of stuff. Yeah, that's how I translate to German, so that makes sense to me.
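As a toy illustration of that "dictionary, then induction" idea (entirely made up, not from the paper): if some earlier layer mapped tokens to a language-independent form, the plain one-step induction rule applied to the mapped tokens would already produce the translation-like attention pattern described above.

```python
# Toy dictionary mapping tokens to a shared canonical form; the entries are invented.
CANONICAL = {"chat": "cat", "le": "the", "the": "the", "cat": "cat", "sat": "sat"}

def canon(tok):
    return CANONICAL.get(tok, tok)

def induction_predict_canonical(tokens):
    """Find the most recent earlier token with the same canonical form as the
    current token and predict the token that followed it."""
    current = canon(tokens[-1])
    for i in range(len(tokens) - 2, -1, -1):
        if canon(tokens[i]) == current:
            return tokens[i + 1]
    return None

# An English phrase followed by the start of its French counterpart: the rule
# attends to (and predicts) the token after the English translation of "chat".
seq = ["the", "cat", "sat", "le", "chat"]
print(induction_predict_canonical(seq))  # -> "sat"
```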