A Transformer is a neural network used for tasks such as language modelling and text generation. It takes a sequence of tokens, applies layer normalization, attention and feed-forward layers, and outputs a sequence of probability distributions over the next token. In-context learning with few-shot prompts enables more complex tasks, such as asking a question and receiving an answer. Viewed as a statistical model, a Transformer has a prior baked into the weight initialization and is trained with a cross-entropy loss that measures how well its output matches the true distribution of tokens.

A parameterized family of functions, such as a neural network, can be used to produce a distribution over a discrete set. By Bayes' rule, the conditional probability of each class is written as a ratio whose denominator is a sum over the classes; modelling the logarithms of the probabilities in this formula with the parameterized family leads to the softmax distribution. Logits are vectors used to model a distribution over m categories or classes: the softmax takes a vector of size m as input and outputs a probability distribution. The set of all probability distributions on m classes is a simplex, where all the probabilities lie between 0 and 1 and sum to one. The softmax maps logits onto the interior of this simplex, where every probability is strictly between 0 and 1.

Neural networks are commonly used for classification tasks such as image classification and language modelling, where the model outputs a probability distribution that is matched to the truth by a cross-entropy loss. The KL divergence is a measure of the difference between two probability distributions, and can be used to compare a model's output to the true category for a given input. It is calculated by integrating, over all inputs and outputs and weighted by the true distribution, the difference between the log of the true distribution and the log of the model distribution. The cross entropy renormalizes the KL divergence by dropping the entropy term, which removes the infinity that would otherwise result from the logarithm of zero.

A Transformer is a neural network that takes discrete tokens as input and maps each one to a vector of continuous values. These vectors are then updated with context from other tokens. After this processing, an unembedding step and a softmax are applied, resulting in a sequence of probability distributions over tokens. The KL divergence measures the difference between a model and the truth, but with a one-hot truth in a classification problem it cannot be made exactly zero.

A Transformer takes in a sequence of tokens and outputs a sequence of conditional probability distributions of the next token given the tokens so far. Time is manifested in which entities a given entity is allowed to attend to, or receive information from. Attention is the part of the Transformer that allows the token at position t to receive signals from previous tokens. A Transformer block consists of layer normalization, attention and a feed-forward part. Each token is mapped to a vector by the embedding matrix, a positional embedding is added, and the vectors are stacked into a matrix that is processed by several blocks in turn, with the entity at position t receiving signals from the previous entities.

In-context learning is a way of generating text with a Transformer such as GPT-3. A sequence of tokens is used as a prompt to predict future tokens; the final predictions are obtained by applying layer normalization, a linear transformation and a softmax. This allows for the prediction of multiple tokens from a single context. In the framework of statistical learning theory, the setting is formally defined by a space of parameters, a true distribution and a loss function. Prompts enable more complicated learning tasks, such as asking a question and receiving an answer.

A Transformer is a statistical model with a set of parameters (theta) determined by a training procedure. It is trained with a cross-entropy loss against the true distribution of tokens, which is given by a set of sequences. The prior is baked into the weight initialization, usually Glorot or He initialization with a Gaussian distribution centered at zero. In-context learning allows the model to predict the next token in a sequence without updating the weights; the GPT-3 paper refers to this as few-shot learning. This can be used to fix parameters in a mixture model, unlike a nested model where parameters are not fixed.

In-context learning is an emergent capability of large language models, or Transformer models. The talk first discusses the distinction between regression and classification, which from a theoretical point of view differ only in the measure space in which the output lives, and then introduces Transformer models. Finally, the speaker explains in-context learning in the framework of statistical learning theory.

A parameterized family of functions, such as a neural network, can be used to produce a distribution over a discrete set. By Bayes' rule, the conditional probability of each class can be written as a ratio whose denominator is a sum over the classes. This reformulation allows us to model the logarithms of the probabilities in this formula with the parameterized family of functions, leading to the softmax distribution. The argument assumes that all probabilities are non-zero; the discrete case becomes a genuine mathematical issue when considering the boundary of the space of probability distributions.
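As a sketch of the Bayes' rule argument summarized above, the conditional distribution over the m classes can be written as follows. The notation is an assumption made here for illustration: l_j denotes the logarithm of the joint probability of the input x and the class j, and all probabilities are assumed to be non-zero.

```latex
p(y = j \mid x)
  = \frac{p(x, y = j)}{\sum_{i=1}^{m} p(x, y = i)}
  = \frac{e^{l_j}}{\sum_{i=1}^{m} e^{l_i}},
\qquad l_j := \log p(x, y = j).
```

Modelling the vector (l_1, ..., l_m) with the parameterized function f(x, w) then gives the softmax form of the classification model.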

Logits are vectors used to model a distribution over m categories or classes. The softmax takes a vector of size m as input and outputs a probability distribution; this mapping from logits to probability distributions is called the softmax. The set of all probability distributions is a simplex, where all the probabilities lie between 0 and 1 and sum to one. The interior of this space is where all the probabilities are strictly between 0 and 1. The softmax maps logits surjectively onto this interior.
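The mapping from logits to the interior of the simplex can be illustrated with a minimal sketch in plain Python. The function name `softmax` and the example logits are choices made here for illustration; subtracting the maximum is a standard numerical safeguard and is not part of the talk.

```python
import math

def softmax(logits):
    """Map a list of m real logits to a probability distribution on m classes."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp.
    shift = max(logits)
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)  # the denominator, shared by every class j
    return [e / total for e in exps]

# Example logits (arbitrary values chosen for illustration).
probs = softmax([2.0, 0.5, -1.0])
```

Every entry of `probs` is strictly between 0 and 1 and the entries sum to one, so the output lies in the interior of the simplex.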

Neural networks are predominantly used for classification, with the model predicting classes by the formula p(y | x, w) = softmax(f(x, w)). This is reflected in tasks ranging from image classification to language modelling, where the model outputs a probability distribution over possible tokens. This is matched to the truth by a cross-entropy loss. The differences between the entries of the logits can be recovered from the probabilities by taking logarithms and differences. The softmax is not injective, because adding a constant vector to the logits leaves the output unchanged, but apart from this degeneracy it is injective.
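The degeneracy of the softmax, and the recovery of logit differences from probabilities, can be checked numerically. The following is a small sketch in plain Python; the logit values and the constant 7.0 are arbitrary choices for illustration.

```python
import math

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, -0.5, 3.0]
shifted = [l + 7.0 for l in logits]  # add the constant vector (c, ..., c) with c = 7

p = softmax(logits)
p_shifted = softmax(shifted)  # the common factor e^c cancels, so this equals p

# Differences of log-probabilities equal differences of logits,
# because the log of the denominator is the same for every class.
recovered = [math.log(p[j]) - math.log(p[0]) for j in range(len(p))]
expected = [l - logits[0] for l in logits]
```

Only the differences between logits are determined by the distribution, which is why the softmax becomes a bijection after identifying logit vectors that differ by a constant vector.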

The KL divergence is a measure of the difference between two probability distributions. In machine learning, it can be used to compare a model's output to the true category for a given input. The cross entropy renormalizes this by getting rid of the infinity that would result from the logarithm of zero. How to deal with the boundary remains an open question, but the cross entropy is still a useful measure.

The KL divergence is a measure of the difference between two probability distributions. It is calculated by integrating, over all inputs and outputs and weighted by the true distribution, the difference between the log of the true distribution and the log of the model distribution. With a one-hot truth, the numerator inside the logarithm is either one or zero depending on the value of the output, and the denominator is the model's softmax probability. Ignoring the zero log zero terms, the KL divergence reduces to the expected negative log-probability that the model assigns to the correct class.
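The calculation described above can be sketched as follows. The assumptions are that the truth is one-hot, q(y | x) = 1 when y = j(x) and 0 otherwise, that the distribution of x is not modelled (so the model uses q(x)), and that 0 log 0 is taken to be 0. The name K(w) for the KL divergence is a label introduced here.

```latex
\begin{aligned}
K(w) &= \int q(x) \sum_{y=1}^{m} q(y \mid x)\,
        \log \frac{q(y \mid x)\, q(x)}{p(y \mid x, w)\, q(x)} \, dx \\
     &= \int q(x)\, \log \frac{1}{p(y = j(x) \mid x, w)} \, dx \\
     &= -\int q(x) \Big[ f(x, w)_{j(x)}
        - \log \sum_{i=1}^{m} e^{f(x, w)_i} \Big] \, dx .
\end{aligned}
```

The bracketed quantity is the log of a softmax probability and is therefore strictly negative, so K(w) is strictly positive for every w.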

The KL divergence is a measure of the difference between a model and the truth. In classification problems, it is impossible to make the KL divergence exactly zero, because the one-hot truth assigns zero probability to every category apart from one, while the softmax assigns positive probability to all of them. The ratio in the formula is obtained using Bayes' rule, and since the distribution of x is not being modelled, it is taken to be q(x). Minimizing this loss is the goal when fitting a model to the truth.
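The unrealizability can be seen numerically: with a one-hot target, the cross entropy shrinks as the correct logit grows but never reaches zero in exact arithmetic. The sketch below uses plain Python; the function name and the logit scales are illustrative assumptions, and the scales are kept modest because in floating point the loss would eventually round to zero.

```python
import math

def cross_entropy_one_hot(logits, true_class):
    """Return -log softmax(logits)[true_class], computed via log-sum-exp."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(l - shift) for l in logits))
    return log_norm - logits[true_class]

# Three classes, class 0 is correct; increase the correct logit step by step.
losses = [cross_entropy_one_hot([scale, 0.0, 0.0], 0)
          for scale in (1.0, 5.0, 10.0, 20.0)]
```

Here each loss equals log(1 + 2 e^(-scale)), which is positive for every finite scale and only tends to zero in the limit.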

A Transformer is a neural network that outputs a vector used to produce a family of distributions. These distributions are viewed as the output of the model: distributions over possible words. Phuong and Hutter's paper, Formal Algorithms for Transformers, provides a formal algorithm for Transformers, and a picture can be drawn to illustrate how the model works. The neural network takes some number of tokens as input.

A Transformer is a machine learning model that takes discrete tokens and transforms them into continuous vectors. This is done through an embedding step, followed by multiple layers of Transformer blocks, which update the vectors with context from other tokens. After this processing, an unembedding step and a softmax are applied, resulting in a sequence of probability distributions over the possible tokens.
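The overall pipeline (embedding, iterated blocks, unembedding, softmax) can be sketched in plain Python. This is a toy outline under stated assumptions, not the algorithm from the reference: the names `forward`, `embedding`, `unembedding` and `identity_block` are hypothetical, the block is a placeholder that does nothing, and positional embeddings are omitted.

```python
import math

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(tokens, embedding, blocks, unembedding):
    """Embed tokens, apply each block in turn, then unembed and softmax per position."""
    # Embedding: look up a continuous vector for each discrete token id.
    xs = [list(embedding[t]) for t in tokens]
    # Each block maps the whole sequence of vectors to an updated sequence.
    for block in blocks:
        xs = block(xs)
    # Unembedding: one logit per vocabulary entry, then an independent softmax per position.
    return [
        softmax([sum(w * x for w, x in zip(row, vec)) for row in unembedding])
        for vec in xs
    ]

# Toy sizes (assumptions): a vocabulary of 3 tokens and vectors of dimension 2.
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
unembedding = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
identity_block = lambda xs: xs  # placeholder standing in for a real Transformer block

distributions = forward([0, 2, 1], embedding, [identity_block, identity_block], unembedding)
```

The result is one probability distribution over the vocabulary for each input position, matching the shape of output described above.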

A Transformer takes in a sequence of tokens and outputs a sequence of conditional probability distributions of the next token given the tokens up until now. Time is manifested in terms of which entities a given entity is allowed to attend to or receive information from. The original Transformer from 2017 consists of an encoder and a decoder. The vector of parameters is denoted theta, the input is a sequence of tokens, and the output is a sequence of probability distributions. The number of layers is one of the hyperparameters, and slice notation is used to denote the input sequence.

Attention is an important part of the Transformer which allows tokens at position T to receive signals from previous tokens. The Transformer block consists of layer normalization, attention and a feed forward part. Attention works by taking the token at position T, mapping it to a vector using the embedding matrix, adding a positional embedding to it and then taking all the entities together to form a matrix. This matrix is then processed multiple times and the entity at position T receives signals from the previous entities.

X is a matrix that holds the current representations of the entities. To update these representations, a linear transformation is applied to each entity, and the output is split into three vectors: query, key and value. A softmax distribution is used to determine which values are more relevant, and those are propagated with a higher weight. Finally, a feed-forward network is used to refine the result. X is preserved throughout the process, allowing information to be routed between entities and updated over multiple layers.
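The attention update described above can be sketched as a single attention head in plain Python. Several details are assumptions made for illustration and are not taken from the talk: separate query, key and value matrices (equivalent to splitting one stacked linear map), the 1/sqrt(d) scaling from the standard formulation, and the omission of layer normalization, multiple heads and the feed-forward part.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def causal_attention(xs, w_q, w_k, w_v):
    """Single-head attention in which position t attends only to positions <= t."""
    queries = [matvec(w_q, x) for x in xs]
    keys = [matvec(w_k, x) for x in xs]
    values = [matvec(w_v, x) for x in xs]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for t, q in enumerate(queries):
        # Scores against the current and earlier positions only.
        scores = [sum(a * b for a, b in zip(q, keys[s])) / scale for s in range(t + 1)]
        weights = softmax(scores)  # more relevant values get a higher weight
        mixed = [sum(w * values[s][i] for s, w in enumerate(weights))
                 for i in range(len(values[0]))]
        # Residual update: the representation x is kept and the routed signal is added.
        outputs.append([x_i + m_i for x_i, m_i in zip(xs[t], mixed)])
    return outputs

identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights, chosen for illustration
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(xs, identity, identity, identity)

# Changing a later entity does not affect the outputs at earlier positions.
out_changed = causal_attention([xs[0], xs[1], [5.0, -3.0]], identity, identity, identity)
```

The first position can attend only to itself, so its output is its own vector plus its own value; later positions mix values from everything before them.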

In in-context learning, a sequence of tokens is used as a prompt, or context, to predict future tokens. This is done by applying layer normalization, a linear transformation and a softmax to the representation of each entity. A prominent example is GPT-3, which has 96 layers and a model dimension of roughly 12,000. This allows for the prediction of multiple tokens from a single context.

In-context learning is a method of generating text using a Transformer such as GPT-3, where the initial sequence of tokens, known as the prompt, is deliberately engineered to form a pattern. The prompt is fed into the Transformer, which predicts the next token; that token is appended to the context, which is then used to predict the following token, and so on. This can be used to create more complicated learning tasks, such as asking a question and receiving an answer. In the framework of statistical learning theory, the setting for in-context learning is formally defined by a space of parameters, a true distribution and a loss function.
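The generation loop described above can be sketched in plain Python. Everything here is a hypothetical stand-in: `toy_model` is not a Transformer, just a function returning a distribution over a three-token vocabulary, and picking the most likely token at each step (greedy decoding) is one choice among several that the source does not specify.

```python
def generate(next_token_distribution, prompt, num_new_tokens):
    """Greedy autoregressive decoding: append the most likely token, then repeat."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        probs = next_token_distribution(tokens)  # dict mapping token -> probability
        tokens.append(max(probs, key=probs.get))
    return tokens

def toy_model(tokens):
    """Hypothetical stand-in for a Transformer: favours the token seen two positions back."""
    vocab = ["a", "b", "c"]
    favoured = tokens[-2] if len(tokens) >= 2 else vocab[0]
    return {tok: (0.8 if tok == favoured else 0.1) for tok in vocab}

# The prompt forms a pattern; the weights (here, the function) are never updated.
completion = generate(toy_model, ["a", "b", "a", "b"], 4)
```

The pattern in the prompt is continued purely through the growing context, which is the sense in which the learning happens "in context".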

A Transformer is a statistical model with a space of parameters (theta) determined by the training procedure. The true distribution of tokens is determined by a set of sequences of tokens (e.g. several thousand tokens), and the loss is calculated by taking the cross entropy of the one-hot distribution against the conditional distribution output by the Transformer. The prior is baked into the initialization procedure for the weights, usually Glorot or He initialization with a Gaussian distribution centered at zero. This model can be formulated in SLT.

In-context learning refers to the ability of a model to predict the next token in a sequence without further updating of weights. This is distinct from weight updates, which are used in other forms of learning. The GPT-3 paper refers to this as few-shot learning, where one shot is better than zero shot and so on. This can be used to fix parameters in a mixture model, where some of the entities are fixed in the attention. This is different from a nested model, where parameters are fixed.

Welcome, everyone. Today we're going to talk about in-context learning, which is one of the emergent capabilities of large language models, or Transformer models, that was discussed last time in connection with scaling laws. Today I want to introduce what in-context learning is, and we'll see how far we can get in formulating this kind of learning in the context of statistical learning theory a bit more formally. So the outline for today's talk is: first, I'm going to make a few remarks about regression versus classification, because I'm still angry at reviewer 2 from one of our papers for making a big deal out of this; two, a definition of Transformer models; and three, then we'll be in a position to talk about in-context learning in some more detail. Okay, regression versus classification. Most of the time when we've been talking about neural networks in singular learning theory, we've been thinking about neural networks as a class of functions from which you can construct a class of models, and those models are attempting to perform regression. In machine learning a distinction is sometimes made between (I'm not going to write these again, this is R and C) regression and classification. Regression is about modeling a conditional probability distribution where y, the output, is continuous, and classification is about y discrete. Now maybe there's some more fundamental reason to care about the distinction between these two things, but from a theoretical point of view, say the point of view of reading Watanabe's book, we really don't care: they're both just measure spaces. At least that's my take; maybe Edmund can correct me. I guess in the context of something like decision trees they will be really different, at least when we get down to the nitty-gritty details of how to implement things. But yes, from a model point of view, I agree with you. Yeah, so from a high-level theoretical point of view, what kind of measure space your
y's live in is just a detail that is hidden: you just take it to be a measure space and then you formulate, and wherever you see an integral, it's now a sum. But who cares, you just write the integral of q(y) dy, and the measure takes care of which one of those two it is. As Edmund said, from a practical point of view, or an implementation point of view, it may make a very big difference, and that's to some degree what I want to talk about now. But I just wanted to make the point that at a high level you

can read Watanabe's book and it talks also about classification, if you like, and there really isn't a distinction at that level of theory. Okay, so now let's turn more precisely to the class of models, or regression models, that we're often talking about. We're often taking a parameterized function with parameter space W, and then we formulate what we might call a regression model, if we wanted to use one of these terms, and that's p(x, y | w), something like this. We've had many talks in the seminar series that start from this form of conditional probability distribution and then perform some kind of analysis on it, and this is how you would go from this kind of parameterized family of functions to a class of regression models. So here y is some continuous variable ranging over R^m. Now I want to explain how you might instead go from a parameterized family of functions, say a neural network, which is often what we're considering, from the same starting point, but instead of producing this distribution over R^m, produce a distribution over a discrete set. We might as well take the discrete set 1 through m, which I'll write in ordinal notation like this. Okay, so I'm going to motivate this with a little bit of, I wouldn't say it's a derivation exactly, but a kind of Bayes' rule argument. We're leading up to the softmax, which has sort of arisen earlier when I described Transformer models, but now I'm going to introduce it more carefully. Well, if we're to have some distribution over y on this discrete set, then by Bayes' rule we have this. Now, since the values of y are finite, we can just write this as a sum, and so, just by substituting, all right. So the point of this reformulation is that, instead of thinking about this distribution over our discrete set, we can think about this family of continuous numbers, or rather their logarithms. So we can model with our parameterized family of functions f the probabilities that arise in this
formula. So these p(x, y = j), we're assuming they're all non-zero, which actually is a bit of a technical point that may be non-trivial in SLT. I said that, you know, who cares whether they're discrete or continuous distributions; the place where I think this actually becomes a genuine mathematical issue is to do with the boundary of the space of probability distributions, which you don't have to worry about in the case of continuous distributions but actually

plays a role in discrete distributions. If you're interested, Tom Waring's thesis has a bit of a treatment of this: you have to resolve the boundary of the space of parameters, in addition to singularities, to properly treat the RLCT in these settings. So this is actually an irritating and potentially meaningful complication. I don't want to get too far into that today, so I'm just going to assume that all these probabilities are non-zero, or what's the same, that we're in the interior of the space of probability distributions. Okay, so if I want to model this distribution over m categories or classes, I can model the logarithms of these probabilities, and I'm going to call those l_j. It's traditional in the machine learning literature to call these logits; I don't know if that's an older statistical term or where this came from exactly, but this is a widely used terminology. So take the logits as a vector: if you imagine all possible conditional probability distributions over the ordinal m and then you take the logarithms of these probabilities, what possible vectors can you get out? Let's investigate that a bit. If we rewrite the original probability distribution in terms of these logits, what we've got is what's called the softmax. The softmax takes as input a vector of size m and outputs a probability distribution on this m-element set, where a tuple (l_j), j = 1 through m, becomes the probability distribution where you take the exponential of l_j (the exponential of l_j is of course the numerator on the previous board), and then the denominator on the previous board is the sum over i = 1 to m of all the exponentials. So this denominator is the same for every j, and it's just the numerator which varies with j. This mapping from logits to probability distributions is called the softmax. Any questions so far? All right. Okay, so this is surjective onto the interior. What do I mean by the interior? Well, the set of
all probability distributions is a simplex. Think about the constraints: it's a tuple of non-negative real numbers less than or equal to one which sum to one. That space sits inside R^m, and its boundary inside R^m is going to be where any one of the probabilities is zero. So the interior is the set of probability distributions where all the probabilities are strictly between 0 and 1. If you give me any distribution in that interior, then clearly you can take the logarithms

and construct some logits which map onto that. More precisely, if you give me a tuple all of whose entries are non-zero, then to construct the distribution, right. So we start with these logits; then you can recover the logits by noting that if you take p_j to be the jth entry in the distribution, then log p_j will just give you l_j minus some stuff, and the stuff is the same for l_k, so you just get a cancellation. Okay, so you can recover from the probabilities, by taking logarithms and differences, the differences between all the entries of the logits. But hopefully you're starting to see that this map is not injective, far from it, because if I take the softmax of some tuple l and I add some vector of constants, then the c's are all going to cancel out: I'll get this common factor of e^c in the numerator and denominator. So even before we get to degeneracy in the model, the f part of constructing these conditional distributions, we already have this degeneracy just arising from the softmax. But it's easy to see (the proof's in the notes, it's not very interesting) that apart from this degeneracy the softmax is injective. So if you consider the diagonal subgroup of R^m, which is just these constant vectors, and mod out by that subgroup, so identify two tuples that differ by such a constant vector, then the softmax is a bijection of that quotient with the interior. Actually, I don't know if that's a homeomorphism with the quotient topology; it probably is. Okay, so the conclusion now is that given a parameterized family of functions we can also give a classification model, that is, one that predicts classes by the formula p(y | x, w) = softmax(f(x, w)). This is the predominant way that neural networks appear in, say, language modeling. Actually, most of deep learning is about classification rather than regression. That goes back to the original tasks in which deep learning excelled, which was image classification, where m
would be the number of possible categories for images (cat, dog, train, etc.), through to the modern day, where language models are predicting tokens. So which token is the next word? Well, the model outputs a probability distribution over possible tokens, and that is the prediction which is attempted to be matched to the truth by a cross-entropy loss. Okay, so this is how neural networks give rise to models in the context of deep learning; predominantly it's classification. Now I want to make a brief remark about

that loss that I just mentioned: what happens if I take the KL divergence of a model produced in this way with a one-hot distribution, as you'd call it in machine learning, which represents the true category for a given input? So that's the next few minutes, but are there any questions about the softmax first? All right, let's do a little calculation. Let's take the characteristic function of the jth class, and suppose we're using this model and the true distribution for a given input x is this one-hot vector, as you would see it formulated in, say, PyTorch or TensorFlow. Okay, well, if I take the KL divergence, maybe I shouldn't say KL divergence. There's a few technical problems here: we shouldn't really allow this thing. If I want to hit the KL divergence against this, I'll end up with logarithms with zeros in them, so I should put tiny deltas in all the zeros, and one minus delta or something. Let's just leave that annoying question for later. Actually, I don't know the correct way of dealing with that. I've in the past theoretically considered that by putting those deltas in and taking limits, and that is legitimate, but this is related to the earlier technical question about boundaries and how to deal with them, and it's something to come back to at some point. So the cross entropy basically renormalizes by getting rid of this infinity we'd get from the logarithm of zero. The cross entropy, remember, the KL divergence has this numerator in it, right? Maybe I'll remind you. Okay, so that's the KL divergence, and you can split it into a part that has this numerator, which is negative the entropy, and a part that just compares p(x) to q(x), like this. That's negative the entropy of q, plus, I'll just call it CE for cross entropy. Okay, so the cross entropy of this one-hot characteristic function distribution with our conditional distribution
Okay, my iPad just crashed, a moment please. Let's say, Edmund, what's the principled way of doing classification and avoiding these infinities? I was wondering whether or not it's well defined, since it's a zero log zero scenario, isn't it? Yeah, that's right. So obviously it's not well defined, but the question is what is the right way of thinking about this. I mean, I guess I was actually saying it is not not well defined. Oh, I see, it's like this. Yeah, it's like taking a limit from

the positive side, because we are dealing with positive numbers. That seems to be okay. Yeah, I guess you just have to be very careful manipulating expressions if you have this special definition of zero log zero. But yeah, maybe that's right. Okay, so if I compute this conditional distribution, what is this? Well, it's going to be minus, okay, this is a bit confusing because I've written it with just an x up here, so maybe I need to get rid of this and start again a little bit, because there's an x and a y in the distributions I'm talking about. Maybe let me just write it as the KL divergence and then take Edmund's suggestion to blithely delete the terms I don't like. Okay, so the way we think about all these distributions fundamentally is as joint distributions over the input and output, which we refactor as conditional distributions. So this by definition is what the KL divergence is between our model and the truth. Wait, this is for all x's, so I shouldn't do it like this: the j should be a function of x here. Okay, so let's do that. Did you mean to swap the "given" symbol and the p(x)? Which p expression? Oh yeah, that is meant to be that way; I'm going to put it back to the way it was in a second. Which means it should be that way on the left-hand side as well, I think. Oh sure, yeah, that's fine, thanks. All right. Now this thing here, this q(y | x), well, I guess I need to put a q(x) over here or something, right? This is one. Okay, so this integral over y just collapses, because I have this delta function here, so this is really just an integral over x. And then this numerator here is either one or zero depending on the value of y. Well, this
integral over y is really a sum, first of all. So that numerator is either one or zero; the zero terms I'm ignoring, I mean ignoring the zero log zero terms, and so I'm left with 1 over p of the correct class for that given input, dx. Okay, so finally, what is p(y = j(x) | x, w)? Well, that's given by the left-hand board, by this expression here. So if we just substitute that in, what we get is f(x, w)_{j(x)} minus the log of the denominator. All right, so this is the KL divergence

between the model and the truth, or the loss, and it's the quantity that we would estimate based on a sample and try to decrease in order to fit the model to the truth. Notice that with this particular true distribution this is unrealizable, because you'll never make this actually zero. This quantity here is negative, and if you take its exponential you'll see that the KL divergence will be zero exactly when zero is equal to the sum over i not equal to j(x) of e^{f(x, w)_i}, for all x, and the right-hand side is a positive number, so you'll never make this equal to zero. You could make it realizable, I guess, by perturbing this a little bit, but this is the part where I'm actually unsure about how to deal properly with this. I guess a classification problem from a pure Bayesian perspective is just mistaken fundamentally: you can't have the truth have zero uncertainty in these other categories, or something like that. So maybe the answer is just to reject the question, but I don't actually know the best way to think about that. Dan, I've got a good dumb question. Yeah? How'd you go from here to here? Just using Bayes' rule in the numerator and denominator. Oh, all right, Bayes was involved, okay, cool, thanks. Yeah, so I guess the denominator is right, and I'm replacing the p(x): we're assuming the distribution of x isn't being modeled, so that's just q(x). Oh yeah, that's right, p(x | w) is just q in that case. Okay, cool, thanks. Other questions? Maybe a question, or maybe more of a sharing, which is that it's an interesting case where the continuous world is kind of colliding with the discrete. I'm accustomed to transforming between distributions that agree on what is possible and what is impossible; that's a requirement for a Radon-Nikodym derivative to exist between
two distributions. But in this case the one-hot truth is saying that everything is impossible apart from one category, giving zero probabilities. So I don't know, I don't want to hold you up, carry on. Yeah, I agree, I also find this, I mean, I suppose it makes sense from a more applied point of view, because, well, I just want to minimize my loss, and I could put the tiny numbers in there but it wouldn't change the direction I actually go

to minimize the loss. But yeah, I don't know how seriously to take it. I guess you can just put the deltas in there and get a KL divergence, and then maybe that has a valid limit as delta goes to zero, and then you call that your KL divergence, and maybe there's a more principled way in which that is correct. This is kind of an active theoretical issue for me, because with Tom Waring, a former master's student of mine, and James Clift, another former student, we care about singular learning theory in this setting, for considering models based on continuous relaxations of Turing machines and things like that. It's hard to know how seriously to take it. I feel a bit unsettled; I mean, I think I could do various technical things and make the problem go away, but I feel like I don't have a satisfactory way of thinking about it. Yeah, thanks for the comment. Okay, so the next thing I want to do is to give a definition of Transformers. The reason I prefaced that definition with this discussion is that the Transformers will make use of the softmax. The Transformer is a neural network; it will output a vector, and that vector will be used to produce a family of distributions, which are the things that are viewed as the output of the model, in terms of a distribution over possible words. Okay, so we're going to move over to the next set of boards to give the definition of the Transformer; we'll actually move completely around here. All right, so as you can see I've pre-populated this with an algorithm that I took from Phuong and Hutter's paper, from this year I believe; the reference is in the notes that I posted in the Discord. I think Formal Algorithms for Transformers is the title. Of course they didn't invent Transformers, but as a mathematician, if you try to read the Transformer paper you'll be left wondering what all these terms mean,
and this is the only self-contained reference that I know that really spells things out in a way that seems nice to a mathematician, so it's a highly recommended reference. But before we get into that reference itself, maybe I'll just draw a picture of how you're supposed to think about a Transformer model. I've drawn this picture before, but I think it bears repeating before we get into the formal definition. So a Transformer is a neural network. The neural network will take as input some number of tokens,

say ℓ. A token is an element of a discrete set, so you can think of it as a category in the previous discussion. Tokens are elements of some set like that; they would be, roughly speaking, parts of words. The first step a Transformer takes is to transform those tokens into something continuous: that's the embedding step. The other boards say exactly what all these steps mean, and they're color-coded with green and blue. An embedding is a learned mapping from discrete tokens to vectors. After those vectors are produced, they're processed through multiple layers of the actual Transformer, what's called the Transformer block. And it's not a mistake that I'm not drawing the arrows from right to left; I'll get to that in a minute. In the original literature these were called Transformer blocks; I notice that these days people just refer to them as layers. I'm going to say more about what they are, but for right now I just want to say there are going to be L steps that look like this: iterated Transformer blocks that take those original vectors associated to the tokens and update them with information from other tokens, so they incorporate the context, which is how people usually think about it.

Once that processing has been done, there's then an unembedding step, which is a term I hadn't seen before Phuong and Hutter's paper. I quite like it, although it doesn't map back to tokens, so maybe that's a reason not to like the terminology. It's just a learned mapping that is applied to every token, and then there's a softmax. So this is all one thing; well, maybe I shouldn't draw it like that, since the softmaxes are all independent. Each of those vertices is going to have a value after the calculation, which is some vector in R^m, where m is the number of possible tokens, and then we apply a softmax, and what
we've got at the end of that is a sequence of probability distributions. The way to read this: this is the t-th entity, where t runs over 1, 2, up to ℓ. The probability distribution that you get out at the end, you should read as the probability of... I guess this is what I was calling y before, right? With apologies, I'm going to switch to x, just because that's the way the formal definition is written. So the probability distribution you get out at the t-th column is interpreted

as the conditional probability of the next token being a given value, given all the previous tokens. So then you can see how you just factor the joint distribution over all the tokens into a product of all these conditional distributions.

To explain what's going on here I need to tell you what the Transformer block is doing, and why, for example, these arrows only go forward. The way to think about that is that time goes to the right. Strictly speaking there's no time in the model: all the tokens are fed in at once. The name for these vectors that appear associated to these nodes is entities, and time appears only in the convention about which entities a given entity is allowed to attend to, or receive information from. You interpret that in the situation of text, for example, as earlier text coming earlier in time, reading from left to right, token by token. Are there any questions about this diagram before I go and give a formal definition of these ingredients?

Okay, so let's take a look back here at the first board. There are many variants of Transformers. The original Transformer from 2017 is now referred to as an encoder-decoder Transformer; there's BERT; there are many variations, some of which are enumerated in the paper. I'm just going to focus on the one that appears in GPT-3, because that's the situation in which we've discussed in-context learning, for example last seminar. The decoder-only Transformer works as follows. It's a neural network; the vector of parameters is denoted theta. The input is a sequence of tokens x. The output is going to be a sequence of probability distributions, which as I said is interpreted as a sequence of conditional probability distributions of the next token given the tokens up till now. You'll notice that in the paper they use this kind of Python
notation: where we would write x sub t, as I did on the previous board, they write x[t], and where we would write x_1, ..., x_t they write this kind of slice notation. There are various hyperparameters; I won't stress all of them, but you've seen capital L, that's the number of layers. Let's just skip over those for now. Okay, so here's how it works. ℓ is the length of the input sequence, as I said before. For each t in [ℓ] (and again this bracket means the same as what I said before), you loop through all the

incoming tokens and you apply this rule to each token. So how does this work? This notation means the t-th column of W_p. Now W_p is one of the things in the parameter vector. So any unexplained notation (W_p, W_e, the gammas and betas on the next board, the curly W_l that you can see over here, these W_mlp's and b_mlp's) refers to vectors or matrices of weights that are actually trained during the training process of the Transformer; those are parameters. So we have some matrix of parameters W_p, and the t-th column is what I'm extracting by this notation, and I add that to the x_t-th column of W_e. W_e is the embedding matrix, and this term I've just underlined is the thing that I mapped the t-th entity to in my picture. To go back to my picture for a moment: this was the t-th entity, and it gets mapped to the vector W_e[:, x_t]. That is the starting vector associated to the token at position t; the token at position t is x_t.

So what's the sum? The first term in the sum, as hopefully I've explained, is just the vector which represents that token. The second is a positional embedding, which allows the network to know where in the input that token is. It's important, but I'm not going to say much about it; it's kind of irrelevant to the discussion about in-context learning, which I hope to get to. The result of adding those two is the t-th entity e_t. Then you take all those entities together and you get a matrix, that's X, and then you process X, and this processing you do L times. That's the Transformer blocks that I was mentioning, so the stuff here is the Transformer block. I don't want to say anything about layer normalization; it's not very interesting for our purposes, although it is essential to make the whole thing work. Conceptually, the main ingredients here are the attention and this feed-forward part. Attention I've
described a few times; maybe I'll mention briefly how it works again in a moment. This is where entity t receives signals from entities 1 through t; that's what this masking business is about here. Okay, maybe let's describe this on free space, if we can find it. Here. So how does attention work? You've got an entity at some layer, and that's going to be updated from all the previous entities, so it's supposed to receive information somehow from them,

and the way that's going to work is as follows. In the algorithm it's called X: X is a matrix, and its t-th column is the current representation, at this layer of the Transformer, of the t-th entity. So somewhere back down here at the beginning of time was this e_t, and then it was processed, and now it's this thing X[:, t]. So how do I aggregate all this information to generate a new version of the representation for the t-th entity? Well, I won't give it a name. I don't actually remember what the matrix is called in the Phuong and Hutter paper, but it's probably W_qkv, something like that. You take the i-th entity, you apply a linear transformation to it, and you split that output into three vectors, which you think of as query, key and value. I'm doing the one-head version; there's multi-head attention, that's what the MH stands for, but let's forget about the multiple heads, it's not much more complicated. Then what you do is create a softmax distribution, with some lambda (apologies, I was using i before for this). Okay, so that's the new value; that's the output, so to speak, of what's going on here. This thing here is a matrix, and its t-th column is going to be what I just described. Lambda is some parameter; just for technical reasons it's the square root of the size of these vectors, the q's and k's, but it's not really important for us. The way to think about this is that you extract from the current representation in this layer some values, so this is v_t, v_{t-1}, ..., v_1, and you propagate those values to update the representation for the next time step. The ones that are more relevant are propagated with a higher weight, and the relevance is determined by how aligned the query vector q_t is with the key vectors for every other entity; those which are strongly aligned will have higher
dot products and contribute more to that sum. In this way the Transformer routes information between entities in order to update them over multiple layers. Then there's this final step, which is just a feed-forward network: a feed-forward MLP, in this case with just one layer. The GELU is not important, it's just some nonlinearity. So it's kind of a refinement of the value. What else to note about this? One thing to note is that the original X is preserved somehow, so what you

actually learn is the delta: you learn what to add to the original representations in order to compute the next layer. Okay, and then I haven't quite finished describing the Transformer. After you've done these L layers you come to this part here. What do you do? You apply layer normalization one more time to all these vectors, one for each entity, then you apply a linear transformation to each one, and then you apply the softmax, and that gives you these distributions over tokens. All right, any questions?

How many layers are common?

Oh, good question. It depends on the application. If you're asking about GPT, I'd actually have to check; let me do that. I don't want to hold things up, and I can look it up afterwards, but from memory I think it's probably several hundred to a thousand, that order of magnitude. Okay, so GPT-3 with 175 billion parameters has 96 layers, so no, I was off by an order of magnitude. So 96 layers; the number of heads is 96; and the dimension of the model, which I think means the size of the q's, k's and v's, is about 12,000.
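As a rough illustration of the pipeline described on the boards (token plus positional embedding, L blocks of masked attention and a feed-forward step with residual additions, then a final layer normalization, unembedding and softmax), here is a minimal pure-Python sketch. It is not the algorithm as written in the Phuong and Hutter paper: it assumes a single attention head, omits biases, the learned layer-norm gammas and betas and the attention output projection, stores matrices as lists of rows rather than columns, indexes positions from 0, and uses random untrained weights. All names (`init_params`, `transformer`, the parameter keys) are hypothetical.

```python
import math
import random

def softmax(z):
    # Map a vector of logits to a probability distribution.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-5):
    # Simplified: no learned scale (gamma) or offset (beta).
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def add(x, y):
    return [a + b for a, b in zip(x, y)]

def causal_attention(X, Wq, Wk, Wv):
    # Single head. Entity t receives values only from entities 0..t (the mask).
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    lam = math.sqrt(len(Q[0]))  # square root of the query/key size
    out = []
    for t in range(len(X)):
        scores = [sum(q * k for q, k in zip(Q[t], K[s])) / lam for s in range(t + 1)]
        w = softmax(scores)  # relevance weights over entities 0..t
        out.append([sum(w[s] * V[s][i] for s in range(t + 1)) for i in range(len(V[0]))])
    return out

def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def init_params(n_vocab, d, n_layers, max_len, seed=0):
    rng = random.Random(seed)
    names = ("Wq", "Wk", "Wv", "Wmlp1", "Wmlp2")
    return {
        "We": random_matrix(n_vocab, d, rng),  # row x: embedding of token x
        "Wp": random_matrix(max_len, d, rng),  # row t: positional embedding
        "blocks": [{n: random_matrix(d, d, rng) for n in names} for _ in range(n_layers)],
        "Wu": random_matrix(n_vocab, d, rng),  # unembedding to logits
    }

def transformer(tokens, params):
    # Embedding step: token vector plus positional vector gives entity e_t.
    X = [add(params["We"][x], params["Wp"][t]) for t, x in enumerate(tokens)]
    for blk in params["blocks"]:
        A = causal_attention([layer_norm(x) for x in X], blk["Wq"], blk["Wk"], blk["Wv"])
        X = [add(x, a) for x, a in zip(X, A)]  # residual: only the delta is computed
        M = [matvec(blk["Wmlp2"], [gelu(v) for v in matvec(blk["Wmlp1"], layer_norm(x))]) for x in X]
        X = [add(x, m) for x, m in zip(X, M)]
    # Final layer norm, unembedding, and an independent softmax per position.
    return [softmax(matvec(params["Wu"], layer_norm(x))) for x in X]

params = init_params(n_vocab=5, d=4, n_layers=2, max_len=8)
dists = transformer([1, 3, 0, 2], params)  # one distribution per input position
```

Each entry of `dists` is a probability distribution over the vocabulary. Because of the mask, the distribution at a given position does not change if later tokens are altered, which is the sense in which time only goes to the right.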
Thank you.

Thanks. Other questions?

Okay, so now I have the vocabulary to define in-context learning. I think the first time I noticed this terminology was in the GPT-3 paper from OpenAI, whose title is "Language Models are Few-Shot Learners". I think it's a pre-existing term, but this capability is really the focus of that paper. All right, well, what is the context? If we have a token string x, which is x_1 through x_t, then in predicting token t+1, x is the context. If we go back to the previous board, the context is just all the things on which you're conditioning to make the next prediction. So in this box here you're predicting this token, and all these tokens are the context. As you move to the right, the context is getting larger. Now, while that's strictly speaking correct, when we're talking about in-context learning, generally what people mean is the following refinement of that, where some initial sequence of tokens is viewed as the context, or the prompt. So you would say you have some initial token string x, and then in predicting tokens t+1, t+2, t+3 and so on, the context is x. Now what does it mean to predict all those tokens? Here's how you would think about it.
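One way to picture predicting tokens t+1, t+2, and so on from a fixed prompt is as a sampling loop. The following is a minimal sketch under assumptions: `next_token_dist` is a hypothetical stand-in for running the Transformer on the tokens so far and reading off the distribution at the last position, and `toy_model` is an invented toy distribution used only so the loop can run; neither is GPT-3 or any real API.

```python
import random

def generate(next_token_dist, prompt, n_new, seed=0):
    # Repeatedly sample the next token and append it to the input.
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_new):
        probs = next_token_dist(tokens)  # distribution over the next token
        nxt = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(nxt)  # the sampled token is fed back in as input
    return tokens

def toy_model(tokens, n_vocab=4):
    # Hypothetical stand-in for a Transformer: most of the mass goes on
    # (last token + 1) mod n_vocab, the rest is spread evenly.
    probs = [0.05] * n_vocab
    probs[(tokens[-1] + 1) % n_vocab] = 1.0 - 0.05 * (n_vocab - 1)
    return probs

sampled = generate(toy_model, prompt=[0, 1], n_new=5)
```

The prompt is kept as the first entries of `sampled`, and every later entry was drawn from a distribution conditioned on everything before it.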

Suppose you start with some number of tokens; I guess I've written it t, so I'll stick with that. You run those through the Transformer and you get out these probability distributions, and then you sample. Now you've got a new token, x_{t+1}; you put that back in at the beginning and run the machine again. That gives you a new distribution, you sample from that distribution, and that gives you a particular token x_{t+2}. Then you take that token, feed it back to the beginning, and run the Transformer again. That's how you make a sequence of text using, say, GPT-3. If you open up GPT-3 on OpenAI's website, type some text into the box and click generate, what's happening is that the text you enter is this initial sequence of tokens x_1 through x_t. That's called the prompt, and then it's going to generate the rest by first predicting the next token, and then taking that as the context to predict the next token, et cetera.

So in-context learning is the special case of this text generation where the prompt is deliberately engineered to be a family of data points, in some sense, for a pattern that you wish the model to learn. It's hard to say exactly what counts as in-context learning and what doesn't, obviously, but at least we can say that we call it in-context learning when the prompt is explicitly designed to have input-output pairs or some kind of pattern. So for example, maybe x is the following sequence of tokens. Let's imagine some function which takes a string and returns a sequence of tokens; I'm just putting in new lines. Okay, so that's a sequence of tokens, and that would be the prompt for GPT-3. Then you would finish by framing a question, so you would say something like "six plus two equals", end of string, and then if you do the generation
part, hopefully it will give you the right answer. You can do quite complicated learning tasks in this way. All right, so that's in-context learning, and the question I want to pose, and then we can discuss this, is: what is the formal definition of in-context learning in the framework of statistical learning theory, say singular learning theory? We're used, in SLT, to formulating this in terms of triples or quadruples, where you have a space of parameters, you have a true distribution,

you have a model, and you have a prior. So what are all of these things in this situation? Let me see if there's anything more I want to add before we start figuring out if we can answer this question. I hope it's clear that the Transformer itself is a statistical model. The space of parameters is the space of thetas: if we go back to the definition of the model, the values of all those weights make up theta. The true distribution is over the set of sequences of tokens. Maybe I should say how a Transformer is trained, so that you can infer the true distribution, or rather infer what a set of samples is, from the training procedure. Usually you would fix the context length; for GPT-3 I think it's on the order of several thousand tokens. So what is the loss? For each position in the sequence you generate this conditional probability distribution p(x_{t+1} | x_1, ..., x_t), and then the loss is a sum over all those positions of the cross entropy of that distribution with the true one-hot distribution of which token appears at that position in that sentence. So you take a set of sentences, or sequences of strings, from, say, English-language text on the internet; that gives you many sequences of 2000 tokens. Then you know at each position what the true token is, you take the cross entropy of that one-hot distribution against the conditional distribution output by the Transformer on that sequence of tokens, and you add those cross entropies over all the positions, and that's your loss.

Okay, so the true distribution is all these conditional distributions coming from actual text, whatever you take that to mean. The model: well, I've described how the model is computed, how these conditional distributions are defined. The prior, I guess, is baked into the initialization
procedure for the weights. Generally speaking it would be something like (actually I'm not quite sure what they use these days, but you would look it up under Glorot initialization or He initialization) some kind of Gaussian distribution centered at zero, with a standard deviation depending on how deep it is in the model. Okay, so that's the Transformer itself, and it's pretty clear how to formulate that as a formal model in SLT. But in what sense is this

like a nested model, this in-context learning? Any thoughts about that? I can try and think out loud; I genuinely haven't thought about this. I'm not even clear on what the space of parameters is. It's like a nested model, where you fix parameters: say you have a mixture model, a mixture of ten Gaussians, and then you fix some of the parameters, and that's again a model. What you're doing by fixing these input-output pairs is kind of fixing some of the entities that are fed into the attention. So it's not like a nested model in the sense where you're fixing some parameters; you're doing something a little bit different.

Maybe I can... I have a question. The term in-context learning: is it referring to the phenomenon that one-shot is better than zero-shot, two-shot is better than one-shot, three-shot is better than two-shot, et cetera? Or is it referring to the fact that you can use the model this way, as in make it predict the next token? The second is a much more abstract concept: that the training procedure of just predicting the next token somehow learns, in a very colorful sense, other skills like addition.

Actually, I find the way that terms like few-shot and one-shot are used to be quite confusing. I believe in the GPT-3 paper they do refer to the situation I've just outlined as few-shot; let me just double check. Yeah, that was the sense in which I was using it; they call it few-shot. But I think elsewhere in NLP this has some other meaning, potentially. What's the name for it when you take... oh no, that's like top-k. Okay, maybe it doesn't have some other meaning.
Sorry, I guess repeat your question.

I guess a more concrete question here is: the way of learning in in-context learning has nothing to do with training? I think there is no extra updating of weights.

Yeah, that's correct. A nice diagram that they draw, I think on the first page of the paper, shows weight updates horizontally and this in-context learning vertically, just to make clear that these are not related. So this is giving