Transformer networks are a powerful type of neural network. They can exhibit signatures of phase transitions, such as the acquisition of a map, and use contexts to modify their effective weights. The free energy formula can be used to calculate the free energy of regions of parameter space, and evidence suggests this can be used to map contexts to weights. Understanding the routing and subtasks of neural networks can help us interpret them, potentially leading to improved performance.

The data distribution for Transformer models is generated by sampling consecutive token sequences of a given maximum length from a corpus. Predictions are made using the same weight vector for samples of different lengths, which is not typical in SLT. The true distribution contains sequences of all lengths, with each (input, next-token) pair forming the input and output. The loss associated with a given weight vector is the sum of the losses obtained for each input length, calculated by averaging cross entropies over the true distribution of the next token given the input and weight vector. The contexts-to-weights hypothesis suggests that inputting a context to a Transformer, for certain distributions of tokens, can lead to a similar prediction as a Transformer with a different weight vector W′.

KL divergence is a measure of the difference between a true distribution and a model, and is important in large language models. Model selection is the process of selecting models with higher evidence; in the context of large models like the Transformer, it can be internalised as selection between regions of parameter space, with training minimising the KL divergence by taking gradient steps against the empirical loss. Jumps in the posterior's preference between regions are known as phase transitions, and are related to the contexts-to-weights hypothesis, which suggests that samples of different lengths may not contribute equally during training.

The free energy formula can be used to calculate the free energy of a region of parameters that satisfies certain conditions. Evidence is accumulated from a sample of sequences of varying lengths, where the free energy is determined by the outputs of a Transformer. It is proposed that the effective weights of the Transformer may depend on the context of the task, and that this mapping is continuous. This suggests that the context of the task can be used to modify the weights of the Transformer.

A Transformer could potentially be replaced by one at a context-dependent weight, denoted f_C(w_A^*), which is likely to be highly degenerate. Computing the evidence requires integrating φ(w) dw and taking the product over the data samples, which is complicated by the degeneracy of f_C. A more complex routing singularity can send queries, or direct predictions, to regions of weight space that are suited for a given sub-distribution or context; this is analysed by applying the free energy formula to both sides of the equation. This could support the idea that contexts get mapped to weights, and that effects like this will be seen when integrating over weight space.

Data from two contexts, C1 and C2, can be represented in a single data set, D_n. The simplest degeneracy occurs when the two contexts depend on disjoint sets of weight variables in a neighborhood of w_A^*. The integral then has two contributions, one from the singularity at f_{C1}(w_A^*) and one from the other singularity. This provides insight into the phase structure of a large language model, suggesting possible predictions for other regions. Degeneracy may also help recover the original weights from a model, introducing fake degrees of freedom.

The free energy formula controls phases and phase transitions, which can be understood by looking for a description of the overall system in terms of contributions from each singularity. During training, contexts are recognized, and sub-models share the same parameter space. It is possible that a different set of weights for a Transformer can be discovered, which could have consequences for the free energy, and could be interpreted in terms of deformation of singularities. Further research is needed to explore this hypothesis and interpret it in terms of maps from parameters to parameters.

Interpretability of neural networks can be broken down into two parts: understanding the routing and understanding the subtasks. Recent research has looked at the link between phase transitions and the acquisition of structures in the network, potentially matching a theoretical picture. Transformer networks can be used to detect signatures of phase transitions, such as the acquisition of a map, although it is unclear how simple a model would have to be for in-context learning to emerge; some form of long-tailed data distribution is likely required.

The contexts-to-weights hypothesis suggests that inputting a context to a Transformer, for certain distributions of tokens, can lead to a similar prediction as a Transformer with a different weight vector W′. This was explored in the last seminar, where the speaker argued for evidence of this hypothesis using ideas such as singular value decomposition and RG flow. It is assumed to be valid for now.

The data distribution for Transformer models is generated by sampling consecutive token sequences of a given maximum length from a corpus. The true distribution contains sequences of all lengths, with each (input, next-token) pair forming the input and output. Predictions are made using the same weight vector for samples of different lengths, which is not typical in SLT. The set of possible tokens is on the order of 40,000.
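As a concrete sketch of this sampling procedure (toy corpus and helper names are assumptions, not from the talk): draw a window of C consecutive tokens and emit one (input, next-token) pair for each input length, so every length is equally represented.

```python
import random

def sample_pairs(corpus, C, rng):
    """Draw one window of C consecutive tokens and emit the C - 1
    (input, next-token) pairs, one for each input length 1..C-1."""
    i = rng.randrange(len(corpus) - C + 1)
    window = corpus[i:i + C]
    return [(tuple(window[:l]), window[l]) for l in range(1, C)]

rng = random.Random(0)
corpus = [rng.randrange(5) for _ in range(1000)]  # toy token stream, |S| = 5
pairs = sample_pairs(corpus, C=8, rng=rng)
```

For a real model C is around 4,000 and |S| around 40,000; the structure of the pairs is the same.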

The loss associated with a given weight vector is the sum of losses obtained for each input length. This is calculated by taking the average of cross entropies over the true distribution of the next token given the input and weight vector. The model is specified as the Transformer, and the KL Divergence is calculated to show that it is the same as the loss function, up to the entropy term.
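A minimal numeric sketch of this loss, with a softmax over weighted token counts standing in for the Transformer (the toy model and all names here are assumptions; only the shape of the loss matches the talk):

```python
import math

def toy_model(tokens, w, V=5):
    """Stand-in for T_w: softmax over w-weighted token counts."""
    logits = [w * tokens.count(t) for t in range(V)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def empirical_loss(sample, w):
    """L_n(w): average negative log-probability of the true next token."""
    return -sum(math.log(toy_model(list(inp), w)[nxt])
                for inp, nxt in sample) / len(sample)

sample = [((0, 1, 1), 1), ((2,), 2), ((3, 3, 3, 0), 3)]
loss = empirical_loss(sample, w=1.0)
```

At w = 0 the toy model is uniform, so the loss is exactly log V; weights that up-weight repeated tokens do better on this particular sample.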

KL Divergence is a measure of the difference between a true distribution and a model. It can be rewritten as a sum over disjoint spaces of inputs and outputs. This is significant in a large language model, where different input lengths are mixed together and the same set of weights are used to compute predictions for them. The interaction between the KL Divergences for different inputs is one of the key points to consider when looking at language models.
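The decomposition can be checked numerically. A toy sketch (two "input lengths", each with its own next-token distributions; all numbers invented): the KL divergence over the disjoint union equals the length-weighted sum of per-length KL divergences, because the length distribution itself is not modelled.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# next-token distributions for two input lengths (invented numbers)
q1, p1 = [0.7, 0.3], [0.6, 0.4]
q2, p2 = [0.2, 0.8], [0.5, 0.5]

# joint distributions on the disjoint union, lengths equally likely,
# with the model matching the true length distribution (it is not modelled)
q = [0.5 * x for x in q1] + [0.5 * x for x in q2]
p = [0.5 * x for x in p1] + [0.5 * x for x in p2]

total = kl(q, p)
per_length = 0.5 * kl(q1, p1) + 0.5 * kl(q2, p2)
```

The interesting point from the talk is that the same weight vector controls every summand at once, so the summands cannot be minimised independently in a real model.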

Model selection is a process of selecting models with higher evidence. In the context of large models like the Transformer, model selection can be internalised by selecting between different regions of parameter space and comparing the evidence for each. Training minimises the KL divergence, which is the cross entropy minus the entropy of the true distribution, by taking gradient steps against the empirical loss, the negative logarithm of the Transformer output given a sample.

Model selection can be viewed as an internalized process of phase selection, where two regions W_1 and W_2 are preferred according to evidence. The free energy, the negative logarithm of the evidence integral, is the effective Boltzmann weight for a coarse-grained version of the distribution. Jumps in the posterior's preference between regions are known as phase transitions, and for large language models these can feed into special features of the posterior and the free energy. This is related to the contexts-to-weights hypothesis, which suggests that samples of different lengths may not contribute equally during training.
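A one-dimensional numeric sketch of phase selection (the toy loss and the two regions are invented): the region containing the loss minimum has lower free energy, so the posterior prefers it.

```python
import math

def free_energy(L, a, b, n, steps=10000):
    """F_n([a, b]) = -log integral of exp(-n L(w)) over [a, b],
    flat prior, trapezoid rule."""
    h = (b - a) / steps
    z = sum((0.5 if i in (0, steps) else 1.0) * math.exp(-n * L(a + i * h)) * h
            for i in range(steps + 1))
    return -math.log(z)

L = lambda w: (w - 0.3) ** 2           # toy loss, minimum inside region W_1
F1 = free_energy(L, 0.0, 0.5, n=100)   # W_1 contains the minimum
F2 = free_energy(L, 0.5, 1.0, n=100)   # W_2 does not
```

Comparing F1 and F2 is exactly the "compute the integral for each region and see which negative logarithm is smaller" step described above.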

The free energy formula states that if a region of parameters satisfies certain conditions, including an optimal parameter that minimizes a function, the free energy of that region can be calculated. This can be used to reason about the behavior of a singular model, although it is not always rigorously justified.
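In symbols (notation reconstructed from the talk: L_n is the empirical loss, w_A^* the optimal parameter in the region W_A, and λ_A the relevant RLCT):

```latex
F_n(W_A) \;=\; -\log \int_{W_A} e^{-n L_n(w)}\,\varphi(w)\,dw
\;\approx\; n\,L_n(w_A^*) \;+\; \lambda_A \log n
```

where λ_A measures the complexity of the singularities of K(w) − K(w_A^*) on W_A, which need not be the zero level set.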

In a sample of sequences of varying lengths, the free energy of any region in parameter space is determined by the outputs of a Transformer. The formula for the evidence is an integral of the product of the outputs of the Transformer over the sample. If the contexts-to-weights hypothesis is true, the partition function can be calculated by running the Transformer on each input and reading off the position of the true next token in the output.
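Written out (reconstructed notation): for a sample D_n = {(t_i, t'_i)} of input/next-token pairs,

```latex
Z_n(W_A) \;=\; \int_{W_A} \prod_{i=1}^{n} p(t'_i \mid t_i, w)\,\varphi(w)\,dw,
\qquad
p(t' \mid t, w) \;=\; \operatorname{softmax}\big(T_w(t)\big)_{t'}
```

so each factor is literally "run the Transformer on the input and take the position of the true next token in the softmax output".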

A hypothesis is proposed that for some samples of text, the effective weight of the Transformer depends on the context of the task. This means that for any given weight in the region, there is a weight depending on the context that can be used to compute the output of the Transformer on the input stripped of the context. It is assumed that this mapping is continuous, so that if the hypothesis is true for one weight, it is true for all weights in the neighbourhood. This suggests that the context of the task can be used to modify the weights of the Transformer.
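Stated as a formula (a reconstruction; the map f_C from weights to weights is the talk's notation for the context-dependent weight): for all w in W_A there is a weight f_C(w) such that

```latex
p(t' \mid C\,x,\; w) \;\approx\; p(t' \mid x,\; f_C(w)),
\qquad w \in W_A,
```

with w ↦ f_C(w) continuous, so prepending the context C acts like moving the Transformer to the modified weight f_C(w).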

The hypothesis suggests that a Transformer can be replaced by one at a context-dependent weight, denoted f_C(w_A^*). This weight is likely to be highly degenerate, as it is able to deal with multiple different contexts, which means there will be many directions in weight space that are irrelevant to the task, along which the derivatives of f_C will be zero. A calculation of the model evidence for this region involves integrating φ(w) dw and taking the product over the data samples. However, due to the degeneracy of f_C, this is not straightforward: every point has a pre-image (fiber) which must be integrated over.

Integrating the posterior over a region requires a factor corresponding to integrating the measure over the fiber. This factor, φ̄, is likely to contribute to the integral and make the singularity worse if f_C is degenerate at the most important point, f_C(w_A^*). This is because the prior, which is positive, does not contribute to the asymptotics.

The idea is that a more complex routing singularity can send queries, or direct predictions, to regions of weight space that are specially suited for a given sub-distribution or context. Applying the free energy formula to both sides of the equation shows that the RLCT controlling the free energy near w_A^* is strictly less than the RLCT controlling the level set that passes through f_C(w_A^*). This is likely due to the difference in the complexities of the two functions, and is a measure of the degeneracy of the map. This could be used to support the idea that contexts get mapped to weights, and that effects like this will be seen when integrating over weight space.

Degeneracy of maps between sub models can be thought of as contributing to the free energy of a large language model. Introducing fake degrees of freedom may help recover the original weights from a model. Degeneracy can provide insight into the phase structure of a large language model, suggesting possible predictions for other regions.

Data from two contexts, C1 and C2, can be represented in a single data set, D_n. The simplest kind of degeneracy in this setting occurs when the two contexts depend on disjoint sets of weight variables in a neighborhood of w_A^*. This is reasonable, as the weights are attention weights, which can be used to process different entities; the two contexts involve different entities, such as cars and colors.

The integral splits up into a product of two factors, one depending on the u coordinates and the other depending on the v coordinates. When the KL divergence is a sum, the integral will have two contributions, one from the singularity at f_{C1}(w_A^*) and one from the other singularity. This shows how the phases and phase transitions of the mother singularity can be understood, with the training process for the Transformer controlled by the asymptotics of the integral.
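This factorization is easy to verify numerically in a toy two-dimensional model (the specific K_1, K_2 below are invented): when K(u, v) = K_1(u) + K_2(v) on disjoint coordinates, the partition function factorizes, and the free energies of the two contexts add.

```python
import math

n = 50
K1 = lambda u: u ** 2   # context C1 depends only on the u coordinate
K2 = lambda v: v ** 4   # context C2 depends only on the v coordinate

def integrate(f, a, b, steps=400):
    """Midpoint rule."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

Zu = integrate(lambda u: math.exp(-n * K1(u)), -1.0, 1.0)
Zv = integrate(lambda v: math.exp(-n * K2(v)), -1.0, 1.0)

# full 2-D integral of exp(-n (K1 + K2)) over the square, same grid
h = 2.0 / 400
Zjoint = sum(math.exp(-n * (K1(-1 + (i + 0.5) * h) + K2(-1 + (j + 0.5) * h)))
             for i in range(400) for j in range(400)) * h * h
```

Since the integrand separates, Zjoint equals Zu times Zv, so taking negative logarithms shows the free energy is the sum of the two one-dimensional free energies, one per singularity.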

The free energy formula controls phases and phase transitions. To understand these, a divide-and-conquer approach is used, looking for a description of the overall phase transition system in terms of contributions from each of the daughter singularities. Each of these singularities has its own phase structure, and the complexity arises from many hundreds of them superposing their phase structures. During training, a given context is recognized, and this is an example of a phase transition. Sub-models share the same parameter space and number of dimensions, and the free energy combines energy and entropy terms.

In-context learning is an exploration of the possibility that a different set of weights for a Transformer can be discovered. This could have consequences for the nature of the posterior and free energy, and could be interpreted in terms of deformation of singularities. It is uncertain whether this hypothesis is true, and further research is needed to find the simplest model in which to test it. It may also be possible to interpret in-context learning in terms of maps from parameters to parameters.

Interpretability of neural networks can be decomposed into two parts: understanding the routing and understanding how each subtask is dealt with. Recent work from Anthropic and elsewhere looks at the link between phase transitions, or something that looks like phase transitions in various measures of effective dimensionality, and the acquisition of structures in the network. This could match up with a theoretical picture, though it is difficult to argue conclusively. It would be suggestive to see the order in which certain structures are learned and appear in the training process, together with generalized susceptibilities which appear to diverge.

Transformer networks are capable of computing on arbitrarily many tokens, and can be used to detect signatures of phase transitions, such as the acquisition of a map. It is unclear how simple a model would have to be to observe in-context learning, and it is likely that some form of long-tailed data distribution is required for in-context learning to emerge.

Welcome everyone. This is either the second or third seminar talk on in-context learning, depending on whether you count the one that was some months ago. So I'm going to begin by briefly recalling what we did last time, just to set the stage for the discussion today. This is all speculative in some sense, right, so this is meant to be a stimulus for investigation. Okay, so you can see here, maybe here rather, the key hypothesis from last time, which was this Question 2. It is the hypothesis which, following on from one of the comments last time, I'm going to call the contexts-to-weights hypothesis, just for the sake of having a more useful term of reference than "Question 2". Contexts to weights: this was the hypothesis that, given a Transformer, for some kinds of distributions of tokens, if you prepend a context, in this case X underline, the behavior of the Transformer in prediction on that task is in some sense similar to a Transformer with a different set of weights. Sorry, the context is X underline, X_{L+1}, so that whole sequence, and then the next thing you're trying to predict; you're trying to make a prediction based on what comes after that, that's the blank. The hypothesis is that with that context, underlined in green, the behavior of that Transformer is similar to a different Transformer that is parametrized by a different weight vector W′, on just that new input data. All right, now that's probably not true for all distributions. It's probably not true at the beginning of training for a Transformer, et cetera. So this has to be rather carefully understood, and maybe even very carefully understood it is false, but we're going to explore this hypothesis, and what it might mean for the structure of phases and phase transitions for a Transformer model, in today's talk. Let me see if there's anything else I want to say from these boards before I... again, yeah, so the second half of the last talk was me arguing, in a
pretty rough, not very serious way, for some evidence for this Question 2, or at least how it might arise. You can do better than what I did last time, using ideas like singular value decomposition and some RG flow ideas, but that's a project for Ben and Jung Tien to tell you about at some future date; I will pass off that problem. So that's not meant to be a full justification of Question 2; it's a question whether or not that is valid, but we're going to assume it for now. Any questions about this part before we

move on. Okay, so I'm going to start today by formally putting the Transformer, and more specifically the Transformer plus the data distribution you see in, say, the GPT paper, so you could say the large language model perhaps, which is this combination, formally in the context of SLT. So if you'll follow me, we're going to move to a fresh set of boards over yonder, so let's go for a walk. All right, so I need to say formally what the data distribution is for this class of models. Given some maximum context size C, that's around 4,000 tokens for the model you'll find called GPT-3.5 on OpenAI's service for example, the way we generate data is: we sample consecutive token sequences of length C from a corpus, e.g. of text if it's a language model. So let's call those tokens T1 through TC. Then we form the pairs: the first is the first token comma the second, so this is the input and this is the output in each of these pairs. The second thing in the true distribution is the sequence T1, T2 followed by the next token T3, and so on. All right, so that's how you generate samples from the true distribution. Note that in this true distribution, sequences of all lengths are equally likely, right; it's not that there are more short samples, in the sense that the length of the input is short, than long samples. The reason to spell this out is that this isn't, as far as I can tell, detailed in the Phuong-Hutter paper, which is what we're basing the formal discussion of Transformers off, and we haven't covered it in the earlier seminars either, and it's crucial for actually formulating this pair, consisting of the Transformer model and the true distribution, as a learning machine in the language of SLT. So if S is the set of possible tokens, which if I remember correctly is on the order of 40,000, I think, for say GPT-3, we actually have true distributions, and I'll denote them with
subscripts: so this is q1(x, y), that's on pairs, so that's something like this; we have q2(x, y) on S² × S, and so on. Okay, so note we make predictions, in the notation I used last time, where T_W means the Transformer with attention weights W, and again I'm ignoring the rest of the weights in a Transformer model. Now we make predictions for samples that look like that, and predictions for samples that look like that, all using the same weight vector. This is a little unusual, right, where in SLT we would typically call this X,

right, the space on which the true distribution is a distribution. So even though we're really considering somehow a family of true distributions over different context lengths, maybe context is the wrong word, input lengths, we're making predictions using the same weights. Okay, now as defined in, say, the Phuong-Hutter paper, the loss associated to a given weight vector W is the sum over the losses that are obtained for each one of these lengths. I think I probably wrote this slightly differently last time, so maybe I'll spell out what the ingredients here are. Okay, so the first board defines an effective distribution over the disjoint union, maybe I'll write this somewhere. The actual space X over which the true distribution lives is, I could write it like this, the disjoint union over l = 1 to C − 1 of S^l Cartesian product with S, and the true distribution lives on that and is given by all these things on the disjuncts. All right, so the first board defines effectively all these distributions, and then the loss function is the average over that true distribution of the cross entropies, where this here just has all its probability mass in the true next token; sorry, it's given by that distribution of the next token given the T underlines you see in the data. And empirically, if you take a sample, the empirical approximation to that loss function is, I think, what I put up on the board last time. All right, so the learning machine (W, p, q, φ) is specified: I've told you the true distribution, I've told you X, W is the space of attention weights, and I don't care about the prior for now. So then I have to give you the model, and the model just is the Transformer. As usual, we assume that the distribution of inputs X is not modeled, so that just factors off, and this part here is: just run the input sequence through the Transformer, read off the T′ entry of the output of the softmax; that's the probability of
that token given that input and that weight vector. So that's it, that's the formulation of the Transformer plus this form of data distribution as a statistical model. Next I'll compute quickly the KL divergence for you and show you that, up to the usual entropy term, it's the same as this loss function, and then we'll start talking about the posterior. Are there any questions? Okay. So I'll be a little bit quick with this, because it's a calculation that many of us have done numerous times, but I do

want to at least quickly go through why, once you have a learning machine, you can formulate the KL divergence between the true distribution and the model, and why that reduces to this loss function, whose empirical approximation we discussed last time as being the loss you would associate to a sample in the usual machine learning sense. All right, so the KL divergence given parameters W is by definition the integral over this space X. And maybe one reason to do this is that it's quite common, when you're looking at SLT for the first time, to be a bit confused by the fact that in, say, the gray book you'll see things like p(x | w); you won't usually see conditional distributions like the ones we're dealing with when we're modeling neural networks. So there's a little wrinkle here. Okay, so that's the KL divergence by definition. We can do a couple of things: we can rewrite this as q(t′ | t) times q(t), we can do the same with the numerator and denominator inside the logarithm, which replaces this by this, and by the hypothesis I mentioned before, that we're not modeling the distribution of x, we get a cancellation. Now we observe that X is a disjoint union over spaces, so I can write this as a sum. So the main point I want to make here is that it's interesting to think about the interaction between these different summands, right: the loss, or the KL divergence, is made up of contributions from each of the input lengths, and obviously the phenomenon of in-context learning, or this contexts-to-weights hypothesis, relates these KL divergences, or loss functions, for different input lengths. This is one of the ways in which a large language model, and I'll use that term henceforth to refer to the combination of a Transformer and a data distribution of this kind, which mixes inputs of different lengths but uses the same set of weights to compute predictions for all of them... so a large language model, from the point of
view of statistical learning theory, is interesting for several reasons, and one of them is this phenomenon of mixing different lengths and the interaction between the KL divergences for those lengths. All right, so we get a sum like that, and I'm going to term that summand KL^l; that will be less confusing in the notation than it is to say out loud. Okay, superscript l meaning the KL divergence for, that's kind of like a, I don't know if I want to call it a sub-model exactly, but for the model... so that's the KL divergence for the

model, with the same set of weights, but the model and true distribution being restricted to sequences of length l. Well, that's the KL divergence, so you can just split that logarithm into a difference, and you'll see that that is an entropy term, the entropy of the true distribution, or its negative, plus what I wrote earlier, right; what's left is just the cross entropy. Okay, so in particular, minimizing the KL divergence means minimizing this loss. Okay, so now I'm going to talk about the Bayesian posterior associated to this model. Let me just recall the definition. Given a set of samples, now what your mind should be going to is the fact that when you sample from this distribution, you'll get a distribution over the input lengths as well, right; so you could kind of divide any sample into some samples of each length, which means there's a natural decomposition of the posterior as a product of contributions from each of those lengths. So given a set of samples of varying lengths, the posterior is, by Bayes' rule, one over the partition function, times the prior, and then this here, where L_n(w) is the empirical loss, which is exactly the loss you would write down, the thing you would take a gradient step against, for the Transformer given a sample by the procedure that I defined earlier. So the empirical loss is the cross entropy; actually, I wrote this incorrectly in the notes, the second formula is correct. So we would just look at the earlier formula for L: if I get a sample, then that's some particular T underlines, and given a T underline I'll get some precise next token, and so what I'll get, instead of that cross entropy, is exactly negative the logarithm of the output of the Transformer. Thank you. Okay, so let me briefly recall how model selection is supposed to work in the context of learning. So the model evidence, which is this denominator... then it hasn't moved across, sorry, oh thank you. So we view model selection, I'm not here to convince you
that this is the correct way to choose a model, but anyway the dogma is that we view model selection as being about selecting models with higher evidence. If you accept that view, then it makes sense in the context of very large models like a Transformer, perhaps, to view model selection as internalized: rather than selecting between different models, we're really selecting between given regions of parameter space, which we integrate over and compare the evidence for, and that's the point of view that's

emphasized if we talk about phases and phase transitions in a learning machine. So we view model selection as being internalized; there's a single model, so you might call it phase selection, where two regions W_1, W_2 are preferred according to their evidence. Maybe this is a misuse of the term, but so for A = 1, 2 you can compute this integral and see which is larger, or, what is the same, you can take the negative logarithm and see which is smaller, and that's the free energy of course. One way to think about the free energy is that it's the effective Boltzmann weight for a coarse-grained version of the distribution, where you partition W and integrate the posterior over each cell of the partition. Okay, so I'm just setting up the general story: a phase is going to be some region of weight space satisfying some conditions, which we won't talk about here. When we say model selection, what we're referring to ultimately is where the posterior concentrates for a given value of n, right; as you increase n, the posterior will shift around, it will prefer certain regions of weight space to others, and when those preferences jump, those are called phase transitions. So what I'm leading up to is a discussion of how the peculiarities of a large language model, meaning the peculiarities of the Transformer model itself and the nature of the data distribution, feed into special features of the posterior and the free energy which are not generic, which are somehow special to large language models, and perhaps related in an interesting way to this hypothesis, Question 2, or the contexts-to-weights hypothesis. Maybe I'll pause for questions. Are there any questions about this foundational stuff? Then I will proceed. "Hi, yeah, can you hear me?" I can, hi. "I had a question about the data sample D_n." Yep. "In this case, I could imagine a situation where for different lengths, small l, you have different amounts of samples." That makes sense. "I'm not sure
exactly what you assume here, but you can imagine there's some complicated combinatorics." Yeah, the way that the Transformer is usually trained, you have equally many samples of each length, because you first sample a context and then you generate, right. "Okay, I guess, of each length, yeah, okay." All right, so it's, yeah, so it's equal. Hold on, thank you. I think, however, as I'll discuss a bit later, that effectively, if the hypothesis about contexts in Question 2 is correct, it's not always the case

that you're really, sort of, equally training on all lengths. And that's consistent with a kind of naive view of training a Transformer, or any model, which is that it will tend to go for the low-hanging fruit first. So it's probably easier to train next-token prediction given one previous token, just looking for sort of naive correlations, than it is to do some more complicated thing with many tokens in the context. So maybe it's reasonable to have an intuition that, even if the data sample contains things of all lengths, at various stages in training maybe shorter things are making more contribution to the gradient, or something; I'm not sure. "Yeah, I mean, the effective entropy rate that you'll see will also go down the larger your context window is; this would vary as well, right?" I'm not sure what you mean by effective entropy rate. "Well, so if I only see, like, three previous tokens, there's a limit to how much I can predict about the next token. If I can increase the context length, then... and that itself is an objective fact about reality that will vary." Yeah, yeah, that's interesting. Okay, so part of what I want to do with this seminar is illustrate how some of the basic, fundamental theorems in SLT can be used to reason about things. It won't always be completely rigorously justified, and here's an example. One of the key identities that you can use to think about what's happening in a singular model is what we've termed a free energy formula. It's developed not so much in Watanabe's gray book, the way we're terming it, because that's assuming the realizable case, but in the WBIC paper and in the green book; in certain circumstances, for example the renormalizable case, you have the following formula. I'm going to apply it a bit outside where we know it's justified, so this is dependent on ongoing research, but it's at the level, I think, of a reasonable
physics talk. So the free energy formula says the following. There's another typo in my notes, I apologize. It says that if you take one of these regions, and they satisfy some hypotheses which are really technical, and those hypotheses include there being a point in this space, the subspace of parameters, which is an optimal parameter, so it minimizes, among all the parameters in this set, this function L_n, and then there are some further conditions, then the free energy that I just

defined is approximately so asymptotically in n given by n l n w a star Plus Lambda a log n where this is to do with the singularities of uh well let's say k minus k w a star right so this is not assuming that this is a point of w0 so I'm not assuming this is zero this is not the zero level set and these are not the singularities of w0 the zero level set these are singularities some other level set and Lambda a is is a measure of the complexity of those singularities okay so this is a very important formula so roughly speaking this is familiar it's an energy term plus an entropy term okay so I'm going to take this for granted in the following discussion okay so we take a sample it's got in it sequences of all different lengths and each of those points in the sample contributes to the free energy of any phase any region of parameter space so let me just write out what the free energy uh is what the partition function is sorry Dan when you say singularities of K minus K of w star yep what's the input for K is it just arbitrary w uh yeah W in this set w a that's right so wa is a subset of the the space of parameters w and by this function here what I mean is the function that sends a weight that shifts the origin of the output gotcha thank you yeah so it will be zero at w a star now over all of w that function may be negative right because it's certainly uh if we're imagining a phase which doesn't correspond to a region near w0 then it's certainly true that there are other weights W that make that difference negative because K there is less than k at this particular point but on this wa this function should be non-negative so that's what I kind of said earlier right that w a star should be a local minimum of the um of the energy cool thanks yep all right so let's just write out uh in the formula for the evidence on the previous board there was this Ln there and we have a formula for Ln in terms of the outputs of the Transformers let me just substitute that in uh that 
We get Z_n(W_α): the integral over W_α of the prior times the product, over all the tuples in the sample, of the Transformer's output probability, where I run the Transformer at w on the input and then take the position of the true token in the output. Okay. Now I'm going to think about what happens if the contexts-as-weights hypothesis is true. ... I'm just realizing that I've forgotten to feed the cow, and if I don't feed the cow you're going to hear moos behind me every thirty seconds soon, so please excuse me for two minutes while I go and throw some hay to the cow. ... All right, I'm back.
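The role of λ as a complexity measure can be seen numerically in a toy case: for a one-dimensional stand-in for the KL divergence K, the partition function scales as Z_n ∼ C n^(-λ), with λ = 1/4 for K(u) = u⁴ and λ = 1/2 for K(v) = v², and for K(u, v) = u⁴ + v² the integral factorises, so the exponents add. This is my own illustrative sketch, not anything from the talk:

```python
import math

def partition(f, n, lo=-1.0, hi=1.0, m=200000):
    # midpoint Riemann sum of Z_n = integral of exp(-n * f(w)) over [lo, hi]
    h = (hi - lo) / m
    return sum(math.exp(-n * f(lo + (i + 0.5) * h)) for i in range(m)) * h

def rlct_estimate(f, n1=1_000.0, n2=10_000.0):
    # Z_n ~ C * n**(-lam)  =>  lam ~ -(d log Z)/(d log n)
    z1, z2 = partition(f, n1), partition(f, n2)
    return -(math.log(z2) - math.log(z1)) / (math.log(n2) - math.log(n1))

lam_u = rlct_estimate(lambda u: u**4)  # expect roughly 1/4
lam_v = rlct_estimate(lambda v: v**2)  # expect roughly 1/2
# K(u, v) = u**4 + v**2 factorises, so Z_n factorises and the RLCTs add
print(lam_u, lam_v, lam_u + lam_v)
```

Estimating λ as minus the slope of log Z_n against log n recovers roughly 1/4, 1/2, and hence 3/4 for the sum, which is the additivity of RLCTs for disjoint variables that comes up again later in the talk.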

You've never seen a seminar interrupted for that reason before. Okay, let's continue, assuming this hypothesis is true. So suppose the contexts-as-weights hypothesis is true for some sample D_n. If you think about a real Transformer, any given sample is text from all over the place; some parts of the sample will come from some sub-distribution corresponding to a certain task, for which we might imagine there's a context, or a set of contexts, that makes this contexts-as-weights hypothesis approximately true. It's very unlikely to be true for a whole sample D_n, so just keep in mind that we restrict to some sub-sample for which it is true, for example; that's how we're going to proceed. Again, we suppose that each one of these inputs looks like some context C, a sequence of tokens that identifies the task, followed by the actual input. To say the hypothesis is true is to say that if I compute the output of the Transformer on that and look at the entry for the true token, it's approximately... maybe I'll write it out more carefully, in the following sense. I'm going to assume an even stronger statement: that it's not only true for the weight w but for all nearby weights, and in some coherent fashion. So I suppose that for all w in W_α there is a weight f_C(w), depending on the context C and the weight w, such that, first of all, what I wrote earlier holds: the effect of prepending the context C is that the Transformer computes as though it were a Transformer at this modified weight f_C(w), run on the input stripped of the context. Further, I want to assume that this varies, somewhat reasonably, let me just say continuously in w. So there's some mapping defined for all the points in the phase, given this context, and we have a picture that looks like this: we have W_α, we have w_α^*, and there's some other region of weight space, which is what you get when you look at the Transformer on any of
the weights in this phase. Now, why is this somewhat reasonable? Well, if the first line is true then it's true in some neighborhood, perhaps; you can restrict inside the phase, and restrict given that approximation, until you get something. And the kind of rough argument for why this might be true that I gave last time would suggest that if the first part is true then this stronger, coherent version is also reasonable. Note, and this is what's worth paying attention to here, that the map f_C

is likely to be highly degenerate as a mapping. Why? Well, consider this in the rough picture I was painting last time, of the mother singularity of the Transformer routing calculations to these different context-dependent weights, the points f_C(w_α^*). You could imagine that this phase W_α may be able to deal with multiple different contexts: D_n is sitting inside a larger distribution where you might see different kinds of tasks, so you might see C, you might see C′, and if this is the right way to think about in-context learning then you would have something like this contexts-as-weights hypothesis for a single W_α and multiple different contexts. That means there's much more information around W_α than you would expect; the image f_C(w_α^*), under this intuition, just knows how to deal with this one particular task. Another way of saying why the map is degenerate: there will be many directions in weight space, going away from w_α^*, which have to do with other problems that have no relevance to C and its particular task. You would not expect those directions to change f_C, because they're irrelevant, so there will be many directions in which the derivatives of f_C are zero. That's the statement of degeneracy.

All right, now let me do a little calculation with the evidence. So that's the definition. "You're assuming that they all start with this context, that's the C before x?" Yep. Well, the hypothesis says that I can replace the Transformer at w on the context-prefixed input by the Transformer somewhere else, so now's the interesting part. If you look at that last line, it looks kind of like the partition function you would define for integrating over this region here, because if you were to compute the free energy, the model evidence, for this
region, you would integrate dw φ(w), you would take the product over the data samples of the probability the weight assigns, et cetera. So if f_C were a projection, and its Jacobian were something simple, then this basically would be the evidence for that region. But I'm arguing that f_C should be thought of as highly degenerate, so it's not that straightforward. What I can think about here is that every point here has some pre-image, so I can think about integrating over this

region. The usual change of variables formula tells you that you get something that looks like this, where w now ranges over here, over the image, but to do the integral you also need to integrate over the fibers. Roughly speaking, I mean; this doesn't always work, of course, you can't always do integrals like this. If the fibers were regular this would be fine; maybe they're not, so it's more complicated, but for today's discussion I think the intuition is fine. So there needs to be some factor here which basically corresponds to integrating the prior measure over the fiber, and I'm just going to call it φ̄ and absorb it in here; I don't need to do anything with it. So this kind of looks like the evidence for that region, as I said, but I've changed the multiplicative factor from φ to φ̄. Now, if this map were regular, φ̄ wouldn't contribute to the asymptotics, for the same reason that the prior, as long as it's positive at the singularity that controls the asymptotics, doesn't contribute in the usual situation we deal with in SLT. But this f_C is likely to have singularities that contribute to this integral. So the argument I'm about to make is that this will be less than what I would get by just integrating the posterior over this region on its own. "The data set changed as well, right?" Right: here I'm taking the data set that has the context in it, while the data set I'm considering with respect to the posterior that I'm integrating doesn't have the context in it. So this is kind of the few-shot data set and this is the zero-shot data set. Okay, let me come back to arguing for this briefly. It's a bit hard to argue for unless you're familiar with how these integrals actually work, but what happens when you do them is that you get contributions from the factors that are
most degenerate. As I said, if the prior is positive it won't contribute; but if this f_C map has degeneracies, they will contribute to how degenerate the thing you were integrating originally was, which was the empirical KL divergence, or the L_n for this zero-shot data set, on the region f_C(W_α). So φ̄ can only make that worse, and will make it worse if it's degenerate at this most important point f_C(w_α^*). Well, arguing that f_C(w_α^*) is the most important point is a bit beyond the scope of this seminar, but you argue by looking at

how the asymptotics work: it's somehow reasonable to think that you'll get the degeneracy you see in that integral without the φ̄, plus some extra contributions from the difference between φ and φ̄, which make the exponent bigger. The more singular it is, the more those contribute to the powers of n in the denominator, so that makes the integral smaller, and we get a less-than-or-equals. All right, let me now try to see what that means and why it might be interesting. Let me apply the free energy formula to the left-hand side and the right-hand side; by construction they both have the same energy term. This is not a proof, but it's the idea that the RLCT that controls the free energy near w_α^* dominates the RLCT that controls the level set passing through f_C(w_α^*), and you think of that as being because w_α^* is a more complicated singularity. So this is a way of maybe getting at rigorously, given more work, the idea that this routing singularity, which can send queries, or direct predictions, to regions of weight space specially suited to a given sub-distribution or context, has to be more complicated, and the difference in the complexities is something to do with the degeneracy of this routing map f_C. If you look in Shaowei Lin's thesis, this is basically to do with composing maps and seeing what the singularities do when you compose maps, so it's like a functorial story around the RLCT; Shaowei Lin's thesis says a little bit about this, so this is somewhat reasonable. It's a measure of the degeneracy of this map. Here the intuition is that at w_α^* there's a more complex singularity, though you need to be a bit careful, because the functions whose singularities are being compared are different: one is for the few-shot data set and the other is for the zero-shot data set. Okay. Now, you don't have
to believe the details here to believe the general picture, which is that if it's true, this idea that contexts get mapped to weights, then something like this is the case. If the Bayesian picture is valid then you're doing integrals over weight space, and there's good reason to think that if the contexts-as-weights idea is true for one parameter then it's true for some nearby parameters, so you will get effects like this. They say that the free energy of a given phase for which the contexts-as-weights hypothesis holds, that is, a phase which directs predictions to some other regions, some other singularities, can be thought of as having contributions from the free energies of these sub-models, these other regions, plus the degeneracy of the maps that send you from one to the other. That's the high-level point, and that much I think is pretty solid; of course there's plenty of cheating going on in the argument, but the general picture seems very likely to be true. How am I going for time? Are there any questions? I'll pause here; there's a little bit more to say, but maybe later.

"In physics we introduce fake degrees of freedom all the time; we have ideas like the Hubbard-Stratonovich transformation, or something like Hamiltonian mechanics, where you introduce momenta. If we can't define the transformation on the original weights, can we somehow recover it by introducing fake weights, extending the weight space that we're integrating over?" That's a very interesting question. It's pretty natural to introduce... yeah, it's a good question; I haven't thought about how that would work in this particular example. Something I've played with a little bit is introducing fermionic degrees of freedom: there's a way of doing some of these kinds of integrals in physics by introducing chain complexes, by introducing fermionic degrees of freedom. So I've thought a little about what you're suggesting in that direction, but not here. I don't know; could you explain a bit the intuition for why introducing these additional degrees of freedom would help make this hypothesis true? "The way the previous tokens interact with the new tokens is through this nonlinear activation, so it seems difficult to get an exact statement about how to turn attention into a transformation of the weights, but if you can insert the weights somewhere else in the attention, maybe
it gets easier." Yeah, interesting idea; I'll have to think about that, but thank you. Any other questions or suggestions? Okay, I probably have about ten more minutes prepared and then we can just have discussion, as we did last time. I wanted to indicate one way in which this degeneracy arises, I mean the simplest form of this degeneracy, and elaborate on what happens in that situation, and then, informed by that, speculate about what this means for the phase structure of a large language model, which is

kind of where this is all trying to go. Unfortunately we've used up these boards, so if you'll bear with me we're going to move again to another set of boards, down in the caves; please follow me, and don't fall in that hole. If you can hear me but don't want to walk, that's fine: if you just reset your character you'll get teleported down here. That's the easy way to get to where we are now.

All right. The previous discussion assumed that my data distribution was all in one context; let me now assume that there are two, so my data sample consists of sequences with one prefix and sequences with a different prefix. So now suppose, to illustrate the possible kinds of degeneracy in f_C, that there are contexts C_1 and C_2, that our data set D_n looks like n_1 things in one and n_2 things in the other, with n_1 + n_2 = n, and that the contexts-as-weights hypothesis is true for both of them. Then let me do this integral again, the same integral. I'll get one factor which is a product over the things that use the first context, and another product over the second class of things, so I'll have a funny situation where this integral looks like it's integrating over two different parts of the weight space: in the first factor I'll get f_{C_1}(w), assuming contexts-to-weights for both, so approximately, and the other factor looks like it's talking about some other region. The simplest kind of degeneracy we could imagine in these two maps is that they depend only on disjoint sets of the weight variables, in some neighborhood of w_α^*. So here's W_α, here's w_α^*; let's consider some coordinates u and some other coordinates v in some coordinate patch around w_α^*, and assume... okay, to roughly justify why this is somewhat reasonable: what are these weights? Well, they're attention weights, so they're query, key
and value weights. The way I presented it last time, there were frozen entities and unfrozen entities, and the frozen entities were the things contributing to this f_C map. As for the nature of these frozen entities: suppose that for these two different kinds of tasks the Transformer has learned to use different entities to process the information, because there's an entity for cars and there's an entity for colors, and they're different, so the weights that are involved in one of the

specializations, f_{C_1}, are just different from the set of weights that are involved in f_{C_2}. In general that's almost certainly not true, but for the sake of illustrating the principles involved I think it's reasonable enough. So we assume that there are local coordinates u and v near w_α^* such that, and I'll write this in a somewhat inaccurate way, f_{C_1} = f_{C_1}(u) and f_{C_2} = f_{C_2}(v). What I mean by this is that the partial derivatives of f_{C_1} in all the v directions are zero: the value of f_{C_1} near w_α^* depends only on the u directions, and similarly f_{C_2} depends only on the v directions. Well, that's a situation we understand how to deal with in terms of these asymptotics; it's covered in Remark 7.2 of the grey book. It means this integral splits up into a product; that's the point. I'm assuming the prior splits as well, but the prior doesn't really matter for my purposes, so that's fine. So it's a product: f_{C_1} is a function of (u, v), because w has become (u, v) here, but f_{C_1}(u, v) only depends on u, so I have one factor that depends only on u and another factor that depends only on v. This integral is really just that integral times this integral; I won't write out the integration domains, but hopefully you get the idea. Now, what happens if you have KL divergences which are a sum whose summands depend on different directions in the space of parameters? The RLCTs add. I won't go through how that works, but what it will look like, when we go to the free energy, is this: the free energy formula at w_α^* has contributions where, if you think about it, some of the summands look like things around f_{C_1}(w_α^*) and some of them come from f_{C_2}(w_α^*), and then you'll have two contributions to the RLCT, one that comes from
the singularity at f_{C_1}(w_α^*), and one that comes from the other place. So this is like w_α^*, and then there are these two daughter singularities; something is happening near each of them, and that controls the asymptotics of the integral that's going on around here. The point I want you to take away from this example is the general idea that you might understand the phases and phase transitions of the mother singularity, meaning the whole training process for the Transformer, under

these hypotheses. So first of all, there's a backstory to how you understand phases and phase transitions: to a first approximation it comes down to the free energy formula, and a change in the balance between the energy and the entropy terms. The free energy formula is what controls phases and phase transitions, and what I've sketched here is an argument for how you can try to understand those phases and phase transitions in terms of contributions from each of these daughter singularities. So the idea would be to go and look for some description of the overall system of phase transitions in terms of the following ingredients, which is where I'll finish. It's a divide-and-conquer idea: the daughter singularities are simpler, their phase structure is easier to understand, and the complexity comes from many hundreds of daughter singularities somehow superposing their phase structures to form the behavior of the full model. In these terms: A, the acquisition of a given context would almost certainly be an example of a phase transition. There's a phase transition, you would suppose, where a given sub-distribution picks up a context and the model understands what to do with it, where a given context is recognized ("thank you, yes, is recognized"), i.e. the contexts-as-weights hypothesis begins to hold. At the beginning of training this hypothesis is not going to be true, if it's ever true, but at some point maybe it begins to be true for some sub-distributions, and maybe not all at once. The point in training, or the sample size, at which it becomes true for a given sub-distribution would be this kind of phase transition: type A, the hypothesis beginning to be valid. And then B: you'd have phase transitions in these, I'll just call them sub-models; I don't really know, they share the same parameter space and potentially
the same number of dimensions, so it's maybe a little hard to say that they're sub-models strictly speaking, but anyway. As I wrote on the previous board, in that line on the bottom the first term, the energy term, has some contribution from each of those daughter singularities, and the RLCT term also does. So the free energy, in this simple circumstance where it splits completely, just adds; in general it won't add, there'll be some interference. But you can see how, from that

description, the free energies, and therefore the phase structure, kind of emerge across these daughter singularities. If you fix all the daughter singularities but one, and that one undergoes a phase transition, then that will probably cause a phase transition in the overall thing; that's the kind of phase transition I'm talking about in B. And then there would be interactions between phase transitions of type B. This picture has a kind of interpretation in terms of deformations of singularities, in which this kind of interaction between the different deformation parameters is not unusual.

Maybe I'll just briefly summarize. This is an exploration of what would be true if this idea of in-context learning being a kind of discovery of a different set of weights for a Transformer were true. If it were true, there would be, kind of unavoidably, some consequences for the nature of the posterior and the free energy, along the lines of what I've sketched. So I think the real question here is whether or not question two is true; it seems like a very interesting question. I'll stop there and take questions. Thank you everybody.

Yeah, I guess Matt was suggesting last time to look for the simplest model in which you could try to discover whether this hypothesis is true; that seems quite important. And also, for an algebraic geometer, one is always tempted to go to the relative case: RLCTs are birational invariants just of the variety of true parameters itself, and as soon as you see that you wonder about maps. So it's also quite natural to try to interpret in-context learning geometrically, in terms of maps from parameters to parameters. That's sensible, like how in that paper on why GPT can learn in context they think of it as a kind of implicit gradient-descent updating, a dual form of, I guess, linear attention; but basically you mean defining a function
that takes the attention update to a gradient-descent update, right? Yeah, hopefully we can figure out a version of that. Just for some background: we promised, two seminars in a row, to present those papers, but we first have to find a version of those papers that we feel is worth our time to present, so we're still doing that, and that'll come at some point. There probably is, for some data distributions, some link between in-context learning and gradient descent,

but I think we want to present it quite differently from the way they're doing it. "This is maybe just a much broader question, but how do you think this links to mechanistic interpretability approaches, where we're trying to read the thinking of a neural network, like this logit lens stuff? How do we start making the link between this kind of picture and that?" That's a good question that I don't really know the answer to. I guess maybe we have some intuition for the role of attention in directing computations to different entities; I think many people have seen phenomena like that in NLP and other kinds of Transformer tasks. So my first, not very well thought through, reaction to that question is that it suggests you can decompose interpretability into two parts: one is understanding the routing, and the second is understanding how to deal with each subtask. "But you probably wouldn't need this picture to have that insight, right? That seems like a pretty natural thing to try to do anyway." I think maybe... one of the things I'm really interested to look at more closely, in some of the more recent work coming out of Anthropic and probably elsewhere, is the link between phase transitions, or something that looks like phase transitions in various measures of effective dimensionality, and the acquisition of structures in the network. Maybe you can match up the phase structure there with some picture that looks like what I'm presenting here. It's going to be very hard for the theory to touch base with any concrete particular networks or representations, but at the level of those phase diagrams there might actually be a connection between a theoretical picture like this and the more mechanistic interpretability picture. So it would be, of
course, very suggestive if you're studying mechanistic interpretability and you see the order in which certain kinds of structures are learned and appear, together with some sort of generalized susceptibilities which appear to diverge, like the rate of change of the effective dimension in the toy models of superposition paper. If you saw some sort of structure in those generalized susceptibilities, and in the theory you could also see that these sorts of things should arise in the training process, then

maybe you could point to measurements of those generalized susceptibilities and, even if you can't figure out what's going on inside the network, conjecture that something has been acquired. So even without an ability to say exactly what it was that was acquired, you might be able to recognize a signature of the acquisition of, for example, in the context of today's talk, this map that sends a sequence of tokens in context to some other Transformer: the type A phase transition. Maybe there's some signature of that which you can detect, and you may not know what has been picked up, but you can then go looking for it. So maybe that's a way in which these two pictures could mutually reinforce each other. "Kind of a question based on Matt's suggestion last week about finding the smallest model that can perform in-context learning: do you expect it to be something as nice as, say, a one-layer tanh network, or would it have to be a much bigger model? I guess we could just check, right, with a single-layer attention Transformer." Yeah, it could be very universal, I don't know. What supports the ability of the Transformer to compute on arbitrarily many tokens is just that there's a sum: in the softmax there's a sum, that's it, so you can add as many things as you like to it. And you have to reuse the weights in some way, so there's some sort of weight tying. But if you have both of those things... yeah, I have no intuition for how simple it could be before you'd see something interesting. I don't think people have observed in-context learning in systems simpler than Transformers, but that may just be due to the fact that it's very difficult to train anything else on a sufficiently large data set to see it emerge. And it's not that you'll get in-context learning on
arbitrary data sets, right? That's the other interesting thing, which I don't know how to think about and which I haven't touched on here: it's not enough just to have this. The way I presented it, something interesting about the data distribution was that you see sequences of different lengths, but that's probably not sufficient. There's a paper out of DeepMind that argues you need some sort of long-tailed distribution in the truth in order to elicit in-context learning.
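The remark above, that what lets the same weights process arbitrarily many tokens is just the sum inside the softmax (plus weight tying), can be made concrete with a minimal sketch. The following single-head attention in pure Python is my own toy example, not anything from the talk; the query, keys, and values are random stand-ins for learned projections:

```python
import math
import random

def softmax(scores):
    # the sum here is what accommodates an arbitrary number of tokens
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # one attention head: a weighted average of values, with weights
    # given by the softmax of query-key dot products
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

random.seed(0)
d = 4
query = [random.gauss(0.0, 1.0) for _ in range(d)]

# the very same fixed parameters handle contexts of any length
outputs = {}
for length in (3, 7, 50):
    keys = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(length)]
    values = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(length)]
    outputs[length] = attend(query, keys, values)
```

Only the number of terms in the sums inside `softmax` and `attend` changes as the context length varies from 3 to 50; nothing about the parameters does, which is the weight tying being gestured at.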