WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed.
The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except
for errors at the level of individual words during transcription.
Deep learning systems have been found to exhibit power law scaling, where test loss decreases as a power of the data set size. GPT-3 and PaLM are examples of Transformer models that take advantage of this behavior, with scalability and a relational inductive bias enabling improved natural language processing. Scaling laws have been observed in other areas of machine learning and have implications for model selection, with models chosen based on their scaling exponent (gamma) rather than by engineering lambda.
Power laws in deep learning systems were first observed in 2017, when a paper showed power law scaling of test loss as a function of data set size. At small scales no power law behavior was present, but at sufficient scale power law scaling began to appear with a negative gradient; the slope is denoted gamma (or minus gamma), and the test loss never goes to zero but approaches an irreducible error. These power laws have implications for model selection, and could shape the next era of deep learning progress.
Power law behavior in neural networks was first observed in 2017, once datasets and models became large enough. Transformer models are the most prominent example, with LSTMs having smaller regions of power law scaling. GPT-3, released in 2020, is a Transformer model that takes advantage of scaling laws; it is parallelizable and has an inductive bias that models data as entities and their relations, and it is seen as a breakthrough in natural language processing. Tokens are embedded into a vector space, and each entity is updated by paying attention to other entities: a query, key and value vector are extracted from each entity, and values are propagated with weights to update the entity.
Scaling laws for Transformers trained on natural language were first explored in a paper which showed that when scaling up the data set size, compute and parameters need to be scaled appropriately in order to observe power law behavior. Follow-up papers clarified the proper scope of these scaling laws, with results showing power law behavior similar to the Hestness et al. result. Given a sequence of tokens, the task is to predict the next token, and the relevant loss is the cross entropy on text scraped from the internet. This is a natural language setting, but scaling laws have been detected in other areas as well.
Power laws have been observed in many areas of machine learning, such as random forests, text-to-image and image-to-text tasks, video and mathematics problems. Scaling laws in neural networks are accompanied by emergent capabilities at various scales, such as in-context learning and arithmetic. Kaplan et al. suggested training on about 300 billion tokens and focusing on model size, while the Chinchilla scaling laws re-evaluated this relationship, suggesting that given a fixed amount of compute it is more optimal to train smaller models on more data. This has led to a new frontier of scaling, where tech companies are investing billions of dollars to collect larger data sets.
Scaling laws have been observed for natural language tasks and other datasets, with exponents of around a half. Google recently published a paper about a class of models called PaLM, which is a Transformer with additional features. The largest PaLM model had half a trillion parameters and was trained on a wide variety of tasks, beating the average human on many of them. Performance figures showed that the loss behaves smoothly as a function of the other quantities, but individual capabilities often go through phase transitions. An example is the three-digit addition task, which had essentially zero accuracy until about 10^10 parameters, but rapidly approaches perfect performance beyond that.
Recent developments in machine learning have seen the emergence of reasoning capabilities in larger models, such as the 'Chain of Thought' prompting technique, which was applied to the Big Bench dataset and found to significantly improve the performance of models like GPT-3 and PaLM. The connection between the test loss and the generalization error in SLT is still unclear. Model selection is changing too, as Transformers can scale better than other models. This means that the scaling exponent (gamma) is more important than lambda, and models should be chosen based on their scaling exponent rather than by engineering lambda.
Power laws are a common relationship between two quantities in which one is proportional to the other raised to some power. They have been widely observed in physics, biology and other areas of science. Deep learning theory has been exploring scaling laws since 2017, with implications for model selection. This talk revisits the subject of power laws in deep learning, comparing lambda (the RLCT) with the power law exponent gamma, and speculating on what the next era of deep learning progress might look like.
Power laws in deep learning systems were first observed in a 2017 paper titled "Deep Learning Scaling is Predictable, Empirically" (Hestness et al.). The paper trained Transformer models and LSTMs on various tasks and observed power law scaling of test loss as a function of data set size. At small scales there was no power law behavior, but at sufficient scale power law scaling appeared with a negative gradient. The test loss never goes to zero, converging instead to an irreducible error; the slope of the power law is denoted gamma (or minus gamma).
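To make the reported relationship concrete, it can be written in the following form, where D is the data set size, gamma the scaling exponent and L_inf the irreducible error (a sketch of the functional form described in the talk, not the fitted constants from the paper):

    L(D) \approx a \, D^{-\gamma} + L_{\infty}
    \log\left( L(D) - L_{\infty} \right) \approx \log a - \gamma \log D

Subtracting the irreducible error and plotting on log-log axes therefore gives a straight line with slope minus gamma, which is the straight-line region visible in the paper's plots.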
Power law behavior in large-scale neural networks was first observed in 2017; before then, models and datasets were too small to observe it. Transformer models were found to have the most extensive power law scaling behavior, with other models such as LSTMs having smaller regions of power law scaling. Variants of the Transformer have been explored in the literature, but the original model is still the one used in cutting-edge work.
GPT-3, released by OpenAI in 2020, is an important example of how scaling laws can be taken advantage of. It is based on the Transformer architecture, which involves message passing or an attention mechanism loosely analogous to neighbouring spin interactions in the Ising model. GPT-3 is parallelizable and has an inductive bias that models data as entities and their relations. The paper highlighted the relationship between scaling compute, dataset size, and model size, and is seen as a breakthrough in natural language processing.
GPT-3 is a Transformer, essentially the same model used since 2017 but scaled up and trained on a large data set. It is known for its emergent capability of in-context learning. Transformer models take in tokens, each roughly four characters of text, and embed them into a vector space. Each resulting entity is then updated by paying attention to other entities: a query, key and value vector are extracted from each entity, and values are propagated with weights, based on the dot product of key and query vectors, to update the entity.
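To make the attention update concrete, the following is a minimal single-head dot-product self-attention sketch in numpy. It illustrates only the query/key/value mechanism described above; the dimensions are arbitrary, and real Transformers add multiple heads, masking, layer normalisation and feed-forward layers, none of which are shown here.

    import numpy as np

    def self_attention(E, Wq, Wk, Wv):
        # E: entities, one row per token (n_tokens x d_model)
        Q = E @ Wq                                       # query vector per entity
        K = E @ Wk                                       # key vector per entity
        V = E @ Wv                                       # value vector per entity
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # q_i . k_j for all pairs
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over j
        return weights @ V                               # each entity updated by a weighted sum of values

    # toy usage: 5 tokens embedded in a 16-dimensional space
    rng = np.random.default_rng(0)
    E = rng.normal(size=(5, 16))
    Wq, Wk, Wv = [rng.normal(size=(16, 16)) for _ in range(3)]
    E_updated = self_attention(E, Wq, Wk, Wv)            # shape (5, 16)

In a full Transformer this update is repeated over several rounds (layers), and the final entities are used to predict the next token.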
The GPT-3 paper showed that providing tokens to the Transformer in its context does not change its weights, but with in-context learning, providing examples of a task in the context allows GPT-3 to perform that task. A follow-up paper showed results of training models at different scales of compute, data set size and model size, and came up with an empirical formula to describe the results. The plots showed power law behavior, similar to the Hestness et al. result.
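As a concrete illustration of in-context learning, a prompt of the kind described in the transcript might look as follows (the strings are invented for illustration; the model's weights are not changed, it simply continues the pattern):

    ab -> ba
    abba -> abba
    abbb -> bbba
    cdef ->

A sufficiently large model will usually complete this with "fedc", performing the string-reversal task specified only by the examples in its context.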
Scaling laws for Transformers trained on natural language were first systematically explored in a paper which showed that when scaling up the data set size, compute and parameters need to be scaled appropriately in order to observe power law behavior. Subsequent papers clarified the proper scope of these scaling laws. Given a sequence of tokens, the task is to predict the next token, and the relevant loss is the cross entropy on text scraped from the internet. This is a natural language setting, but scaling laws have been detected in other areas as well.
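The loss being described is the cross entropy of the model's predicted next-token distribution against the token that actually occurs. A minimal sketch, using an invented four-token vocabulary instead of GPT-3's roughly 50,000 tokens:

    import numpy as np

    vocab = ["the", "cat", "sat", "down"]                 # toy vocabulary
    predicted = np.array([0.1, 0.2, 0.6, 0.1])            # model's distribution over the next token
    true_next = "sat"                                     # the token that actually occurs

    # cross entropy against the one-hot truth is just -log p(true token)
    loss = -np.log(predicted[vocab.index(true_next)])     # ~0.51 nats

The test loss in the scaling law plots is this quantity averaged over held-out text, and it is its logarithm (after subtracting the irreducible error) that falls on a straight line.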
Power laws have been observed in many areas of machine learning, such as random forests, text-to-image and image-to-text tasks, video and mathematics problems. This phenomenon is now expected to be fairly universal, with some caveats. The Shape of Learning Curves survey catalogs 10-20 papers that observe power law behavior in machine learning. It is closely related to learning theory, and power laws have been observed in statistical mechanics too.
Scaling laws in neural networks are important as they indicate emergent capabilities at various scales, such as in-context learning and arithmetic. These scaling laws are the engine of AI progress, the core tenet on which everything else depends. They suggest that it is possible to predictably transform energy and raw materials into useful AI progress. This is in contrast to random forests, where emergent capabilities are not observed.
Scaling laws for neural networks have been studied systematically since 2017. Kaplan et al. (2020) suggested training on about 300 billion tokens and focusing on scaling up model size, while the Chinchilla scaling laws re-evaluated this relationship, suggesting that given a fixed amount of compute it is more optimal to train smaller models on more data. This has led to a new frontier of scaling, where tech companies are investing billions of dollars to collect larger data sets. In the Chinchilla paper's notation, N is the number of parameters and D is the data set size, and the paper found that it is more optimal to scale both.
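In the Chinchilla paper's notation, with N parameters and D training tokens, the fitted loss and the resulting compute-optimal allocation have roughly the following form (the functional form is from the paper; treat the exponents as approximate):

    L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
    N_{\mathrm{opt}} \propto C^{a}, \qquad D_{\mathrm{opt}} \propto C^{b}, \qquad a \approx b \approx 0.5

so for a given compute budget C the parameter count and the number of training tokens should be scaled up in roughly equal proportion, which is the sense in which it is better to train a smaller model on more data than Kaplan et al. recommended.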
Scaling laws have been observed for natural language tasks and many other datasets, with exponents of around a half. This contrasts with the exponents observed in Kaplan et al., which are around 0.0495, roughly an order of magnitude smaller, and it is unclear how to reconcile the two. A table from the Chinchilla paper shows the optimal number of parameters as a function of compute, with exponents of about a half. Different architectures, such as Transformers and multi-layer perceptrons, have been observed to have scaling laws, with the latter having an exponent of about a third.
Google recently published a paper about a class of models called PaLM, which is a Transformer with additional features. The largest PaLM model had half a trillion parameters and was trained on a wide variety of tasks, including mathematics problems at high school and undergraduate level. Performance figures showed that on many of the tasks the PaLM model was beating the average human. At the same time, the loss behaves smoothly as a function of the other quantities, but individual capabilities often go through phase transitions. An example is the three-digit addition task, where accuracy was essentially zero up until about 10^10 parameters, but rapidly approaches perfect performance beyond that.
Recent developments in machine learning have seen the emergence of reasoning capabilities in larger models. A recent paper discussed the 'Chain of Thought' prompting technique, which can improve the performance of Transformer models on tasks that require reasoning or mathematics. This technique was applied to the Big Bench dataset, a collection of challenging tasks, and was found to significantly improve the performance of models like GPT-3 and PaLM. The connection between the test loss and the generalization error of SLT (Gibbs or Bayes, probably Gibbs) is still not established.
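Chain of Thought prompting simply places worked, step-by-step solutions in the context before the question to be answered. A sketch of such a prompt (the particular word problem is illustrative, in the style of the examples used in the Chain of Thought literature):

    Q: Roger has 5 balls. He buys 2 cans of 3 balls each. How many balls does he have now?
    A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.
    Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. How many apples do they have?
    A:

Without the worked example the model tends to answer directly and is more likely to make arithmetic mistakes; with it, the model imitates the step-by-step pattern, and as noted above the benefit grows with model size.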
Model selection is changing now that models can scale. Transformers have become popular not because of traditional model selection criteria, but because they can scale better than other models. If models have a constant gamma, then a smaller lambda should be preferred. However, if scaling is possible, then the scaling exponent (gamma) is more important than lambda. This means that instead of engineering lambda, it is more important to prefer a class of models with a larger scaling exponent.
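To see why the exponent dominates, compare two hypothetical model classes under the form discussed in the talk, L(n) - L_0 = lambda / n^gamma. If class A has constants (lambda_A, gamma_A) and class B has a larger lambda_B but also a larger exponent gamma_B > gamma_A, the two learning curves cross at

    n^{*} = \left( \frac{\lambda_B}{\lambda_A} \right)^{1 / (\gamma_B - \gamma_A)}

and for n > n^{*} class B has the lower loss, however much larger lambda_B is. This is the sense in which, in a scaling regime, a larger gamma eventually beats a smaller lambda.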
Deep learning models are beginning to show the potential to learn more from fewer examples, raising their scaling exponent. This could be achieved through changing the data distribution or training process, such as Chain of Thought prompting. This would enable models to extract more information from a given set of data points than would be possible without reasoning, and could be an efficient way of learning.
so today's talk is going to be about scaling laws or power laws in deep neural networks actually this is a retread or a revisiting of ground that I covered in the very first metauni event back in January 2021 so the very first talk that was ever given at metauni was on this subject on the subject of power laws and deep learning and what deep learning theory might be for and it feels like about a hundred years ago I have to say in terms of the progress that's been made so I'll be going over some of that more or less in the same way I did uh two years ago but there's been a lot of progress since and that's some of what I'll cover today so you'll see around the world there are various boards with some papers on them and I'll say a little bit about each of the papers given the time constraint I won't be able to go very deeply into any of them so this is more meant to be a kind of entree into an important and fast evolving area of science than a kind of detailed look at any one part of it okay um so let me give a brief overview so I'll start by saying something about what a scaling law is and then I'll give a quick history of scaling laws in deep learning which really starts in 2017 and then I'm going to say something about the uh implications for model selection as a principal mode of doing statistics and that'll come down to a comparison between the RLCT or lambda and the power law exponent which I've been writing as gamma following one another and I'll end with some speculative remarks that I made actually in that talk two years ago and I would argue have now been borne out uh and which seemed to me to characterize what the next era of deep learning progress probably looks like so let me start with what is a scaling law you should ask questions at any point obviously well that I can answer in on this just this slide so a scaling law is just a relationship between two quantities that looks like the following right that's a scaling law or power law maybe typically B is zero uh so then we get something like this of course so a function that looks like this is called a power law power laws are very common in many areas of science from modeling uh from models in biology physics is probably the first place that power laws were noticed pervasively there's a whole theory in thermodynamics about why power laws arise from critical phenomena I'm not going to really touch on any of that uh so a power law is just a function that looks like this and um power law scaling
is when such a power law appears in your system and people often refer to these power laws in deep learning as scaling laws or neural scaling laws but it's a little different to the way that you would talk about them in physics but I'll follow the deep learning literature terminology and just refer to these things as scaling laws going forward okay so where were these first noticed in deep learning systems so this is a graph from a paper the title is here somewhere at the moment it is often not cited but this is the first place that people as far as I know noticed power laws in deep learning yeah so it's uh Deep Learning Scaling is Predictable, Empirically this is uh Hestness et al. in 2017. so what they did was they trained some Transformer models also LSTMs on various tasks mostly natural language if I recall correctly and they observed power law scaling for test loss as a function of data set size okay so that's what you see here and here by generalization error uh what they mean is test loss so cross entropy on a test set so the task generally for these language examples which is most of what I'll talk about today is to predict the next token given some sequence of previous tokens and the tokens are taken from some dictionary for GPT-3 there's 50 000 possible tokens so imagine a distribution over these 50 000 tokens the model outputs some distribution you take the cross entropy between that distribution and the truth and that's your loss and the model is optimizing that loss to zero and that's what's being graphed here although it's the logarithm of the loss so if we go back to this first page here so if I take this formula and I take the logarithm of both sides I get this so this is what you'll see in the y-axis in these plots and this is what you'll see on the x-axis right so in the case we just looked at X is the data set size and uh y minus B is the test error so a few remarks following what Hestness et al. observed they observed that at small scale say small data set size there is no power law behavior that's what you can see in this graph here so you sort of don't get on the power law train until sufficient scale and then you start to see power law scaling and in the examples we'll be interested in gamma is negative so let me just fix this power law back on the first board and change it to a minus gamma and that means that this is a slope with negative gradient right the slope here is the gamma or minus gamma now the test loss never goes to zero uh
there's some irreducible error as they call it so you eventually converge to that and that's this B that I've got on the first board right so that's B here is to do with what's denoted by irreducible error in that plot so if you subtract this lower bound on the error from your test loss and take the logarithm then for some range of scales you get power law behavior that was what was observed well they call it power law of course you could argue that maybe they didn't do enough to try and fit other kinds of functions to this graph and that would be an accurate assessment but their reporting is that they observe power law behavior uh as described now this paper was uh not a lot of attention was paid to this so this is the first that I know of that discussed power law behavior in large-scale neural networks there may have been papers looking at power law behavior in say one layer networks earlier than that I'm actually not clear on the chronology but this was I believe the first paper to do this at large scale using Transformers and so on so let me see if I want to say any more about that yeah I think it's worth um observing that we can roughly kind of think of this as a graph also of the history of deep learning so why didn't we see power law behavior earlier right deep learning the modern incarnation has been going since 2012 so why is it 2017 that's only five years may seem like a strange question to a mathematician but five years is a long time in the deep learning universe so why did it take so long to observe these power laws well it's just a matter of scale right so uh up until 2017 we were sort of arguably in this era here where our models were too small and our data sets were too small so that we didn't notice power laws and it was inevitable that as we scaled things up we would see power laws that leaves out the role of Transformers as a particular architecture it's unclear which architectures have power law scaling and which don't LSTMs in this paper do have also power law scaling behavior so it isn't restricted to Transformers but you can find more recent literature which shows that many other models including LSTMs may have power law scaling for some region but it's much smaller than for Transformers and even variants on Transformers yeah it's very strange the Transformer model that's being used in cutting edge papers is essentially the original one there have been probably at least a thousand papers on variants of the Transformer but it's
like some kind of bitter lesson uh gladiator battle where all the contestants trying to improve Transformers just fall off the scaling law cliff and can't compete with the original exponent it's kind of funny so to this day Transformers are still the main thing people are using to take advantage of the scaling laws and we haven't found better models with better scaling laws but more on that in a moment are there any questions about this um this first slide hey guys say a little bit about what the Transformers are or what makes it special what does that matter yeah I can say a bit about what makes them special I won't go into the details of the architecture today or maybe I'll do that next week or some other time so Transformers as compared to say a simple dense feed forward network uh involve you could say message passing or an attention mechanism which is okay you know about the Ising model so think about an Ising model and how neighboring spins influence a given spin there's a kind of dot product involved right and that's kind of one way of thinking about the basic idea in a Transformer the reason why you might think that okay you might ask is it the attention mechanism or some other detail of Transformers that really makes them work with the scaling laws I don't know the answer to that nobody does as far as I know one of the reasons they're important is that they scale very well with compute so they're very parallelizable and that's one of the crucial things about a Transformer apart from that yeah I think the relevant thing to know is that Transformers are kind of an inductive bias that says you should model your data as being made up of entities that interact so it's a kind of relational model of data based on entities and their relations and that is the key inductive bias in the Transformer and to the extent that you know you can uh whether or not that is actually a good description of how Transformers are modeling the data is a question for interpretability so I don't recommend you buy too heavily into that idea that it's really about entities and relations that's maybe not the case but that's how people usually understand it and I think there's some truth to that yeah okay so the next paper I'm going to highlight is the GPT-3 paper from OpenAI this came in 2020 of course GPT-2 was before this and then GPT before that uh but GPT-3 was the first time you really see an emphasis on this relationship between scaling uh compute data set size and model size with
performance and this relationship wasn't really characterized in detail until a follow-up paper which I'll talk about in a moment um yeah maybe I won't say too much about in context learning just yet I'll come back to that but this is really the highlight of the paper and what they chose to use as the title it's the idea that so this is uh so what made GPT-3 special was it's a Transformer in some sense it's the same model that existed back in 2017 I think was when Transformers were introduced it's the same model just scaled up and trained on a large data set so at that time in 2020 175 billion was a big number that's not considered to be a big model anymore but this um well okay that's a bit of a complicated story I guess but uh things have progressed a lot since 2020. what was really new about GPT-3 was this emergent capability which was uh in context learning and maybe I will say a few words about that forms the rest of the context um I think the second slide is a bit harder to read so write something here so how does a Transformer work well it takes in some number of tokens so tokens come from some set T of tokens about fifty thousand so think of these as words for example or parts of words so a token is roughly speaking about four characters so including spaces and punctuation and things like that so you take a token maybe Mary had blah blah blah blah um there's a learned embedding of those tokens that discrete set into some vector space with an inner product on that vector space and every entity or you sort of think about the individual words as entities each entity receives signals from every other entity depending on so let's say this is entity E2 let me take that back so once you've embedded you've got a sequence of words you transform them by some learned embedding into some vector space you call the images of those tokens under that mapping entities and then each entity is updated by paying attention to other entities so E2 prime is some function of E1 through En via something like the dot products it's a bit more complicated than that really from each entity you extract by learned transformations a query a key and a value what updates each entity are values that are propagated with weights remember this entity Ei entity Ei is updated by receiving signals from all the other entities the other entities propagate signals to Ei based on the dot product of a key vector that's extracted from the jth entity a query vector for the ith entity and a value vector for the
jth entity so Ei prime is a function of a sum that looks up to normalization like this sum over j of e to the qi dot kj times vj so passing messages between entities they receive information from all the other entities through multiple rounds so this passing of messages happens for some number of rounds and then you take the resulting final entities say E1 final E2 final and then you make a prediction for the next token okay um these learned transformations that produce the original entity representations from the tokens and that produce queries keys and values from entities those are the weights in the Transformer or at least some of the weights in the Transformer and that's learned by stochastic gradient descent but in context learning which is this emergent ability that happens at scale that happened at the scale of GPT-3 but wasn't present at the scale of GPT-2 is the fact that within this context providing tokens to the Transformer in this context doesn't change its weights but within this context you can provide examples of tasks that you want GPT-3 to do so instead of saying Mary had a little lamb you could say uh ab becomes ba abba becomes abba abbb becomes bbba and that's a sequence of examples of a reverse string operation and if those examples occur in this context and then you provide as the last piece of the context a new string to reverse GPT-3 I've done this many times you may be unlucky if you try it but generally speaking GPT-3 will get it and perform that task so that's what they mean by in context learning and that's one of the key examples of a capability that emerges at scale any questions about this last discussion all right so we'll move on to the next paper which is at another place so follow me please so as I said they noticed in the GPT-3 paper uh well obviously they knew all of this these papers are published very close to each other uh but it was only published in the GPT-3 paper that they trained many scales of GPT-3 different sizes of model and then compared them and you could see a kind of power law behavior which was described there but it was in this follow-up paper that they published much more complete results where they trained the models at many different scales of compute that is training time uh data set size and model size and tried to come up with an empirical formula which described the results of doing that here you can see it's a bit hard to read but this top left graph is uh okay actually it's the middle one that is similar to the Hestness et al. result
right so the middle one is a graph of it says test loss but it's really the logarithm of test loss and test loss minus some uh some number against the logarithm of the data set size that is these axes are logarithmic scale and you see that there's a straight line which is what we also saw in that earlier plot okay but they also noticed uh similar power law behavior for graphing um against compute this is just the number of cycles that the if you add up all the cycles executed in all the GPUs that are training the model there's a power law behavior with respect to that as well and also with respect to the number of parameters now there are details here so it's uh when you're scaling up the data set size you also need to be scaling up compute and parameters appropriately in order to see this power law behavior I don't want to get too much into the precise details of how they did it in this paper because unfortunately this paper is somehow wrong in a subtle way that was later clarified by a paper we'll talk about in a moment out of DeepMind so this was the first paper to really systematically explore scaling laws but it's not the definitive presentation of the scaling laws for Transformers unfortunately it's you know it's quite a complicated business and it was a little bit off what they presented here okay um and just think for a moment so we observe scaling laws but for what well for Transformers trained on natural language what do I mean by trained on natural language and what is the task well I kind of outlined it briefly just a moment ago so you're given a sequence of tokens for example a sequence of parts of words and your task is to predict the next word and the relevant loss there is the cross entropy on what text well you scrape a large amount of natural language in English perhaps primarily but uh but also many other languages in smaller proportion from internet sources from Wikipedia from a large corpus of books etc etc so you take many many tokens of text from the internet and you train it on this simple task and you observe these power laws now you could say looking at that that this is a natural language phenomenon so the question that remains in 2020 is what is the proper scope for these scaling laws and that has been clarified uh in subsequent years and this is one of the good examples I'll say more about other areas that scaling laws have been detected in a moment but you can see later in 2020 they published this paper which is uh showing
also scaling law behavior for the same kind of models some small tweaks on images at small scale here but this has been observed on larger images subsequently on text to image and image to text tasks on video and on mathematics problems so this uh starts to look like a kind of universal phenomenon across many distributions of data and since then people have so plus reinforcement learning tasks protein modeling so people have turned Transformers to large data sets of proteins described by DNA sequences and the associated data is various biological information about the proteins chemistry what else we'll talk about some other examples that are along the lines of math problems a bit later but yeah this is like a huge industry now finding scaling laws in many different areas and at this point it's more of a surprise if you don't find a scaling law in the sense that you go looking for the things that are wrong with your model or wrong with your data set and you fix them and then you find a scaling law I know a few papers like that so at this point it's expected to be a fairly universal phenomenon with some caveats which I'll come back to any questions about these two papers okay so what was wrong what was wrong with the first one uh yeah I'll say a bit more about that when you know we come to the paper that fixes it yeah basically they got the exponents wrong uh because they were not training their models sufficiently oh this is actually that paper yeah let me think if I want to say something about um I'm just going to skip ahead in these slides a little bit so let's come back to this just come over here for a moment please okay so you could ask the question okay well there's power laws so what right so what if the test loss behaves as a power law with respect to these various quantities um one thing to note is that this isn't the first time power laws have been observed in machine learning far from it so uh power laws have been observed in many parts of statistical mechanics and it's closely related to learning theory so a paper I recommend on that topic that we've discussed on the Discord is The Shape of Learning Curves which is a survey from 2021 by Viering and Loog so they catalog maybe it's like 10 or 20 papers that find scaling law behavior power law behavior in machine learning from random forests to uh yeah there's a whole slew of different kinds of models so what's the big deal then uh we find scaling law behavior in
neural networks it's also in many other kinds of machine learning why be so excited about this particular scaling law well I would give two reasons one is we actually care so well there's scaling laws in random forests random forests have been around forever does the world seem likely to change on the basis of scaling laws in random forests no why then do we see a qualitative difference in the importance of scaling laws for neural networks well it's because as you follow the scaling law there are emergent capabilities at various scales which are sort of hidden underneath the smooth progression of the scaling law so it's the connection between the scaling law behavior which suggests that well the scaling law behavior says you can predictably improve the test loss and empirically it's observed that as you lower the test loss qualitatively new and interesting capabilities emerge in these models without being put in by hand the most important example is the in-context learning that I already described this paper here details several other examples for example the GPT-3 paper documented how addition and other arithmetic capabilities emerge at a particular scale and there's other examples along those lines in this paper so the scaling laws for neural networks are sort of an indicator and related to the emergence of capabilities which is really something new you don't get emergent capabilities in very large random forests as far as I know the second reason that scaling laws are important is that these scaling laws are already the engine of AI progress so if you think of the engine that drives the field forward as being simply collecting more data and more GPUs and feeding it to train bigger models it's not like the engine by itself is sufficient to make a machine right you need to direct that heat you need to funnel it you need to control it there's all sorts of equipment and machinery and piping that's required to feed the raw materials to the engine and to direct the heat usefully after it's generated and that's happening and so it's not like scaling by itself is sufficient just like heat just like the fire isn't sufficient in an engine to do the work but it's the core tenet on which everything else depends and the scaling laws are quickly becoming the core of much of what's interesting in AI progress and so for that reason they're um it's the emergence of a universal principle that suggests that it's possible to predictably transform energy and
chips into intelligence and that puts us in a position that's really unlike where we were before 2017. okay so that's kind of why to care about scaling laws let's go back now to the Chinchilla paper and I'll um sorry to drag you back to the previous set of boards okay so this is the Chinchilla paper so this is DeepMind redoing the Kaplan et al. paper that we just looked at which was the first to really systematically study scaling laws for neural networks they did it a bit more thoroughly and came up with different conclusions and their conclusions were that let me see if I can underline it in the abstract yeah so what the Kaplan et al. paper said in 2020 which subsequently directed billions of dollars of investment from large tech companies all over the world was that you should probably train on about 300 billion tokens and focus on scaling up the size of your models just the number of parameters so people did that and they went up to half a trillion or even higher uh but the Chinchilla scaling laws re-evaluate the relationship between given a fixed amount of flops amount of compute the optimal way of allocating it between model size and data set size so uh that means that given a fixed amount of compute it's more optimal to train smaller models on more data significantly more data so since this paper was published in 2022 which is still this year it's hard to believe that was in March now everybody is rushing to collect more data so this is the kind of new frontier of scaling is getting larger and larger data sets and yeah that's kind of an interesting change of course maybe you don't want to scale both but for the moment we're constrained we can train models at you know around a trillion parameters but we don't have the data to train them optimally yet okay let me see I want to actually write down some formulas um I don't think yeah okay let me write down the actual scaling oh no it's here it's on the next board okay what's very confusing for us about these papers is that uh for them N is the number of parameters which is our d and their D luckily it's capital is uh data set size which is kind of our n all right so keep that in mind uh yeah okay maybe it's not important to get bogged down in this I think I can explain it but uh the important thing for us from the perspective of what I was explaining earlier is to just note on this final slide that what they end up finding is that let's talk about it as a
function of N they find that L as a function of N taking the optimal data set size looks like A on N to the alpha where alpha is a half it's not exactly stated in the paper but you can just infer it from the way they write other things so there's a scaling law for the test loss as a function of the number of parameters which has an exponent of a half now I don't really understand how to compare that to the GPT-3 paper to be honest so the scaling laws for the same kind of quantity in Kaplan et al. have exponents that look like 0.0495 so roughly an order of magnitude smaller I do not understand how to reconcile that conflict but this one half seems to be very strongly replicated now across not only natural language tasks but many other data sets okay let's continue with um some of this material over here okay so this is a table from the same paper uh these numbers in the figures aren't uh exactly the exponent that I was talking about earlier uh they're talking about the optimal number of parameters as a function of compute which uh is closely related to that parameter and there's an a and b in this table the first column is a and the second column is b in their notation the exponents we were talking about earlier alpha and beta but the relationship between them means that if a and b are roughly a half then alpha and beta are also roughly a half so hence the statement I made on the previous board okay just to give you an idea of these scaling exponents for different models and this maybe this was published in a follow-up but this was looking at different architectures both architectures of Transformers and also different kinds of architectures that aren't Transformers and seeing if they have scaling laws and if so what the coefficients are and you can see here they also get a similar coefficient of about a half for the original Transformer you should look at the first column but the Performer is one of these more compute efficient Transformers that people invented because Transformers are sort of quadratic time complexity in the number of entities which doesn't have a very favorable exponent and you can see down the bottom that there's some multi-layer perceptron that only has a scaling exponent of a third but it's useful to know that these non-Transformers also have scaling law behavior although the story is a bit more complicated than that the scaling laws maybe don't last as long along the x-axis as they do for
Transformers okay but you don't see numbers much bigger than a half here so that's the next kind of topic I want to get onto one more set of results so this was yeah quite recently so in April Google published a paper about a class of models they call PaLM uh I won't say much about the architecture it's a Transformer roughly speaking with some kind of additional features uh the main thing about this is that they tested this at very large scale not only in the size of the model but also the number of tasks so the largest PaLM model was half a trillion parameters and they trained it on not just natural language but a very wide variety of tasks uh and here you can see some performance figures for 150 of the tasks and the scaling law well what's presented here is not exactly the it's not the log loss on the y-axis it's um some other metric of performance but you can see that on uh many of the tasks the PaLM model is beating the average human maybe one shouldn't read too much into that necessarily but that's an indication that it's not pathetic um and some of these tasks really are quite interesting so some of them are data sets of mathematics problems at high school and also undergraduate level and I'll show you some examples in a moment hopefully okay uh yeah this way so now I'm going to talk about uh so this paper is from Anthropic it's an AI safety focused organization which is talking about this emergent capability business that I mentioned earlier so at the same time as you have smooth general scaling in the sense that the loss behaves very smoothly as a function of these other quantities say compute underneath that individual capabilities often go through phase transitions so maybe the model really doesn't understand how to add or do some other task and then suddenly it rapidly transitions to understanding it and so the smooth progression of the overall loss is kind of an aggregate of phase transitions on many individual subtasks although there's some debate about whether they're really sharp transitions or whether you're just measuring them incorrectly but for the purposes of today let's just stick with a simple narrative and three-digit addition is kind of the standard example so the accuracy of GPT-3 or Gopher or Google um I think that's the PaLM model perhaps uh on three-digit addition is basically zero up until about 10 to the 10 parameters and then rapidly approaches perfect performance as you scale beyond that
program synthesis is another example that's along those lines okay so you could look at these kinds of tasks and say that it's something like reasoning that is being picked up as you scale up models and at the part you know maybe that was a bit speculative at the time of these results uh but it's becoming clearer and clearer that some kind of reasoning capability is emerging at a sufficient scale all right and down here for the last set of boards thank you so one of the sets of tasks that the PaLM model was trained on was called BIG-bench and this was meant to be a wide variety of tasks that would be challenging for current large models uh as often happens these challenges are falling rather rapidly and the very interesting recent I don't know when this paper was from but I think it's within a month uh it's a very strange thing that people have noticed which falls under the basket of emergent capabilities which is what this CoT graph which is Chain of Thought so Chain of Thought is a means of improving the performance of Transformer models by adding into their context examples of thinking through a problem step by step and then encouraging them to solve the problem that they're given in a similar fashion so that's called Chain of Thought prompting and it improves the performance of models like GPT-3 or PaLM significantly on tasks that are heavy on reasoning or mathematics these kinds of things so you can see here for instance that the average score on the BIG-bench Hard BBH which is the subset of tasks in uh BIG-bench for which PaLM I believe this is correct for which PaLM did not outperform the average human rater I think there's something like 20 or 23 tasks um it's still fairly preliminary but ah this is a very interesting relationship between I mean uh you see not only does it improve the performance of PaLM but the larger the model is the more it improves all right which brings me to the final part of the talk let me see what's on these other boards maybe I'll skip over it yeah I don't need to cover this okay so a little bit of theory to finish off any questions first about what we've covered so far [Applause] okay so I said I'd say something about uh okay there we go uh lambda versus gamma okay so I'm using the notation from SLT the connection between what we're talking about and SLT remains unclear the test loss is perhaps to be compared to the generalization error either Gibbs or Bayes probably Gibbs but this connection is not established
similarly we don't really know how to talk about scaling with respect to anything other than data set size and even this connection is uh maybe a little shaky the number of uh the size of a sample which is our n in SLT may be similar to the number of tokens perhaps seen during training but perhaps not so I don't want to overstate how tight this connection is but I think it's still useful to think through using the tools of SLT what all this means and for that purpose just indulge me in using these letters and sort of relying on their meaning in the SLT context so suppose test loss or generalization error looks like um L minus L0 is lambda on I'll use n to the gamma for some lambda and gamma so this would be something like the RLCT and this would be some exponent which is uh one in the realizable case or more generally in the renormalizable case all right well so as I said before this gives us a relationship that looks like this that means that if you can affect gamma it's more important than lambda all other things being equal why well because a line which has a steeper slope will eventually beat a line with a smaller slope even if it starts higher right so model selection as it's currently understood translates to preferring assuming that the error is zero models which are smaller right simpler lambda smaller free energy lower that's model selection between competing models you can compare their evidence or free energy and that prefers models with a smaller RLCT the RLCT the lambda log n term kind of looks something like this y-intercept if you go back far enough but uh that's only under the circumstance where gamma is constant if all your models have gamma equal to one then yes you should prefer a model with a smaller lambda but if you're in a scaling regime and you can make n large enough to get the lines to cross and you have the ability to affect gamma then this is a more important quantity and you should prefer a class of models that has a larger scaling exponent rather than trying to engineer lambda as the primary thing so this is a sort of new era of model selection in some sense which we're seeing play out around us right now yeah so Transformers themselves won not because of some uh comparison in old school model selection terms but because they could scale and other models couldn't and now we're starting to move into an era where maybe it looks like it's actually possible to engineer gamma which is the really radical uh
possibility so we've seen a scaling exponent of a half for Transformer models and uh up till now in terms of typical deep learning practice where you change the architecture it's not clear how to change the architecture to improve the scaling exponent but it seems possible that we're witnessing the beginning of a practice of doing exactly this not by changing the architecture exactly although maybe that will happen too but by changing the data distribution or training process for example this Chain of Thought prompting now uh this may turn out to not be a systematic way of improving scaling exponents and right now it's not something that's kind of baked into the model itself it's something you do at the end so you train the model say GPT-3 and then you somehow encourage it to do more with less by putting something in its prompt so some examples that encourage it to take things step by step but you could imagine a future training paradigm where this is built in to the training process so you are providing training examples during the not just at the end in the context but as part of the training procedure and if that happens and works then we may see that there are methods for increasing the scaling exponent and uh yeah what do I want to say about that well maybe let's go back to this formula on the previous board so uh let me take this and differentiate it with respect to n so what is the scaling exponent so the scaling exponent is roughly something like how much you learn from each example you know scaled to the number of examples something like that which is a pretty good description I mean increasing the scaling exponent from this point of view is a pretty good description of what reasoning actually is right so uh well let's take a meta step and say why have deep learning theory why have theory at all why do you want a theory of thermodynamics well it saves you from doing experiments right if you have a theory and you can do a linear regression you only need to do three experiments and you can infer an infinite number of other experiments from those three by just fitting a straight line to the data right so in some sense being able to reason is a way of extracting a lot more information from a given set of data points than you would be able to without the reasoning and uh it seems that we're starting to understand how to get deep learning models to do something like that or at least these connections are quite tantalizing so this is