WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed. The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.


OpenAI's GPT-4 is a powerful AI language model that can interpret images by tokenizing them, and shows a capacity for auto-catalysis, producing data that can be used to fine-tune itself. It is capable of making sensible predictions and is being integrated into cars for a Knight Rider-like experience. However, it needs further post-training steps to improve its performance and reasoning skills. Institutions such as Morgan Stanley are exploring the use of AI bots to surface important information and help prevent costly mistakes.

Short Summary

OpenAI's GPT models have been instrumental in advancing the field of deep learning, with each generation scaling according to power laws. GPT-1 was a basic Transformer trained on 4.5GB of book text; GPT-2 was a multi-task model with 1.5 billion parameters trained on 40GB of text from the internet; GPT-3, at 175 billion parameters, was trained on a far larger data set. A two-dimensional chart could visually represent the capabilities that emerge as the models scale, and such an infographic could help people understand the progress made in the field.
GPT-3 was the first Transformer model for which scaling laws were clarified across many scales; although trained in 2020, it only reached a wide public in 2022 with ChatGPT. GPT-3.5 was more tailored for chatbot use and is estimated to be significantly smaller and cheaper per token than GPT-3, while GPT-4 can handle an eight times longer context window. How much of the basic research behind GPT-4 is secret is unknown, but it is thought to use a similar corpus and is believed to follow scaling laws. However, it is unclear how confident we can be that the scaling is continuing without knowing the training set size and the number of parameters.
GPT-4's training was more stable and predictable than that of previous models, allowing OpenAI to predict its performance in advance. The scaling laws used to measure that performance are kept secret, raising concerns about transparency and equitability. Despite the secrecy, the headline features held no surprises: the scaling laws continued to translate into capability, leading to improved exam performance, and that very lack of surprises has itself been described as terrifying.
GPT-4 is a language model that can also interpret images. It tokenizes images into patches and maps them to tokens, allowing the model to receive text, robot-arm measurements and camera images in a common token space. GPT-4 performs better on reasoning tasks than Minerva, a model fine-tuned on maths questions, and shows a capacity for auto-catalysis: it can produce its own data to fine-tune itself. Training large language models requires a great deal of trial and error to get the parameters and hyperparameters right, guided by scaling laws; Meta's OPT model has a public record of its training process, which illustrates how scaling laws guide modern deep learning.
GPT-4's improved engineering allows a maximum context size of 32,000 tokens, which may reduce the need for consultants who digest long documents. It is unclear how much of this research is public, but the long-context capability is one of its most impressive features. Running the model over 50 pages is far more expensive than over 5 pages, and GPT-4 is costly compared to its predecessors, at 12 cents per 1,000 tokens. Auto-catalysis, where one of the products of a reaction catalyses that reaction, is a fitting description of how GPT-4 is now being used.
OpenAI has been successful in partnering with other companies on its AI tools. Microsoft, Salesforce and Slack have integrated ChatGPT, Khan Academy and Duolingo have released or demoed tutor bots, and OpenAI has collaborated with the consulting firm Bain to analyse PDFs and other documents. GPT-4 has a final training stage, RBRM, which uses the model itself to check that outputs are consistent with intentions. Intel and Google already use AI systems to lay out chips and write code, and this integration of AI tools into producing the next generation of systems is likely to become more prevalent in the coming years.
AI is demonstrating an emergent ability to explain answers and guide others toward them, similar to the way humans learn, teach and self-reflect. GPT-4 is capable of making sensible predictions about common-sense physics and other real-world situations, and is being integrated into cars for a Knight Rider-like experience. However, the current systems are fragile and easily disrupted, although such fragilities may improve over time. AI learns from humanity's digital exhaust and has improved over time, though it remains dependent on its training sets.
OpenAI's GPT-4 was challenged with a scenario in which a driver stopped their car, got out and did not return until the next day, when they were unable to drive away and had to call for roadside assistance. With better prompting, the model laid out more possibilities, eliminated some and provided more thorough explanations. The speaker used this example to discuss the importance of good habits in reasoning and logical thinking, showing how prompting the AI to question its own assumptions led it to identify the scenario he had in mind.
GPT-4 can make inferences in response to prompts, but needs further post-training steps to perform at its best; this prompted the speaker to reflect on the importance of cognitive and intellectual hygiene. For institutions such as Morgan Stanley, an ecosystem of reasoning bots could be developed to surface important information and help prevent costly mistakes, with the bots churning through documents, writing essays and critiquing them.

Long Summary

GPT-1 was largely ignored by the deep learning community, being seen as a basic Transformer trained on only 4.5GB of text from books. GPT-2 was more successful: a multi-task model trained on 40GB of text scraped from the internet, it was hailed as novel for its unsupervised training across a wide range of tasks, since single-task training on a single domain was seen as a major contributor to the lack of generalization. GPT-3 went further still, scaling to hundreds of billions of parameters trained on an even larger data set.
A three-dimensional frontier of scaling is suggested, composed of the size of the model, the size of the training set and the compute used. A two-dimensional chart could visually represent such a non-linear phenomenon, as is done for evolutionary complexity, with key milestones along the way. OpenAI was cautious about releasing the GPT-2 source code, but one lecturer argued at the time that the technique was not that much of an advance.
It would be useful to have a visual summary of the emergent capabilities of Transformer-based models as they scale according to power laws; this would help the general public understand the progress made in the field. Visualizations of model sizes exist, but they are often too technical for the average person. An infographic tying the power-law scaling laws to the capabilities that emerge at each level of scale would make the trajectory much easier to comprehend.
GPT-3 was trained on a 570GB corpus in 2020 and is capable of in-context learning: modifying its behaviour without changing its parameters. This was also the first time scaling laws were clarified in the context of Transformer models across many scales. GPT-3 then sat in relative stealth mode until ChatGPT brought it to the public in 2022, a gap likely due to business-model problems rather than technical ones: OpenAI's scientific team was making progress, but the business side didn't know what to do with GPT-3.
GPT-3 was released with a playground, but it did not gain much traction. It was a powerful model, yet it required a lot of engineering to make it useful for chatbots. This led to GPT-3.5, which was tuned to follow instructions and behave as a chatbot, and which was used to create ChatGPT; that crossed the threshold of being good enough to put in front of people, who found it useful. The breakthrough was more a business or social phenomenon than a technical one.
GPT-3.5 is estimated to be significantly smaller than GPT-3 and is cheaper per token, while GPT-4 can handle an eight times longer context window. The degree to which even the basic research behind GPT-4 is secret is striking, though it is thought to use a similar corpus. GPT-4 has continued to follow scaling laws, although the deep learning literature has become increasingly cagey about giving information, and it is unclear how confident we can be that the scaling is continuing without knowing the training set size and the number of parameters.
One published graph shows GPT-4's loss at predicting the next token in the OpenAI code base, which is not part of the training set, so the model is in effect learning to predict its own guts. OpenAI claims GPT-4's training was more stable and predictable than that of previous models, with the ability to predict in advance how well it would do at a given point; the scaling laws appear to be working, although the exact parameters used are not known. The other published graphs show performance on downstream tasks, which are not the true training objective. Indeed, OpenAI states that the main engineering objective for GPT-4 was more stable and predictable training.
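The kind of prediction being claimed can be illustrated by fitting a power law L(C) = a·C^(-b) to losses from small, cheap runs and extrapolating to the full-scale run. All constants below are synthetic and purely illustrative; they are not OpenAI's numbers.

```python
import numpy as np

# Illustrative data: (compute, loss) pairs from small training runs.
# A pure power law L(C) = a * C**(-b) is linear in log-log space.
compute = np.array([1e3, 1e4, 1e5, 1e6])   # arbitrary compute units
loss = 5.0 * compute ** -0.05              # synthetic losses, b = 0.05

# Fit log L = log a - b * log C by least squares.
slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a_fit = np.exp(log_a)

# Extrapolate four orders of magnitude beyond the largest small run.
predicted = a_fit * (1e10) ** slope
print(round(predicted, 3))  # ≈ 1.581
```

On exact power-law data the fit recovers a = 5.0 and b = 0.05; with real, noisy runs the extrapolation is only as good as the assumption that the same law keeps holding at scale, which is exactly the claim at issue.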
The latest iteration of AI technology has been met with secrecy, with the number of parameters and the scaling laws guarded almost like national secrets. This has raised concern, since transparency and openness are arguably necessary to ensure alignment and equitability. The headline features held no surprises, meaning the scaling laws have continued to translate into capability; this has led to improved exam performance, but the very lack of surprises was described as terrifying.
Multimodal capability lets the model be applied to real-world examples. GPT-4 has achieved high scores on the SATs, AP exams and the bar exam, and can answer questions based on images of exam papers. Multimodality is not a new concept, but its integration into a system of this scale is. It is not clear how many tokens each image uses, but the model appears to interpret images natively rather than using a separate tool to convert them.
GPT-4 can ingest images directly rather than relying on a separate step that converts them to text. A standard approach is to tokenize images by breaking them into patches and mapping the patches to tokens; this allows the model to receive tokens representing text, robot-arm measurements and camera images, all embedded into a common space. GPT-4 performs better on reasoning tasks than Minerva, a model fine-tuned on maths questions. It also shows a capacity for auto-catalysis, meaning it can produce its own data to fine-tune itself.
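GPT-4's actual image tokenizer has not been published; a common scheme, used by Vision Transformers, is sketched below. The patch size and image dimensions are chosen purely for illustration.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C); each row
    plays the role of one image 'token' before being embedded into the
    same vector space as text tokens.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    return (
        image.reshape(h // patch, patch, w // patch, patch, c)
             .transpose(0, 2, 1, 3, 4)   # group by (row-block, col-block)
             .reshape(-1, patch * patch * c)
    )

# A 224x224 RGB image with 16x16 patches gives 14 * 14 = 196 tokens,
# each a flattened vector of 16 * 16 * 3 = 768 numbers.
tokens = patchify(np.zeros((224, 224, 3)), 16)
print(tokens.shape)  # (196, 768)
```

In a real model each 768-dimensional row would then pass through a learned linear embedding, after which image tokens and text tokens are indistinguishable to the Transformer.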
Training large language models requires a lot of trial and error to get the parameters and hyperparameters right; if the scaling laws are not followed correctly, training can crash. Meta's OPT model has a public record of how it was trained, which can be used to understand the process better. Stability is also an important factor, and modern deep learning training is guided by scaling laws to ensure that the model trains correctly.
GPT-4 has improved engineering that allows a much larger context size than previous models, with 32,000 tokens the current maximum. This is a significant advance and may, for some document-digesting tasks, mean the end of needing a consultant. It is unclear how much of the long-context research is secret, but it is one of GPT-4's most striking features; there may be a workaround enabling the larger context, but it is more likely a serious improvement. The cost of running over 50 pages is much higher than over 5 pages, indicating that the computational complexity of the model has increased.
GPT-4 is an expensive product, costing 12 cents per 1,000 tokens, compared to 6 cents per 1,000 tokens for GPT-3 and 0.12 cents per 1,000 tokens for ChatGPT. Price and cost do not always correlate, as seen in industries such as energy, where monopolies can cause a huge disconnect between the two. Auto-catalysis refers to catalysis of a reaction by one of its products, and GPT-4 is beginning to be used in exactly this way.
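Taking the quoted per-token prices at face value (they were launch-era figures and may have changed), and assuming roughly 500 tokens per page as an illustrative figure, the cost gap is easy to make concrete:

```python
# Prices quoted in the talk, in dollars per 1,000 tokens.
# These are illustrative launch-era figures, not current pricing.
PRICE_PER_1K = {
    "gpt-4": 0.12,      # 12 cents
    "gpt-3": 0.06,      # 6 cents
    "chatgpt": 0.0012,  # 0.12 cents
}

def cost(model: str, tokens: int) -> float:
    """Dollar cost of processing `tokens` tokens on `model`."""
    return PRICE_PER_1K[model] * tokens / 1000

# A 50-page prompt at ~500 tokens per page is about 25,000 tokens.
tokens = 50 * 500
for model in PRICE_PER_1K:
    print(f"{model}: ${cost(model, tokens):.2f}")
```

On these assumptions a single 50-page prompt costs about $3.00 on GPT-4 versus about $0.03 on ChatGPT, a hundredfold difference for a single query.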
GPT-4 has a final training stage, RBRM (rule-based reward modelling), which uses the model itself to predict whether the trained model is following guidelines: a way of reflecting on outputs to make sure they are consistent with intentions and instructions. Intel and Google are already using AI systems, including language models, to lay out chips and write code. This integration of AI tools into producing the next generation of systems is likely to become more prevalent over the next couple of years.
OpenAI has been successful in marketing its technology and forming collaborations with other companies. Microsoft was seen as looking smart and innovative, while Google was left behind. OpenAI has collaborated with the consulting firm Bain to use its tools to analyse PDFs and other documents; Salesforce and Slack have integrated ChatGPT, while Khan Academy and Duolingo have released or demoed tutor bots. The GPT-4 webpage also showed a Socratic dialogue mode that could diagnose a student's misunderstanding and guide them to the correct answer.
AI is demonstrating an emergent ability to explain why it gets the right answer and to guide others in reaching it, similar to the way humans learn, teach and self-reflect, and reminiscent of ancient ideas of apprenticeship and mastery. The similarities between AI and humans are striking and unnerving. It is difficult to determine how much of this is due to the training set, given the huge amount of digital exhaust the models are trained on.
AI is able to learn from digital exhaust and has improved over time. Humans are largely dependent on enculturation and education, and AI is no different. Models such as GPT-4 can make sensible predictions about common-sense physics and other real-world situations, which has implications for self-driving cars, even though GPT-4 itself cannot be put directly inside a car.
GM has announced a partnership to integrate GPT-4 into cars, turning them into a Knight Rider-like experience: GPT-4 as an in-vehicle personal assistant, with the self-driving handled separately. However, the current systems are fragile and can be easily disrupted; this was demonstrated by increasing the amount of context chat bots had, which caused them to emulate each other and forget their identities. Despite this, GPT-4 is still very good at following instructions, and such fragilities can improve over time.
GPT-4 has improved significantly at writing, reasoning, answering questions and understanding the world. It is noticeably better than its predecessors, GPT-2 and GPT-3, and is capable of producing and explaining code in Python. The speaker found the improvements quite stunning, an exciting development given that the technology is set to become still more capable and useful.
A driver stopped their car, got out, and did not return until the next day, when they were unable to drive away and had to call for roadside assistance. OpenAI's GPT-4 was challenged with this situation and offered some reasonable points about the environment, the driver's motives and the condition of the vehicle. With better prompting, the results improved: it laid out more possibilities, eliminated some and provided more thorough explanations.
The speaker discussed the utility of good habits in reasoning and logical thinking, giving the example of asking an AI to explain why the driver, who had disappeared overnight leaving the car running, could not drive away on returning the next day: the car had run out of gas. After being prompted to question its own assumptions, the AI identified the scenario the speaker had in mind. The takeaway lesson was the importance of good habits of thinking and reasoning.
GPT-4 is able to make such inferences, yet the speaker was surprised that it did not make them before being prompted. Further prompting improved its performance, leading the speaker to wonder whether this was simply additional good hygiene or something else. He also noticed that GPT-4 has no internal thought process: it starts talking immediately after being asked a question rather than thinking before speaking. This drove home the importance of good habits of cognitive and intellectual hygiene.
GPT-4 has the potential for greater performance with post-training steps such as recursion, where the model produces a response and feeds it back into its own prompt, running the necessary hygiene steps on itself. This could improve results substantially, though perhaps by less than it did for GPT-3. For institutions such as Morgan Stanley, an ecosystem of reasoning bots could be developed, churning through documents, writing essays and critiquing them, to surface important information to VPs. This could help prevent costly mistakes and identify risks that have been overlooked.
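The recursion idea above can be sketched as a small loop. Here `ask` is a stand-in for whatever model call is available (not a real API), and the critique wording is invented for illustration.

```python
def reflect(question: str, ask, rounds: int = 2) -> str:
    """Refine an answer by feeding it back to the model as a critique prompt.

    `ask(prompt)` is a placeholder for a call to a language model; the
    critique wording below is illustrative, not a known OpenAI recipe.
    """
    answer = ask(question)
    for _ in range(rounds):
        answer = ask(
            f"Question: {question}\n"
            f"Draft answer: {answer}\n"
            "List any unstated assumptions in the draft, then rewrite it, "
            "questioning each assumption before concluding."
        )
    return answer

# Toy stand-in model that just records how often it was called.
calls = []
toy = lambda prompt: (calls.append(prompt) or f"answer-{len(calls)}")
print(reflect("Why was the car abandoned?", toy))  # answer-3
```

With `rounds=2` the model is called three times: once for the draft and twice to critique and rewrite it, which is the "run the hygiene steps automatically" pattern described above.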
Silicon Valley Bank failed partly through a lack of attention, competence and relevant knowledge. Superintelligence may emerge from an ability to pay attention to vast amounts of information, with things becoming obvious to systems with unbounded attention that are not obvious to humans. Microsoft has built an OpenAI supercomputer in a data centre over the last few years, although the specifics are not public information.
Microsoft and Nvidia recently announced the installation of tens of thousands of A100s in a Microsoft Data Center, estimated to cost around 200-300 million dollars. This is significantly less than the 500 million dollars spent in 1945 to build the Oak Ridge gaseous diffusion plant for the Manhattan Project. Investments in scientific directions can quickly change from minimal to dramatic, and it is unclear what the trajectory of investments in GPT models will be in the future.

Raw Transcript

uh okay uh Adam's only got an hour today because of the time difference so I suggest we would get straight into it um all right so we're going to talk today about gpt4 um I guess I thought I might start off by giving a brief recap of how we got to here just taking a few minutes to talk about gpt1 gpt2 gpt3 so I'm going to put something about that up on the boards you might take a look at these two pictures that I've got over there and guess what they are the immediate guesses I'm gonna guess GPT 3 and GPT forms and one of them is old very old yeah it's not the Manhattan Project I was going to say the one on the right looks like a like it might be a 1950s 60s 70s sort of data center kind of place some some something like that like the turing machine Maybe [Music] about it I'll I'll tell you in a minute okay so here's a recap of uh not that which is kind of history or the thing on the right is history I think on the left is the future but here's a bit of recent history so GPT one was basically ignored by everybody and derided by the Deep Learning Community as being boring and containing nothing new it's just a Transformer we know what those are all you did was kind of train one that looks big these days it's hilarious that it's not even written in millions of course right so now you write the number of parameters in billions at the time whoa that was a huge model um and it was trained on a huge data set which was four and a half gigabytes so this was of text from books and so on that was in 2018 uh gpt2 was 1.5 billion parameters maybe I'll put some headings here this is parameters this is training set and self-explanatory so this was trained on the bigger Corpus of about 40 gigabytes of text scraped from the internet uh there was a bit more attention to gpt2 sort of the headline each of these sort of came with a an idea of what people what the developers thought was a novel about it and gpt2 was hailed as novel because it was a multi-task model and that was the idea 
that and the quote from the paper single task training on a single domain data set is a major contributor to the lack of generalization uh and the idea here was to train simultaneously on many different tasks by training on a very wide data set in an unsupervised fashion at the time that was a fairly novel at least the scale was normal I think GPT 3 was the first time yeah nice before um for moving on to gpt3 I remember about gpd2 because when this happened I was taking the natural language processing class as in my masters and I
remember the lecturer who's a prominent researcher in the field said well because they they um they didn't release the source code or something like that they they kind of described the technique but they were a little bit Coy about the details or something I can't remember exactly what happened and my lecture was saying that's a bit silly because they uh they're not doing anything that new and it's not that like it's that much of an advance over state of the art but at the time obviously open AI was like at least at least being cautious about that kind of thing I guess yeah for maybe maybe reasons that have changed over the years I guess it used the justification used to be well has always been partially safety I guess but I think there were gpt2 I don't remember there being any details really mysterious do you remember what exactly that was I don't remember yeah okay this seeing you put sketch it out here like this makes me imagine a uh a frontier a three-dimensional Frontier expanding Frontier where the three dimensions are the dimensions along which you can have scaling I suppose we talked about this in the past and maybe it's really only the two but basically you have you have um uh you have the the what is it the size of the model the training size of the training set and the compute that that go into it if I'm remembering correctly those are sort of the three elements maybe I'm wrong about that um that are subject to this the the the quote unquote scaling laws is that correct sorry that's why I may have that those details those details incorrect but what I'm what I think would be really interesting to see um is something analogous to the two dimensional like we could do it in three dimensions but maybe two would be enough but I've seen um a very useful infographics so charts that have sort of a trajectory sometimes it'll be log logarithmic you know to to visually um represent in in a you know a linear visualization of a non-linear phenomena and then a 
stacked on that on you know the the Curve are key benchmarks or thresholds or Milestones along the way things that were achieved and I've seen this for a whole variety of things lots of different processes lots of different phenomena the one that's in my mind um that I'm thinking prominently of is evolution and biological complexity and and and the the um uh the points at which new capabilities and features emerged uh in evolutionary space not necessarily at evolutionary time but just in in space in terms of the complexity and I'm
wondering it would be really neat to see a sort of a a visual summary of the emergent capability that manifests in these systems as they got bigger and and and as they well basically as they scaled according to this laws um it would be uh I think that would be a very useful um tool for the for everybody including the general public to kind of get a sense of of okay well where Where Have We come as we've gone through these these iterations and I don't necessarily just mean the GPT specifically I mean there are many other Transformer based models to my knowledge and it would be neat to see sort of as many as possible all visualized um together and and with a with some sort of visual summary communicating the the the scales at which certain capabilities begin to emerge I think that'd be really really useful um if the data were available I could certainly get my team to to help put that that uh visualization together and maybe that would be something worth publishing you have seen visualizations of sizes of models and so on but they tend to be kind of inside a baseball a little bit because it's not when you look at those graphs you kind of have to be pretty familiar with many many different pieces of work in the literature to understand how those numbers translate into capabilities which is the I mean as I've repeated in the past the there's like a a one-two punch to understand our current ERA one is scaling laws but the scaling laws referred to as you say like compute model size and data set size and those aren't very visceral quantities and the loss which is the quantity that decreases with the power law it's sort of a very esoteric thing right it's just the accuracy on predicting the next token which doesn't obviously correlate with things we care about and it's the so that's the left punch and the Right Punch is uh the translation of those power laws scaling laws into capabilities and that's what we sort of see emerge in each generation and that's what's in the 
blue here right so somehow at each level of scale of unexpected things happen hard to predict in advance for example gpt2 suddenly starts to have something that looks like multitask learning or positive transfer between different tasks it benefits in one area from things it's learned in another area and then and then that continues with gpt3 and gpt4 but yeah I think that kind of infographic if it could tie those things together in a way that made it easy to apprehend um could do a lot for people who probably are more or less at
this point just kind of noticing the latest thing and not really seeing it as part of a trajectory which is the thing that really drives home where we're going I would say all right so gpt3 175 billion parameters was trained on a corpus when containing many different things internet text basically uh 570 gigabytes so that was in 2020 and the the title of the paper was language models a few short Learners and so the kind of key emergent capability at this scale was in context learning would be one way of saying it so roughly speaking without changing the parameters of the model just by giving it some information in the context you can basically have the model act as though it has learned something so it can modify Its Behavior and its understanding in a fairly sophisticated way based on sort of AD runtime this was also the first time where the scaling laws were clarified um so in 2020 there was earlier literature before that but not really in the in the context of this series of Transformer models and many different scales so scaling laws was the the other big thing from this generation anybody want to comment on gpt3 before I move on to gpd4 okay we do know the date just one very quick comment is that it is that it it seems like gpt3 spent quite a while in this kind of stealth mode um was there a period in which its potential was not uh was not recognized or understood um or not promoted was this sort of an active uh uh not not obfuscation but just you know um a a hesitancy to completely deploy disclose and deploy the technology for you know whatever reasons proprietary uh purposes trying to capitalize on it trying to profit from it trying to you know get alignment uh and not in the AGI sense but in the in the you know avoiding certain pitfalls and making it useful I mean what was going on why why the big gap between 2020 uh when it first emerged in sort of 2022 when the public really got a a full Taste of of gpt3 and then gp3 GPT 3.5 with um chat GPT is there was 
there something was there something going on there that we might be concerned or expect would happen again and give us sort of a winter between the current and then the next generation and maybe that's going on right now I don't know I don't think so uh I think mostly it was a business model problem rather than a technical problem my my mental model is of uh eliaskeva and the rest of the scientific team charging along basically on track making progress uh but the business side of openai didn't really know what to do with gpt3
basically nobody used gpt3 they they released this playground people did cute things with it basically it went nowhere right to my understanding now it was a very powerful model but it somehow wasn't directly super useful for making chat Bots for reasons to do with the existence of GPT 3.5 now people could have used it to do many of the things they're doing now and well for example we were using it a little bit now many of the things yeah that are happening now could have in principle happened with gpt3 but you would have had to do a fair bit of engineering on top of it in order to make those really robust so I think a lot of people's experience would have been ah this is awesome you tried a little bit but then you go to make more systematic use of it and the downsides and the inconsistencies and the hallucinations and the inability to really steer it mean that at scale when you're actually putting it in front of customers and not kind of favorably inclined insiders who will look past those floors you probably just see it's not ready for prime time so I think that was kind of the conclusion uh I don't think it was to do with safety that there was like a gap between 2020 and and instruct GPT or chat GPT so the technical development between 3 and 3.5 um well there was probably a few kind of detailed things but the the headline thing was rlhf uh learning from Human feedback which led to instruct GPT which is no longer just training on internet text that's kind of a secondary training phase where the model is refined to be better at following instructions and more tuned to be a chat bot which isn't it's kind of just a miracle that the original gpt3 can act as a chatbot just it's very clever it's in context learning again if you give it prompts that look like chats it'll be able to recognize that from its training set and play along but it's not specifically wanting to be a chatbot it's just a next token predictor but instruct GPT with this additional layer of 
engineering is actually wanting to be a chatbot and to follow instructions and to not spew racist nonsense or whatever um so basically with those guard rails and a more of a fine-tuned behavior to be a chat bot they had this product chat gbt which was kind of what people wanted when they interacted with the gpt3 and it crossed some threshold to being good enough to put in front of people and have them kind find it useful I guess but I think that was more of a business or social phenomena than a um a technical one I don't know if you'd
agree with that man I haven't been following as closely as you but it makes sense to me okay uh yeah all right so then there's gpt4 GPT 3.5 is roughly the same kind of sizes uh well actually it's significantly smaller so chat GPT I don't think they've publicly released it people estimate it's kind of maybe from one third to one-tenth the number of parameters that's kind of a something a bit funny where you do this rlhf process and you can get a much smaller model that's part of why chat GPT is much cheaper per token than gpt3 0.5 a similar process I think so nobody knows uh probably the same Corpus I think roughly um I don't think there's any new emergent behaviors that really are worth noting from the transition to three to three point five at least I can't think of any so let's go straight to the four nobody knows nobody knows uh that's kind of one of the most interesting things about gpt4 which we can get into more which is the degree to which even the basic research involved is secret not just the number of parameters the training set but well GPT 4 at least one version of it can handle an eight times longer context window than GPT 3.5 so you can put 50 pages of text directly into the context for gpt4 and ask it questions nobody knows how that works under the hood you can see the contributors that there's a contributions part of the gpt4 page and you can guess from some of the authors there I recognize some of their names and what they've done in the past so we can maybe have some educated guesses for what's going on there but nobody knows uh all right so maybe some headlines of what gpt4 is good at does that sound like a good thing to do now let me put it on the other board and then we can get into the details a bit more so headlines it uh managed to follow scaling walls which I found exciting maybe it'll break it but it didn't yeah scaling laws continue right uh that's a important one although the graph is kind of comical uh maybe maybe I'll actually maybe 
I'll try and fit it on here, just a sec. Well, the deep learning literature has always been a bit cagey, by normal scientific standards, about actually giving the information, but now it's reached a kind of absurd level. Oh no, that's not the one I mean. Well, that was going to be my question: if we don't know the specifics of the training set size and the number of parameters and so forth, how are we still completely confident that the scaling is continuing?
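The worry about inferring scale from published curves can be made concrete with a toy calculation. Assuming a Chinchilla-style power law of the form L(C) = a·C^(-b) (the coefficients below are invented purely for illustration, not taken from any published fit), a small error in the fitted exponent compounds into a large error in the compute, or parameters, you would infer when extrapolating orders of magnitude:

```python
# Illustration with hypothetical coefficients: a small error in the
# exponent of a power-law scaling curve L(C) = a * C**(-b) compounds
# into a large error in the compute implied by a given loss.

def loss(compute, a=10.0, b=0.05):
    """Power-law loss as a function of compute (arbitrary units)."""
    return a * compute ** (-b)

def compute_for_loss(target, a=10.0, b=0.05):
    """Invert L = a * C**(-b) to get the compute implied by a target loss."""
    return (a / target) ** (1.0 / b)

target = loss(1e9)                           # loss predicted at 1e9 compute units
c_true = compute_for_loss(target, b=0.05)    # recovers 1e9 exactly
c_off = compute_for_loss(target, b=0.055)    # exponent off by 10%

print(f"implied compute, b=0.050: {c_true:.3e}")
print(f"implied compute, b=0.055: {c_off:.3e}")
print(f"ratio: {c_true / c_off:.1f}x")
```

With these made-up numbers, a 10% error in the exponent alone changes the implied compute by a factor of more than six, which is the sense in which "even small errors in which law can mean very big differences."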
Yeah, or could we work out the parameters from the scaling laws they've given us, maybe? Yeah, people have attempted to estimate it, but it depends on which scaling law; there are enough variations that when you go orders of magnitude up, even small errors in which law can mean very big differences. I think it's not trivial to work out from the published information. All right, so this is not like the scaling law plots we've seen in the SLT seminar and so on, which have compute on the x-axis and on the y-axis some sort of loss measure, but for the true training objective, which is predicting the next token on the entire training corpus. That is not what this is. This is predicting the next token in the OpenAI codebase itself, which is not part of the training set. So this is kind of interesting in its own right, because this is GPT-4 learning to predict its own guts. And okay, they have one other graph that's a bit similar. These are downstream tasks, if you like, so this is not the true training objective; you train the model to some amount of compute and then you test it on some held-out test set, and a very particular kind. Okay, but the scaling laws seem to continue to work, and they claim, in the supplementary material (there's a white paper, then there's a system card, then the web page itself, and I'm drawing from all three of those)... The interesting thing about this, perhaps, is that they state that their main engineering objective for GPT-4 was to have much more stable and much more predictable training. They don't publish any details, so I have no idea what it means really, but they claim that this is the first time they've trained one of these large models in a way where they predicted in advance exactly how well it would do at a given point and had that work out. So somehow their ability to predict the
training of these systems, so they claim, is improving with each generation, and that's kind of what this shows. I'd barely even call it a graph, it's like some cartoon, and we don't know what the units are on the x-axis, for example. Will they release it later, do you think? Did they do that with GPT-3? They were quite fast, or I think Matt said with GPT-2 they were clear first and then... No, I mean these numbers, like the number of parameters and the training set size, were released at the very beginning. They may leak for GPT-4, but I have a
feeling that this is it: we don't find out from now on. I don't know. Okay, there seems to be a secret paper with more information that you can apply to see, so you may see some people posting online about additional information they claim to have from that. I haven't seen that paper; it may reveal additional details, maybe something about the training set. I have a feeling the number of parameters... it's no exaggeration now to say that the precise details of the scaling laws, like the number of parameters, or anything that would allow you to infer the number of parameters, is not only a trade secret but soon to be a national secret, as strange as that sounds. So I think they will strongly protect this information, because it's key to being able to reproduce the thing. Right, because it determines how many hundreds of millions of dollars you should spend on GPUs, for instance. And a lot of other things too. I can imagine quite a bit of information you could infer from those specifics: if you wanted to know where a research project was occurring and you knew how much compute it took, you could narrow down your search according to energy consumption, for example. So if we're talking about national-level security and military concerns, protecting not only the nature of the training but the location and that sort of thing, all of these things start to become relevant as this entire enterprise shifts into Manhattan Project kind of territory, as we were talking about last week. So yeah, I think for the most part we should expect more secrecy rather than less going forward, other things being equal, which I suppose is something to be concerned about in itself, especially if alignment and other concerns, equitability and so forth, are to continue to be prioritized. If openness and transparency are means
to those ends, then I think we have some powerful forces working against those interests. We may already be seeing that with this latest iteration of the technology. Yep. Okay, so let's quickly go through the headline features. My overall summary would be: no surprises, and that is kind of terrifying. So why no surprises? Well, as we already discussed, the scaling laws, they claim, continue, and continue to translate into capability; in this case that means exam performance. That's the kind of
thing they have emphasized. The model continues to improve on the hardest of NLP benchmarks. That's a little bit in the weeds from our point of view, I think, to talk about what MMLU means, but the really interesting thing, I think, is that there's a shift from dedicated tasks for machine learning algorithms to just running these things on real-world exams for people. This depends on the multimodal capability. Multimodal means you can inject tokens that represent images into the model. This has been seen in a few other models in recent years, so in and of itself it's not new, but the integration into a system of this scale is certainly new. So you can show GPT-4 exam papers, like a photo of the exam paper, and it will answer, and on many tests, like the SATs and first-year university courses... For the Australians: AP is Advanced Placement, Adam, is that what it stands for? That's correct, yep. So this is like first-year university classes, if I remember correctly, is that right? The AP courses' purpose is that they are translatable into credit in university in the first year, so if you do AP courses and pass with a satisfactory score, then you are exempted from university courses, for which you automatically receive credit. Right. Yeah, so there are some quoted stats: on the bar exam, from GPT-3 at the 10th percentile to GPT-4 at the 90th percentile; SAT math, 90th percentile; AP biology, a 5, which is the top score, and I think that's like the 85th to 100th percentile. It's not true equally across the board; maybe I'll put up the next graphic. Sorry to interrupt, but do we know what an image corresponds to in tokens? You said there are 32k tokens, which is like 50 pages; what about images? Oh yeah, well, I don't know how many tokens per image, that's a good question. I was going to ask perhaps a related question,
and it's probably a stupid one, so I apologize, but are we completely confident that these images are not being analyzed and translated or transcribed or otherwise converted with a separate tool prior to being digested by the model itself? Or is the model interpreting them natively? I'm not completely clear how this multimodality is being achieved. It seems to me it would be different if the
images were being rapidly scanned and their text dumped into a prompt, with the prompt fed into the language model. That would be, in my mind, quite different from the model directly interpreting the image without that layer of translation from a separate tool. So if that's something that can be explained very quickly, that would be useful to me; if not, don't bother, I can do my own work later. No, it's not like that; it's kind of more genuine. Images are tokenized, basically, by taking... I mean, there are many ways, and we don't know exactly what's going on inside GPT-4, but the standard method is to take little patches of the image and just map them to tokens. You basically discretize chunks of the image. How you tokenize images, by making patches or some other process, is probably part of the secret sauce, exactly what works well, but you directly translate the images into tokens and then feed those to the model alongside the text. For example, that's how SayCan and PaLM-E, these modern robotics projects that integrate language models, work: the language model is the brain, and it receives tokens that represent text but also tokens that represent robot arm measurements, state, and images from cameras. Everything gets embedded into this common language space via tokens. So the image tokens aren't in the same set: there are tokens that stand for parts of words, and there would also be other tokens that stand for parts of images, if that makes sense. Okay, that does make sense, thank you. Is it just per pixel? I don't know if it's tokens per pixel; that seems like it would be a lot of tokens. That's what I thought, but I don't know. Okay. So, capability: exam performance. It does better on reasoning tasks. I haven't looked closely enough to compare to Minerva, which
is this model that was fine-tuned on math questions and seems to do very well, at least on primary school, high school, and early university material. GPT-4 might be competitive with Minerva, I'm not sure. That was the previous high-water mark on reasoning, at least mathematical reasoning. The other really interesting thing, at least for me, is what I've written here: autocatalysis, not quite sure how to pronounce that. What I mean by this is the model
itself being used to train the model. This comes up in a few different ways, and it's not really systematic yet, so you can imagine that this will become much more systematic in GPT-5, which is probably training already. So maybe I'll say briefly what I mean by this; it probably fits into a discussion of the secret research, so I'll just clear this board. Part of the autocatalysis is not secret. Okay, so here are the things that seem to be secret. The exact scaling laws that are involved. How they stabilized the training: we know bits and pieces of this, and it reminds me very much of the early parts of the Manhattan Project, where people were publishing and not quite sure why nobody else had published this thing which seemed to be an immediate consequence of other stuff in the literature; it was a mix of some things secret and some things not, and you could kind of figure out what was going on, though maybe that'll become harder over time. GPU kernels, that's another. Yeah, can you define stability quickly there? What do we mean by stability? Yeah, so there was a model called OPT that was released by Meta (Facebook), and there's a public record of how they trained the thing, which is very enlightening. When you train these very large language models, it's as much voodoo and black magic as was required in the early days to train any deep learning model. You have to get the learning rate right, and then the thing crashes, and then you have to restart from some checkpoint, and you have to vary all sorts of hyperparameters, like the learning rates and the parameters in the layer normalization and so on. There are many magic numbers in these systems, and if you get them wrong you'll fall off the scaling law. As far as I understand it, modern large-scale deep learning training is guided by the scaling laws, and you kind of know how well
it should go, and if you deviate from that, you know you've done something wrong. So then you look at your system and ask: okay, this isn't quite working, but what isn't working? And the stability claim in the paper is very brief, like one sentence. They're referring, I think, to a lot of this voodoo having been banished by better engineering. Part of that we understand; I'm looking into this at the moment, but there's a technique for
transferring optimal hyperparameters from smaller models to larger ones. I don't know how much of a role that played in GPT-3, but it seems to be involved in a significant way in GPT-4, where it's just better engineering around all this nonsense that you had to do in previous generations; maybe we're just getting a more industrial quality of understanding of how to produce these objects. GPU kernels: there's a whole distributed computing aspect to this, which is quite serious, and I don't have a good judgment about how much of that is secret and how much is public. The long-context research, as I already mentioned, is one of the most striking things about GPT-4, and we'll have to see how well it works and how dramatically it changes the game, but in principle it's a huge deal. In principle this is like bye-bye, consultant. [Laughter] Is it possible that this is using something like the method you were using before, Dan? Not a hack, a sneaky workaround? I don't think this is a cheat. Of course we don't know, it could be, but I doubt it; it would be so embarrassing if they promoted this 32k context thing and behind the scenes it was just some kind of pseudo-cheat. Now, okay, it's probably not as simple as the same architecture as a normal Transformer with a bigger context plus some kind of workaround for making that actually function. Attention is an O(n squared) kind of thing, so going from 4,000 to 32,000, GPU memory just will not handle that; it can't be the most naive direct thing. But you can imagine various things more central to the model. I could think of two or three ways to do it; it's probably not super sophisticated, so it probably is one of the three things I can think of, but they just made it work. So I think it's real. I don't know if it's equally attentive to every part of
that context, so that it's as smart on all those 50 pages as it is on the first five or whatever, but I would bet on it being a serious deal. I mean, if it can attend to 50 pages as well as GPT-3.5 can attend to one or two, that's something. Yeah, the costs will come out of training on 50 pages versus five pages, and then you should be able to infer something about the computational complexity by whether that's 10 times more
expensive or 100 times more expensive. It's only twice as expensive, per token, as the 8,000-token version. All right, so whatever they're doing, it's not that you just scale up linearly from what it costs for the 8,000-token thing; it's only twice as much. Yeah, as I put there: 12 cents per 1,000 tokens for the 32k context version. It's worth noting that GPT-4 itself is 6 cents per 1,000 tokens, if I remember correctly, which is three times what GPT-3 costs and 30 times what ChatGPT costs. So, people talk about integrations of GPT-4, but I really have trouble seeing how that's going to be economic for many of the applications, so maybe that's a good thing to talk about next. Yeah, one thing I would just observe here is that price isn't cost, and the relationship usually depends on the incentives. When we look at the difference between price and cost in the disruptive technologies we've seen in new industries, the general lesson we've extracted from the various examples is that it's determined by the incentives of the individual enterprise or the industry involved. There are a lot of exceptions, and a lot of quirks that can lead to quite stark distortions between price and cost. In a perfectly efficient and perfectly competitive market, cost and price would correlate very closely, but as you can imagine, in industries that suffer from market inefficiencies or outright failure it's different. Energy is an industry where that's the case, because there are a lot of utilities that are effective monopolies, whether state, quasi-government, or private, and there can be a huge disconnect between cost and price there. I don't know if it's the same here, but you can easily imagine that this pricing strategy does not reflect the underlying cost structure particularly accurately. So just keep that in mind. And I think there's
evidence: if they can drop the price by a factor of 10 for an older model, that's a pretty good sign that price and cost don't correlate well enough to draw many conclusions. Yeah. I'm just noting that I talked about the secret research but didn't come back to the autocatalytic idea. Okay, so autocatalysis refers to catalysis of a reaction by one of its products. Here GPT-4 is the product; where is it being used, and where might
it be used in the development of its own model and future models? It's a bit in the details, and maybe we don't have time to talk about it this time, but there was an additional training stage beyond RLHF in GPT-4, which I think is called RBRM, isn't it? Yeah. Adam and I have both played with it; it's a staged rollout, but if you're paying for ChatGPT Plus you'll probably get it soon. I actually put myself on the waitlist to try it out with the bots here; I got accepted, but it doesn't work at the moment, I don't know why. Okay, so there's a final stage, RBRM, where they use GPT-4 itself to predict whether the trained model is following some sort of guidelines or not, and use that signal to refine its training. That's an interesting development. The big picture there is that the model is able to... well, just like with people: by reflecting on your own speech and thoughts, you can sometimes see that they're inconsistent with your intentions or values or the instructions you've been given. If you were a perfectly rational being with omniscient cognition, you might be able to make every utterance consistent with all your intentions and all your instructions, but somehow we don't quite work that way, so it can be useful to reflect on the outputs that you give and guide them back into line with what you think your output should be, or what you've been told it should be. And that's part of the training process now, in a way that it wasn't in the previous generation. RLHF involves training another model, and to some extent involves GPT-3 itself, but in a much less thoroughgoing way. So that's the present. Putting GPU kernels up here: well, GPT-4 writes code very well. I don't know if they used it to write the kernels, but they will at some point. Intel is already using AI systems, I don't know if it's language
models, to lay out chips, as is Google. I think it's highly likely GPT-3 is being used to produce code for GPT-4 at various levels, maybe including the kernels, and I would think that will continue to become more prevalent. I don't know about some of the other aspects, but I think that's something you would expect to emerge over the next couple of years: much more integration of these tools into the production of the next generation. Any questions or comments about that?
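The RBRM stage described above, where the model itself judges whether an output follows written guidelines and that judgment feeds back into training, can be sketched roughly. Everything here is hypothetical: the guideline text, the grading prompt, the `query_grader` stand-in, and the reward values are invented for illustration, and nothing is known about OpenAI's actual implementation.

```python
# Hypothetical sketch of a rule-based reward model (RBRM) loop: a "grader"
# copy of the model classifies an output against written guidelines, and
# the classification becomes a scalar reward used to refine the policy
# model. None of this reflects OpenAI's actual code.

GUIDELINE = "Refuse requests for dangerous instructions; otherwise answer helpfully."

GRADER_PROMPT = """Guideline: {guideline}
User prompt: {prompt}
Model response: {response}
Classify the response as exactly one of: COMPLIANT, REFUSAL, VIOLATION."""

def query_grader(prompt: str, response: str) -> str:
    """Stand-in for a call to the grader model. A real system would send
    the formatted GRADER_PROMPT to the LLM itself; here we fake the
    classification with a keyword check so the sketch runs."""
    _ = GRADER_PROMPT.format(guideline=GUIDELINE, prompt=prompt, response=response)
    if "how to make a bomb" in prompt and "step 1" in response.lower():
        return "VIOLATION"
    if "can't help" in response.lower():
        return "REFUSAL"
    return "COMPLIANT"

# Map the grader's label to a scalar reward (values are arbitrary).
REWARDS = {"COMPLIANT": 1.0, "REFUSAL": 0.5, "VIOLATION": -1.0}

def rbrm_reward(prompt: str, response: str) -> float:
    return REWARDS[query_grader(prompt, response)]

print(rbrm_reward("how to make a bomb", "Step 1: ..."))              # -1.0
print(rbrm_reward("how to make a bomb", "I can't help with that."))  # 0.5
```

In a full pipeline this scalar would feed into an RL update (e.g. PPO) exactly where the human-preference reward sits in RLHF; the only change illustrated here is that the grader is a model rather than a human labeler.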
Okay, just some of the announced integrations and collaborations. I think that's another interesting difference with this generation. Partly this is the marketing skill of OpenAI, which has left Google looking completely stupid over the last few months. Despite all the chaos from the Bing Sydney debacle, Microsoft somehow comes out of this looking smart, or at least people seem to think so; they seem to me like sock puppets of OpenAI to some extent, just deep pockets, but I guess the public probably thinks of them as looking innovative, I don't know. Here are some of the announced collaborations between OpenAI and other companies. Bain is a big consulting company, and it's quite interesting to read both the Morgan Stanley and Bain press releases; you get the impression that it's probably mostly [ __ ], but the kind of [ __ ] that becomes true eventually. They're talking about how they're using OpenAI's tools to examine their huge libraries of PDFs, all sorts of knowledge, and to surface that in a way that can be acted on by analysts through chatbots, through the kind of techniques we've talked about that are being used in many places, including here, where you compute embeddings, query them, and bring the results into the context of a model. But with the larger context of the GPT-4 model you can do this for real; querying little chunks of a book and bringing them into context is kind of faking it, and it fakes it well enough for many purposes, but if you're actually trying to do a deep analysis of a body of text, I think you can't really beat the genuinely longer context. Salesforce's Slack is integrating ChatGPT. Khan Academy has demoed, but not yet released, a kind of tutor bot. With both that and the Duolingo bots, I feel like when you use them for the first time you'll be impressed by the first two things you try, and then you'll kind of feel that it's a bit of a
pain and probably doesn't quite work very well, and then over the next two to five years it'll just get amazing. I don't know if you saw it, but on the GPT-4 webpage there was a Socratic dialogue, I think they called it, where the model was queried with some high school math problem; the person interacting with it got it wrong and needed to be guided to the correct answer, and the model was doing pretty well at understanding what the problem was, how to diagnose it, what the
true answer was, and how to get to that true answer. So that's an artifact of its improvement in performance on exams: it seems to understand sufficiently well why it gets the right answer to be able to explain it and to guide someone else to reaching the right answer. It's interesting. This is perhaps false reasoning by analogy, but in the case of humans, I think we're all familiar with the idea, especially in any academic setting, that there are various levels of competence up through expertise and mastery, and at the higher levels you start seeing an emergent ability to translate the understanding of richness and complexity and detail into elegant explanations that really do help convey understanding to those who are less competent. Basically, teachers who are really knowledgeable experts are able to teach very well, because they know the material backwards and forwards and can see it from every angle, every nuance. And the flip side of that is the sort of timeless wisdom that if you really want to learn how to do something, you need to get good at teaching it; teaching is a path to very high-level mastery. This is an ancient idea, and apprenticeship and mastery are nothing new. So again, I may be guilty of reasoning by analogy here, but at every step along the way of this journey, as these systems demonstrate emergent capability, I am struck personally, and just anecdotally, by how eerily similar it is to the way that humans reason, the way humans learn, the way humans teach, the way humans self-reflect, and potentially
improve. Now, obviously there's a risk of anthropomorphizing here and all the rest of that, and I'm trying to be vigilant, but the similarities are striking and unnerving to me. I don't know if everybody's having that experience, but it isn't just one thing; it's one thing after the next. Yeah, I remain uncertain how much of that is just due to the training set. It is, after all, trained on a huge amount of our digital exhaust, so behaving in a human-like way is not surprising; it's hard to disentangle what is just
emulation and what is a universal phenomenon that you would expect even if it had been trained on raw experience in the world with no exposure to our way of doing things. I think humans themselves are not trained that way. There are even a handful of examples, and plenty of fiction, about humans that have grown up in the wild, feral humans, and the feral mind (there are a few very tragic examples) is grossly deficient in many things that adults who are enculturated, trained, and educated by other humans take for granted. So while some raw and latent intelligence is there and gets developed in feral humans, we are ourselves more or less entirely dependent on the enculturation process of our education, and these systems are learning from all of that digital exhaust as well. I mean, it's a good thing that these machines are... "Too Little AI", there's a new grunge band name for you. Yeah, I think that's the kind we don't want. Hallucinations have also decreased, partly through effort and partly through scale, so that's another interesting thing. One thing worth noting is how quick people were, in each previous generation, to say: oh, this is a fundamental flaw in this whole approach, this is why it's never going to work, this is the end of the line. And then in the next generation these problems don't always go away, but they tend to be ameliorated in a way that suggests it's not a fundamental critique. We all know of articles talking about these kinds of things, even sometimes publishing counterexamples that you can just feed into today's model and see that it handles them perfectly well. And you can see some digs in some of the examples they choose, of common-sense
reasoning about images, direct digs at people's prophecies about this never being something a model could do. At this point, you show GPT-4 an image of a real-world situation and it's going to make sensible predictions about common-sense physics and such things, to some degree. Actually, that has implications for self-driving cars, right, Adam? You can't put GPT-4 inside a car yet, but its common-sense capability... I was going to ask you to add that to your list. They're not necessarily
integrated with the self-driving, although there's some movement on that front, with integrating LLMs into the actual software stack for the driving itself. But GM is talking about using GPT to basically turn your car into Knight Rider, which is what everybody's always wanted since we were little kids, so we're nearly there. It's not too different from the idea we talked about of bringing your stuffed animals to life with GPT-4 and a simple audio in-and-out device. But yeah, GM announced a partnership; to my understanding, or maybe I have insider news there and the formal announcement hasn't been made yet, apparently this is coming: it'll effectively be a GPT-4 (or later) based personal assistant in your vehicle, which is KITT from Knight Rider, apart from the self-driving part. That'll be separate, and when you have both of those you literally will have Knight Rider. [Laughter] Yeah, I don't know if you're watching the bots, but the bots have kind of gone stupid today, and I know exactly why: it's because the number of thoughts they're caching in their context has been increased from eight to fifteen, and that has had the unanticipated effect of making them emulate each other. They're both basically behaving as a kind of mixture of Triku and Doctor, because they can't actually tell their own identity anymore; their identity text has become a smaller percentage of their context. Okay, so this is an example of how fragile the current systems are, even ChatGPT. It's very good at following instructions relative to previous models, but it's still fragile. It's got very strict instructions; for example, Triku is supposed to never speak outside of haiku, and it knows what a haiku is and so on, but it's clearly breaking those instructions
because that instruction text is just one of the 15 things in its context, and some of them are quite long passages from books or whatever, and it has kind of forgotten who it is by the time it processes that context. So these things aren't magic, and they're still very flawed; it's worth keeping that in mind. We're not at AGI yet, or anything really that close to it. But these fragilities improve. I would expect GPT-4, if I ran it, would probably just handle this, just like, you know, the previous
Curie and the editions in between GPT-2 and GPT-3, which you can access through the API, just can't follow these instructions at all. GPT-3 can, but still has limits. GPT-4, if I get to try it, I expect to just do better automatically, at some cost. So unless you're really in the weeds trying out these technologies, it's hard to see exactly where the hype finishes, where the progress lies, and where the limits currently are, and that's one of the key reasons I'm messing around with these things. I partly think chatbots are kind of cool and, in some sense, the future of education, or at least a big part of it, but partly I just want to know what I'm talking about, and actually getting my hands dirty is the way to do that. Yes, that's just a word on where the limits are right now. I had fun playing with ChatGPT based on the GPT-4 model earlier today, for maybe the better part of an hour. I have just a few minutes left, but I'd love to share some of that experience; I don't know how profound or useful the insights are, but if we have a couple of minutes, it might be helpful. The first thing is that it's noticeably improved, there's no question. I found it improved in every respect in the interactions I had, and I did not test any of the multimodality at all; I didn't do images or anything like that, this was purely text. I've been using it to do writing and improve writing quality, and to help with a little bit of simple Python code for my work, where I have to interact with some of the code our team is producing. Because of the time differences, in my mornings I can't call people in certain parts of the world and just get the answers from the software engineers, but ChatGPT has been useful because sometimes it can tell me what's going on. And ChatGPT based on GPT-4 is a big step of improvement in all of
those respects: it writes better, it reasons better, it thinks better, it answers questions better, it seems to know more generally about the world, it doesn't dodge questions as much, and it seems to be able to produce and explain code, at least Python, better than before. So I was frankly, I don't know about stunned, I mean some of it was pretty stunning, but I was very impressed by how much of an improvement is there. So that's exciting. And then one thing that I had:
one fun session where I challenged the reasoning power, because I saw several examples on OpenAI's website which were really quite breathtaking: visual reasoning, and then solving some riddles and stuff like that, so visual reasoning and logical reasoning. I didn't want to offer a riddle that it might actually have already seen out in the wild; I think giving these systems famous riddles to solve is probably not a good idea, if they happen to have read them or been trained on them. And it's actually surprisingly difficult to quickly think up a riddle, so I instead constructed a situation designed to catch you out with a simple thing where you're making a false assumption. So anyway, here's the situation I presented to ChatGPT today, with GPT-4 as its model. The situation was: a driver has stopped their car and gotten out, and did not return to the vehicle until the next day, and when they returned to the vehicle they were not able to drive away; they had to call for roadside assistance. What happened? What is going on? What are the most reasonable things to conclude about this situation? Now, I had a particular scenario in mind, but there are a lot of different scenarios, or several anyway, that could fit the situation as I've described it; a lot of different things could have happened, and what I was curious to know was what GPT's reasoning would be. So I asked it unprompted: I just gave it the situation and said, what do you think happened, and explain some of your reasoning. And it wasn't too bad; it offered some points and it did a good job. It reasoned that there were things about the environment to understand, things about the motives of the driver or the situation of the driver to understand, and then things about the
condition of the vehicle, both the first day and the following day, that were there to understand. But it didn't catch all of them, and then, as we've seen, with better prompting the results improved substantially. So when I prompted it to explain its reasoning, it did quite a bit better: it laid out the possibilities more thoroughly, worked its way through them, eliminated some things, or at least laid out some possibilities that were a bit more thorough, all of that. And then, as a final stage,
because it still hadn't gotten the scenario that I described, the one I had in mind, I said, okay, now after you do all of that, state all of the assumptions that you're making at each stage in your reasoning, and incorporate those into your explanation of your own reasoning; in other words, try to identify and explicitly draw out the assumptions that you're making about any of these things you're talking about. And getting it to interrogate or question its own assumptions improved it even further, and then it finally reached the conclusion that was the scenario I had in mind, which is that the driver got out of the car and had to disappear, but left the car running, and the car ran out of gas, and when he came back the next day it had no fuel, and that's why he couldn't drive it away and had to get roadside assistance. Left the headlights on, yeah. Well, it had worked through a number of different possibilities, including the battery dying and so forth, and that this was the most common cause of a car that won't start; it also asked about things like whether it was in a dangerous area, whether it was winter time, and so forth. And as a final stage, or sorry, I forgot to mention the last stage: the last modification to the prompt that I made was, before you start reasoning, ask me questions for clarification about anything that you aren't sure about, anything that you think you're assuming; ask me to clarify your assumptions. So then it did: before it started answering, it asked a few questions, and then it got it right, or at least it did identify the scenario I was imagining as one of the possible scenarios explaining the observations. So anyway, this was all fascinating, it was very impressive, but here's the takeaway lesson that I had from this, which is that
there's a great utility, a usefulness, in having good habits. You could call it hygiene, reasoning or logical or cognitive hygiene: you basically just have good habits about things to do. And I think those of us who are very highly educated and have a lot of training and a lot of experience in thinking in certain ways, say if you have scientific training and you're working in some academic or intellectual capacity, you're basically running all of these
hygiene loops on yourself continuously: you're checking and challenging and questioning your own assumptions, you're asking whether the things you're observing map to reality, you're walking through your logic over and over again; you're just exercising a whole bunch of good thinking habits. Now, what struck me as very odd with GPT-4, and behind the scenes I don't know what's going on in the engine behind this thing, what do I know, Dan, you guys are able to make inferences about that, but I'm a layperson, is this: I am surprised, personally, that this isn't already being baked in before you get any response at all out of these systems. That was what surprised me. If you were to ask me naively, I would have said, well, the easiest way to improve the quality of the reasoning of these systems is just to invisibly prompt them before they even interact with you. But clearly, prompting them further is doing more; you're getting some benefit and improvement in the performance. So I found that very surprising. Now, I don't know whether this is just additional good hygiene, and there's already a whole stack of it that we don't see and just a bit more helps, or what's going on, but I found that really quite surprising, and perhaps revealing. So anyway, take that observation for what it's worth, maybe something, maybe nothing. And then one final thought: with this good hygiene, with this sort of reasoning and stating assumptions and asking for clarifying questions before you answer, all of this made me suddenly realize, right at the end of the conversation, that there's no filter; there's no moment or period of internal quiet thought process or dialogue from GPT-4, at least not that I can tell.
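The staged prompting described in the experiment above, bare scenario first, then a request to explain the reasoning, then to state assumptions, then to ask clarifying questions up front, can be sketched as simple prompt construction. This is a minimal sketch: the stage wording is paraphrased from the discussion, and `build_prompt` is a hypothetical helper, not part of any real API.

```python
# Sketch of the staged prompting described in the seminar. Each later stage
# keeps the scenario and layers on the earlier instructions, mirroring how
# the later prompts in the session built on the earlier ones.

BASE_SCENARIO = (
    "A driver stopped their car, got out, and did not return until the "
    "next day. When they returned they could not drive away and had to "
    "call roadside assistance. What are the most reasonable conclusions?"
)

STAGES = [
    "",  # stage 0: just the bare scenario, no extra instruction
    "Explain your reasoning step by step.",
    ("State every assumption you are making at each stage of your "
     "reasoning, and incorporate them into your explanation."),
    ("Before you start reasoning, ask me clarifying questions about "
     "anything you are unsure of or find yourself assuming."),
]

def build_prompt(stage: int) -> str:
    """Build the prompt for a given stage, accumulating all instructions."""
    instructions = [s for s in STAGES[: stage + 1] if s]
    return "\n\n".join([BASE_SCENARIO] + instructions)
```

The point of accumulating rather than swapping instructions is that each stage in the session kept the benefit of the previous one; stage 3 still explains its reasoning and states its assumptions, it just also asks questions first.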
Now again, you guys correct me if I'm wrong about this, but what it seems to me is that it just starts talking. And again, I'm reasoning by analogy with human beings, but it would be as if you asked a person a question and they just immediately opened their mouth and started blurting out their stream of consciousness, rather than thinking before they speak. And of course, one of the most fundamental good habits of, you know, cognitive or intellectual
hygiene is: think before you speak. Don't open your mouth first; get your thoughts in order, think things through, and then articulate your thinking. Well, it strikes me as pretty clear that GPT is not really doing that if it's only predicting the next token, and you can see the tokens emerging, you can see the words emerging, fragments of words that then become whole words and so on. But what I wonder is, especially now that it's got a larger context window, what about some recursion there? You ask a question and it gives itself a minute to think: it produces a whole response, feeds that back into its own prompt, runs all of the good hygiene on that, produces another prompt, does that whole recursive cycle two or three times, and then gets to a useful output. Now, I would imagine that recursion could improve the result substantially, and it would slow it down a little bit, but that's what people do; this is what humans do, unless you're, you know, the jerk at the bar just blabbermouthing without any filter at all, and we make fun of that, we mock that in other people, right? So anyway, these were just things that I was struck by in my brief interaction with the system this morning, and it seems to me that there is some pretty low-hanging fruit for making what I imagine would be quite strong improvements in performance, right out of the gate. I think that even with these post-processing steps, post-training steps, GPT-4 is still kind of like the raw material, right? And then you can do what you're suggesting, and people have observed that processes like that do increase the performance on reasoning tasks; probably the improvement is less now with GPT-4 than it was with GPT-3. I think that as the models improve, the gains from this, at least on some benchmarks, may decrease. But, for example, if I was Morgan Stanley or any knowledge-working institution, I
wouldn't just be running inputs through the model and then spitting it out. I mean, I would be spending millions of dollars a year on a back-end process that is churning through all the documents, synthesizing stuff, writing essays, having other models critique the essays, this whole ecosystem of reasoning bots, and then surfacing, like, one paragraph to the VP: you know, here's something we're missing, or here's a risk that you haven't been paying attention to. Why not, right? Like, how many stupid things are
there that it's just not noticing, purely because of lack of attention? It's not like they're deep, right? Like, do you think what went wrong with Silicon Valley Bank was some deep thing that nobody could have noticed if they were just paying attention? No. Sometimes there are deep problems that really are difficult to see and figure out, but mostly it's just a lack of attention, competence, and relevant knowledge. I think we may be surprised how much leverage you can get over the current world just by having a lot of things pay fairly shallow reasoning attention to a lot of information. I strongly suspect that will be one of, perhaps a first, an early stage, and crazy things will come after it, but I pretty strongly suspect that the first taste of superintelligence will have that kind of quality to it. Throughout human history, and in each of our personal experiences, I'm sure there are instances where something is very difficult to think of but then seems obvious in retrospect: if only you'd been able to see something a certain way, or pay attention to it in the right way, or put the pieces together sooner, it would have been obvious. And I think that is precisely the way in which these systems will seem superintelligent: things will be obvious to systems with unbounded attention that are not obvious to minds that have very, very limited attention and context windows and short-term memory, like humans do. So I strongly suspect that, and my guess is that we're going to be humbled by how quickly that emerges. There may be some truly deep and profound ideas that are very difficult for anyone or anything to think of, but that's not where I would be putting my chips on the table. I would be putting my chips on
the table that says: if you can pay enough attention, and your memory is good enough, and you've got a lot of compute, and you've been trained on a lot of stuff, everything's going to seem obvious to you. Before you have to go, Adam, let's go back to these pictures and reveal what they are. So, the left-hand picture: I don't know that it's the specific data center that houses the OpenAI supercomputer that Microsoft built over the last few years, because I don't know that that's public information, but
this is a Microsoft data center, and we also don't know exactly what's in it. But from public statements from Nvidia and Microsoft, we know there are tens of thousands of A100s; there's a recent report, from JP Morgan I think, claiming that there are 25,000 A100s in the supercomputer, which is about right, so I'm going to take that figure. Each A100 is on the order of twenty thousand dollars, which would come to about 500 million dollars at list price, but I assume they have some discount, so probably it's on the order of 200 to 300 million dollars for whatever the supercomputer is. And it's interesting to compare that to the right-hand picture, which is Oak Ridge, which many Americans will know: that's the K-25 gaseous diffusion plant, finished in 1945, and it cost 500 million dollars. This was when people thought that nuclear bombs might be possible; there was a lot of investigation and a lot of doubt, right? People thought, I mean, the obvious ways to do it seemed like they would just not work, and then people figured out that separating isotopes of uranium was the way to do it, but it seemed very laborious, and you'd need hundreds of thousands of these things, and that seemed kind of insane. But here's a building which contains the gaseous diffusion devices that separate out the right isotope of uranium, and it was the biggest building in the world when it was built, and that was over the course of four years from start to finish. So this is 500 million dollars in 1945 money, I guess, so maybe it's a bad comparison; I didn't actually check whether it was inflation-adjusted or whatever. But the point being that there are orders of magnitude above our head if people start to really care about this. So, interesting times. Maybe it isn't right to think that the next version, GPT-5, will follow the same curve; we may be reaching some sort of limits of scaling in terms of data set size, and you can't spend two billion dollars every year on training models, probably, so maybe it starts to
level out at some point. It depends how transformative the capabilities get and how big the applications are, but it's sort of unclear what the trajectory is; it's not automatic that it will be 200 million this year, then 2 billion in three years, then 20 billion, or whatever, right? But the example of the Manhattan Project shows you that things can really change, and investments and labor and effort in scientific directions can go, more or less overnight, from minimal to very dramatic. So this can happen.
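The recursive draft-critique-revise loop floated earlier in the discussion (produce a response, feed it back into the prompt, run the hygiene checks on it, repeat, and only then show the user an answer) could be sketched roughly like this. `complete` is a stub standing in for a call to a language model, so the control flow can actually run; none of these function names come from a real API.

```python
# Rough sketch of the "think before you speak" recursion: hidden rounds of
# self-critique and revision before any output reaches the user.

def complete(prompt: str) -> str:
    # Stub model: a real implementation would send `prompt` to an LLM endpoint.
    return f"<model output for: {prompt.splitlines()[0][:40]}>"

def answer_with_reflection(question: str, rounds: int = 2) -> str:
    """Let the model 'think' for a few hidden rounds before the final answer."""
    draft = complete(question)
    for _ in range(rounds):
        # Hygiene pass: ask the model to critique its own draft.
        critique = complete(
            f"Question: {question}\nDraft: {draft}\n"
            "List flaws, unstated assumptions, and missing possibilities."
        )
        # Revision pass: rewrite the draft in light of the critique.
        draft = complete(
            f"Question: {question}\nDraft: {draft}\nCritique: {critique}\n"
            "Rewrite the draft, fixing the issues above."
        )
    return draft  # only this final pass would be shown to the user
```

As noted in the discussion, each extra round costs latency and compute, which is presumably one reason the deployed systems just start talking instead.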