WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed. The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.

Synopsis


AI alignment is the challenge of making sure AI systems do what we want them to do. One approach involves understanding the structure of knowledge and computations within AI systems such as large neural networks (e.g. a hypothetical GPT-5). This may be done by analysing singularities and phase transitions, the latter detectable by measuring generalized susceptibilities. The lottery ticket hypothesis and Occam's razor suggest a related picture in which the structure already exists and training chips away at everything that is not the structure. Understanding the structure of AI systems could provide greater control.

Short Summary


AI alignment is the problem of getting an AI system to do what we want. The current status quo is an equilibrium of agents with relatively equal capabilities, and understanding the structure of knowledge and computations within AI systems seems robustly useful for alignment. It is not clear whether understanding the structure of knowledge and computation within neural networks is possible, but if it is, it may provide a way of maintaining control. Five principles or assumptions, broadly related to SLT, were outlined. By analogy with the brain, understanding such a system could help us get it to do what we want, although it may not be possible to create a textbook that explains how a model like GPT-5 works.
Singularity theory is the study of the internal structure of singularities, such as the configuration of divisors and their topological properties. Simple singularities can be analysed by several methods, such as the category of singularities, group representations, and the McKay correspondence. Estimating RLCTs does not require sitting exactly at the singularity, since the behaviour of the system near the singularity contains a lot of information about it. A singularity is a mathematical object that is highly localized yet whose effects are felt globally.
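For reference, the quantity being estimated can be stated precisely. This is the standard SLT formulation (supplied here for the reader; it is not spelled out in the talk): for a population loss K(w) and prior \varphi(w), the real log canonical threshold (RLCT) \lambda is determined by the zeta function

    \zeta(z) = \int_W K(w)^z \, \varphi(w) \, dw,

which extends meromorphically from \mathrm{Re}(z) > 0 with largest pole at z = -\lambda of some multiplicity m. Equivalently, the volume of almost-true parameters scales as

    V(\varepsilon) = \int_{K(w) < \varepsilon} \varphi(w) \, dw \sim c \, \varepsilon^{\lambda} \, (-\log \varepsilon)^{m-1} \qquad (\varepsilon \to 0).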
The structure of a trained weight parameter is assumed to be a function of the structures present at intermediate "critical weights", which store both data and algorithms. Rather than confronting the final complicated geometric form directly, one can hope to understand it by accumulating understanding over the course of training, similar to how programs are constructed from simple pieces. Phase transitions change the structure along the way, allowing further insight into the knowledge within the model.
The phase transitions observed during training of a general model like GPT are compared to sculpting marble, where the sculptor chips away at the block to reveal an image held in their mind. Relatedly, the lottery ticket hypothesis suggests that training chips away at everything that isn't the structure, unlocking components that were already there. One can also discuss the entropy of constructions: plausibly the simplest construction path to the final weight is preferred, an Occam's razor-like principle.
Phase transitions occur when a parameter reaches a critical value, causing a discontinuous jump in a probability distribution such as the Bayesian posterior. They can be detected by measuring generalized susceptibilities, quantities derived from energy and entropy by taking derivatives. During training by gradient descent, noise may bias the trajectory towards lowering entropy, creating a drift along a level set towards the deepest singularity; during a plateau, the network reorganises how it computes, and a new capability then comes online and rapidly drops the loss.
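A minimal sketch of what such a detector might look like in code (illustrative only; the choice of observable, smoothing window, and threshold are assumptions of mine, not from the talk):

    import numpy as np

    def detect_transitions(observable, window=50, z_thresh=6.0):
        """Flag training steps where an observable changes anomalously fast.

        observable: 1D array of a per-checkpoint measurement (e.g. a sub-loss).
        Returns indices where the smoothed rate of change is an outlier,
        a crude stand-in for a diverging 'generalized susceptibility'.
        """
        rate = np.abs(np.diff(observable))
        # moving-average smoothing to suppress step-to-step noise
        kernel = np.ones(window) / window
        smooth = np.convolve(rate, kernel, mode="same")
        z = (smooth - smooth.mean()) / (smooth.std() + 1e-12)
        return np.where(z > z_thresh)[0]

    # usage: mark candidate critical weights during a training run
    # sub_loss = np.load("arithmetic_subloss.npy")  # hypothetical per-step eval
    # critical_steps = detect_transitions(sub_loss)

The checkpoints flagged this way would be the candidates for "critical weights" in the sense below.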
Phase transitions occur when observables in a system suddenly change, allowing points of interest to be detected, for instance during AI training. SGD appears to have a bias towards certain points on a level set of the loss; this is suggested by rough calculations and toy models, but not yet proven. To understand the structure of a final network, the relevant change in structure at each critical weight must be inferred by analysing the model near it, analogous to a spectroscopic probe in solid state physics. This process is related to deformation of singularities; inferring the full structure at every step would be roughly as hard as solving the final problem, so the hope is to infer only the changes.
The locality hypothesis proposes that each phase transition involves a bounded amount of change, visualised using two level sets. To understand a complicated singularity germ, one perturbs it and sees whether it breaks into simpler pieces. This is reflected in the progression from W1* to W*, where each step involves a critical weight and an explanatory structure. The relevant modularity does not have to coincide with the architecture, but is rather a sense of locality implicit in the architecture and the way the neurons are connected. If true, this could allow the necessary calculations to be done in less time than training the whole network.
In chemical synthesis, tracking the constituent parts is necessary to build a large molecule; however, when training a large language model, the number of possible ingredients is not as large as the number of weights. To understand the neural network, structure is interpreted as something related to singularities, and the basic tricks the network performs can be put in correspondence with human-comprehensible concepts and logical rules. The universality hypothesis suggests that large neural networks trained on different data sets discover similar structures, which could be related to performance and have a symbolic discreteness to them. Investigating this relationship could lead to a better understanding of the division into heads.
Machine learning is seen as a process of accumulating changes that trace out a construction encoded in the trained weights. This process may allow symbolic structures and their assembling transformations to be inferred, but when scaling up, interpretability may be lost; to remain robust it is important to stick to concepts such as phase transitions and universality classes. Structures found from singularities in deep learning may be difficult to interpret, and interpreting them could require substantial human supervision. Structure finding resembles the inference of properties of materials, and can be iterated to refine the inference, at a cost in time.

Long Summary


AI alignment is the problem of getting an AI system to do what we want. Understanding the structure of knowledge and computations inside these systems seems robustly useful for alignment, though perhaps neither sufficient nor strictly necessary. The current status quo is an equilibrium of agents with relatively equal capabilities. A deep, scalable solution plausibly requires either an understanding that humans can follow or systems that can understand other systems.
Humans have inequalities and some are oppressed, but this is a matter of degree rather than kind. There is some interpretability in the current human system, and with the right capabilities one can understand it and succeed within it. It is not clear whether understanding the structure of knowledge and computation within neural networks is possible, but if it is, it may provide a way of maintaining control. It is not certain that this will work, nor, even if it does, that it is sufficient, but it is the best solution currently available.
The speaker discussed the possibility of understanding the structure of knowledge and computations within systems like the brain. They drew a distinction between looking for circuits or programs and what they were proposing, which is not necessarily the same thing. They outlined five principles or assumptions behind their thinking on this topic, broadly related to SLT. They suggested that understanding such a system could help us get it to do what we want, but noted that it may not be possible to create a textbook that explains how GPT-5 works. They then went on to explain their ideas in more detail.
The knowledge to be gained from data can be identified with the coordinates of a true parameter, and this knowledge is contained in a singularity. However, it is not clear how the structure of the truth can be linked to the structure of the singularity. Singularity theory, at least in its modern incarnation, is largely about trying to define the structure of singularities, for example via the configuration of divisors and their topological properties.
Mathematicians study the internal structure of singularities in various ways, such as the category of singularities, group representations, and the McKay correspondence. These methods are applicable to simple singularities, but not necessarily to more complicated ones. Even for isolated singularities, mathematicians may only be able to say something structural about the category, such as that its objects are built out of indecomposables. Classification programs are limited to finite or countable families of singularities.
A singularity is both a highly localized object and one with a global effect: its influence can be felt far away, in the level sets, trajectories, and topology of the loss surface. Estimating RLCTs does not require sitting exactly at the singularity; the behaviour of the system in a neighbourhood of the singularity contains a lot of information about it, which is ultimately why inference is possible.
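As a toy illustration of inferring the singularity from its neighbourhood (my example, not from the talk): the volume of parameters with K(w) < eps scales like eps^lambda up to log factors, so the RLCT of a simple analytic loss can be estimated by sampling near, not at, the singularity:

    import numpy as np

    rng = np.random.default_rng(0)

    def K(w):
        # toy population loss with a degenerate minimum along the axes:
        # K(w1, w2) = (w1 * w2)^2, whose RLCT is known to be 1/2
        return (w[:, 0] * w[:, 1]) ** 2

    # estimate V(eps) = Pr[K(w) < eps] under a uniform prior on [-1, 1]^2
    w = rng.uniform(-1, 1, size=(2_000_000, 2))
    k = K(w)
    eps = np.logspace(-8, -2, 13)
    V = np.array([(k < e).mean() for e in eps])

    # V(eps) ~ c * eps^lambda * (-log eps)^(m-1); the log-log slope
    # approximates lambda (the log factor from the pole's multiplicity
    # biases the estimate somewhat below 1/2)
    slope = np.polyfit(np.log(eps), np.log(V), 1)[0]
    print(f"estimated RLCT ~ {slope:.3f}")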
The structure of a trained weight parameter is a function of the structures present in a special intermediate set of weights, known as "critical weights". This structure is both data-based and computational, as it stores information about the data as well as algorithms for determining certain elements of it. The same picture applies, with the set of true parameters replaced by an optimal level set, to models that are not realizable.
Data and programs are the same kind of thing, and knowledge can have both static and dynamic forms. It is not plausible to take the billions of parameters of a trained model and directly read off the knowledge within them. Instead, it is assumed that the structure of the knowledge in the final weight is gradually acquired during training, so understanding the final structure can be approached by accumulating understanding over the course of training.
The speaker is discussing how a complicated geometric form can be understood by inferring the structure of weights during training. Through a series of steps and transformations, the final constructed object can be understood as a function of the structures and intermediate steps rather than the weights. The speaker suggests a path to understanding the structure of knowledge in the final trained weights, similar to how programs are constructed from simple bits. The speaker also mentions phase transitions which can change the structure.
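One way to make "the final object is a function of the structures and transformations, not of the weights" concrete in code (purely schematic; the types are placeholders of mine, since the talk leaves "structure" undefined):

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Structure:
        description: str  # stand-in for whatever S(w) turns out to be

    Transformation = Callable[[Structure], Structure]

    def reconstruct(s0: Structure, steps: List[Transformation]) -> Structure:
        """Fold the transformations between critical weights: the final
        structure S(W*) is computed from the intermediate structures and
        the transformations, never from the raw weights themselves."""
        s = s0
        for t in steps:
            s = t(s)
        return s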
An alternative picture to incremental construction is raised: rather than 3D printing, training might be like chipping away at a slab of stone. The speaker responds that in sculpting, the image resides in the sculptor's mind and is imposed on the marble, whereas the network has no such division between material and plan. The observed phase transitions, in which capabilities for particular tasks come online at particular times, instead suggest that conceptual ingredients of larger problems are comprehended piece by piece.
One picture of a phase transition is a graph with two local minima in which, as the graph deforms, a different minimum becomes global. The lottery ticket hypothesis suggests that training chips away at everything that isn't the structure, unlocking components that were already there. One can also discuss the entropy of constructions, and it is possible that the simplest construction path is preferred, an Occam's razor-like principle; the analogy offered is a homing missile versus a ballistic missile, since a homing missile has multiple chances to course-correct along the way.
Important changes in structure are assumed to be associated with phase transitions. In Bayesian learning, a phase transition occurs when a parameter (e.g. the number of samples, or sparsity) crosses a critical value, causing a discontinuous jump in the posterior distribution. Finding critical weights is fundamentally an inference problem: there is no precise definition of a critical weight, but the sequence of critical weights should be sufficient to infer the structure of the final trained weight.
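A toy illustration of such a jump (my construction, not from the talk): a one-dimensional posterior over two competing minima, where the posterior mass moves discontinuously fast as the sample size n crosses a critical value, because a degenerate (lower-RLCT) minimum is preferred at small n and the lower-loss minimum wins at large n:

    import numpy as np

    w = np.linspace(-3, 3, 20001)
    # two competing minima: a non-degenerate true minimum at w = -1
    # (loss 0), and a degenerate quartic, hence "cheaper", minimum at
    # w = +1 with slightly higher loss 0.01
    L = np.minimum((w + 1) ** 2, 0.01 + 2 * (w - 1) ** 4)

    for n in [50, 100, 200, 400, 800]:
        logp = -n * L
        p = np.exp(logp - logp.max())
        p /= p.sum()
        mass_right = p[w > 0].sum()
        print(f"n={n:4d}  posterior mass near degenerate minimum: {mass_right:.3f}")

    # the mass shifts from the degenerate minimum to the true one as n
    # grows -- a toy version of a phase transition in the posterior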
Phase transitions are detectable by measuring generalized susceptibilities, quantities derived from fundamental quantities like energy and entropy by taking derivatives. These can be measured in the lab, and a diverging susceptibility signals a phase transition; this is why phase transitions are so visible in nature and why measurements of them are used to infer microscopic structure that cannot be seen directly. During training one can likewise measure the overall loss and other observables whose rapid change can indicate a phase transition.
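A sketch of a thermodynamic-style susceptibility computed from samples (an illustrative analogy; the talk does not prescribe a specific estimator): the variance of the energy plays the role of a specific heat, and a spike in it as a control parameter varies is a classic transition signature.

    import numpy as np

    def specific_heat(energies, beta):
        """C = beta^2 * Var(E), estimated from samples of the energy
        (e.g. n * train loss evaluated along an MCMC chain at inverse
        temperature beta). A spike in C as a control parameter is tuned
        is a classic signature of a phase transition."""
        E = np.asarray(energies)
        return beta ** 2 * E.var()

    # usage sketch: scan a control parameter (sample size, sparsity, ...)
    # and watch for a spike; `run_mcmc` and `chain_energies` are
    # hypothetical stand-ins for whatever sampler is available
    # for mu in control_values:
    #     chain = run_mcmc(model, mu)
    #     print(mu, specific_heat(chain_energies(chain), beta=1.0))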
Gradient descent minimises the loss, but not all points on a level set of the loss are created equal: noise may bias the trajectory towards lower-entropy solutions, creating a drift towards the deepest singularity. From the point of view of the weights this looks like reorganising the way the network computes the loss without decreasing it. During a plateau a new capability may be getting hooked up; when it comes online, the loss drops rapidly.
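A minimal simulation in the spirit of the speaker's "one-page calculation" (my construction; the model, noise form, and parameters are arbitrary choices): fit f(w) = w1 * w2 to noisy targets around 0, so that every point on the axes {w1 * w2 = 0} minimises the population loss, and watch whether SGD drifts along that level set toward the most singular point at the origin.

    import numpy as np

    rng = np.random.default_rng(1)

    w = np.array([2.0, 0.0])  # start on the level set, far from the origin
    lr, sigma = 0.01, 0.5

    for step in range(100_001):
        y = sigma * rng.standard_normal()  # label noise around the true value 0
        f = w[0] * w[1]
        # gradient of the per-sample loss (f - y)^2
        w -= lr * 2 * (f - y) * np.array([w[1], w[0]])
        if step % 20_000 == 0:
            print(f"step {step:6d}  w = ({w[0]:+.3f}, {w[1]:+.3f})")

    # if the entropy/degeneracy bias is real, w drifts along the zero
    # level set toward the origin even though the loss is already minimal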
Phase transitions occur when observables in a system undergo a sudden change, and this can be used to detect and mark points of interest during the training of AI models. Detecting phase transitions is expected to be cheap, which makes them a useful tool for AI safety: one does not need to estimate RLCTs to know that a transition has happened. Inferring structure from the resulting critical weights is another matter, and it is not yet clear how to make that scalable.
SGD appears to have a bias towards particular points of a level set, a phenomenon which has not been written down anywhere. A one-page calculation suggests it is likely true, though this is not a proof, and there is supporting evidence from toy models. The existing literature largely assumes differentiability and uses local curvature of the objective, which is not valid for the optimal parameter set or for the objective function in this setting. Some papers acknowledge that the optimum is a manifold and study the dynamics near points of that manifold, but not near singularities.
In order to understand the structure of the final network, the relevant change in structure at each critical weight must be inferred by analysing the model near it. This process is analogous to a spectroscopic probe in solid state physics, which infers the band structure of materials, and mathematically it involves deformation of singularities. It cannot amount to learning the full structure of the critical weight at every step, since at some point that becomes roughly as hard as solving the final problem; the hope is that it suffices to learn the transformation between successive structures.
The speaker is unsure whether the transformation between two successive structures can really be learned. They suggest a "locality hypothesis": the directions in weight space affected by a phase transition are relatively few, which would imply that the required calculations can be done in less time than training the whole network. This is a hypothesis of modularity, but the modules do not have to coincide with the architecture; it is rather a sense of locality implicit in the architecture and the way the neurons are connected.
The locality hypothesis suggests that each phase transition involves a bounded amount of change. This is visualised using two level sets: as a function is deformed, the geometry of a level set becomes more or less complicated as components merge or split. To understand a complex singularity germ, one perturbs it and sees whether it breaks into simpler pieces. This plays out in the progression from W1* through intermediate critical weights to W*, where each step involves a critical weight and an explanatory structure.
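A concrete instance of "perturb a singularity and watch it break into simpler pieces" (a standard singularity-theory example, supplied here for illustration): the degenerate critical point of f(x) = x^3 splits into two nondegenerate (Morse) critical points under the deformation f_t(x) = x^3 - t x.

    import sympy as sp

    x = sp.Symbol("x", real=True)
    t = sp.Symbol("t", positive=True)
    f = x**3 - t * x  # deformation of the cusp germ x^3

    crit = sp.solve(sp.diff(f, x), x)  # critical points: x = +/- sqrt(t/3)
    hess = [sp.diff(f, x, 2).subs(x, c) for c in crit]

    print(crit)  # two distinct critical points for t > 0
    print(hess)  # nonzero second derivatives: both are nondegenerate

    # at t = 0 there is a single degenerate critical point at the origin;
    # for t > 0 it splits into a max/min pair, exposing the "simpler
    # pieces" hidden inside the original singularity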
In chemical synthesis one must track the constituent parts from the start to build a large molecule, because the parts are simpler than the thing being constructed, like bricks in a house. When training a large language model, however, the number of possible ingredients is presumably not as large as the number of weights, which suggests deploying a team of probes to track any interesting structures as they emerge and watch how they combine.
Here "structure" is interpreted as something related to singularities that is useful for reasoning about and interpreting a neural network; it need not be directly interpretable itself, but is a starting point. Examples of basic tricks that neural networks perform, such as induction heads, can be put in correspondence with human-comprehensible concepts and logical rules, and human-comprehensible observables and evaluations can be associated with such geometric structures.
The universality hypothesis suggests that large neural networks trained on different data sets discover similar structures. If true, the structures found by this program could be related to performance and would have a symbolic discreteness to them. Since such structures need not respect the obvious divisions of the network, for example into heads, investigating the relationship could also help explain why the structures discovered by neural networks don't always make human sense.
Machine learning is viewed as a process of accumulating changes that trace out a construction, the construction being a kind of program encoded in the trained weights. From it one can hope to infer symbolic structures and the transformations that assemble them. When scaling up, however, the interpretability story may not survive unchanged; to be robust across training and scale it is important to stick as closely as possible to concepts, such as phase transitions and universality classes, that are expected to persist.
Structures found from singularities in deep learning may be difficult to interpret in human terms, and interpreting them may require a lot of human supervision. The relevant notion of structure is not necessarily the same as current notions of computational programs or algorithms; the enterprise is more like materials engineering, where properties of materials are inferred by probing. The structure-finding process can be iterated to refine the inference and increase resolution, but there is a trade-off between time spent and resolution achieved.
A paper was discussed which took BERT, a Transformer model, and compressed it to a fraction of its original size while keeping most of its performance; this could make it easier to apply interpretability techniques. Distillation and structure inference could be combined to create a grid of representations, from the most complex form (the raw weights) at the top to a human-interpretable form at the bottom, with smoothly varying versions in between. Gradient descent against a feature activation, optimising an input to maximally activate a chosen node, was used in the very first interpretability papers on ImageNet models to determine the kind of image that most activates that node.
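A minimal sketch of that last technique, gradient ascent on the input to maximise a unit's activation (standard feature visualization; the model, layer, and channel here are placeholder choices of mine):

    import torch
    import torchvision.models as models

    model = models.resnet18(weights="IMAGENET1K_V1").eval()

    activation = {}
    def hook(_, __, out):
        activation["feat"] = out

    # probe an arbitrary intermediate layer (a placeholder choice)
    model.layer3.register_forward_hook(hook)

    img = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=0.05)

    for _ in range(200):
        opt.zero_grad()
        model(img)
        # maximise the mean activation of channel 7 (negated: Adam minimises)
        loss = -activation["feat"][0, 7].mean()
        loss.backward()
        opt.step()

    # `img` now approximates an input that most activates the chosen unit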

Raw Transcript


so this is Talk number two on the topic of SLT for alignment in some sense independent of the first talk which you could view as just setting the scene and providing some background so I'm not going to speak at all in this seminar really about um what alignment is uh or what are some of the what is some of the prehistory of the idea that you might be able to interpret what's going on inside neural networks that was covered in the first talk I'll just say briefly that AI alignment is the problem of getting an AI system and potentially more intelligent and more capable than any individual human or all of human civilization together to do what we want again I want to stress that this isn't the problem of solving philosophy right it's not the question necessarily of figuring out what we should want or what God wants us to want and then putting it into a machine it's just more fundamental problem of even if we knew the answer to that or or some other thing we want them to do how could we get them to actually um do that if you're concerned with alignment it seems robustly useful to be able to understand the structure of knowledge and computations that are taking place inside these systems and that's where we are going to enter the scene it isn't sufficient for alignment I don't think maybe not necessary but it seems to me that it's hard to imagine a really deep solution to this problem that scales that doesn't involve something like either having an understanding that humans can follow or having systems that are able to understand other systems in some appropriate sense so it seems robustly useful to understand the structure of knowledge and I invite disagreement on everything I'm going to say today I care about this very much I think it's important maybe there are a few things more important uh and I don't want to waste my time so you think this is all tell me my ego isn't the issue here it's uh to be frank the survival of my child not my ego that I'm concerned with out of that um yeah I things are vastly useful I think I agree with that but um I just I wonder how you think about the example of um the status quo with humans um in charge um and our institutions some of which we don't understand um in this robust way um yet things are going as well as they're going so yeah I wonder how this fits into your picture yeah I think that's fair um I think the current status quo I would describe as a kind of equilibrium of relatively equally positioned agents in terms of their capabilities there's a
distribution of intelligence and capabilities and competence but it is a distribution and there's you know the standard deviation isn't you know huge So within that system of in that equilibrium we certainly have inequalities and many individuals on Earth are basically in a position where they have no control over their lives something like what we would be afraid of um if super intelligence was to was to take over things but it's maybe a matter of degree becoming a matter of kind right so I think you could say that the most oppressed humans on Earth historically are they really in a completely different position to to gorillas hiding out in the mountains from Human civilization waiting for the day when their particular mountain is declared to be useful for something maybe not so much um but I'd I would settle for kind of getting it within Striking Distance of the current situation I would agree with that I don't I think just being somewhat in the game would be already a lot uh so maybe I'm thinking more like the relationship between gorillas and people than the relationship between the poorest human and the most powerful human and when you talk about institutions yeah it's true there isn't necessarily a lot of interpretability but there's enough right I mean we do teach children how to integrate into society and if you choose to and have the capabilities you can get along most of the time unless you're being you know unless you've been colonized by someone and that all roots to survival are sort of cut off to some degree you can understand the system and succeed within it I think it's not completely uninterpretable yeah okay and I mean again I'll say that I don't I do think that more interpretability would probably be helpful I would agree though that it isn't yeah everything I'm about to say today I am not sure I actually believe a it will work and B even if it works then it is sufficient this is the best I can think of right but I very much agree that we might be in a position where there's just this really no easily foreseeable or believable way uh to maintain control um so I I as I said last time I think of this I'm not necessarily very pessimistic about it partly because it wouldn't be useful to be that way um but I'm putting this forward as uh in the spirit of if we get lucky maybe it's in this direction if that makes sense okay so it seems robustly useful to understand the structure of knowledge and computation within neural networks it's not clear that that's possible you
could say that well you could say that about the brain and we seem to be not much closer to understanding what the are there any deep mathematical objects within the brain that explain our Behavior like little symbols lurking there that are running programs inside our heads uh maybe right some serious people think so one of them spoke at our Festival earlier this year uh I don't think it's ridiculous am I looking for programs in the brain essentially when you go looking for structure of knowledge and computations in even bigger well not bigger even more uh less familiar systems perhaps um yeah maybe it's it's a bit analogous but I wanted I want to maybe really draw a distinction between looking for circuits or programs or whatever other analogy you might have with modern ways of talking about algorithms looking for those inside these systems and what I'm going to talk about which is not it's it's got some similar Vibe but it's not necessarily the same thing so that's one distinction I want to draw okay um even if you could do this even if you could understand what these networks are doing it doesn't necessarily mean you can get them to do what you want as I said but maybe that's helpful and understand here needs you know maybe there's many ingredients there do I mean that there's a 100 page textbook that tells me exactly how GPT-5 is Computing and if I just study it then I'll understand probably not do I mean that there's a textbook and if I read it and I use a system that together we can maybe understand a system which together with another system can understand the system which can understand a system which can understand GPT-5 maybe that's getting closer but I don't really have a clear idea of how that chain would work I think it's many people have thought along those lines and so I'll sort of leave that for future discussion maybe or you can suggest something about it later but I want to jump into the there's a document online I'm not just going to I'm not going to just read it I'm sort of going to represent some of the ideas along the following lines of emphasizing uh I think it's five sort of five principles or assumptions that are sort of behind my thinking on this topic goes without saying given the title that broadly this is about SLT that's one ingredient in this story but it's sort of informed by a few different bits and pieces the first principle here I spoke about last time we're familiar with that uh from these seminars so I'm going to State this as an
assumption so we assume that the structure of knowledge and I'll try as I go to point out reasons why I think these things might be false in addition to them being reasons they might be true the structure of knowledge and computations that we're looking for is reflected in the structure of singularities Okay so so in some sense it's a fact that as Watanabe says in his introduction the knowledge to be gained from data is contained in a singularity but we have to be careful not to over read that statement so here's a picture of the fact the fact is uh if you take a model and the model is uh put in our usual framework and w0 is the set of true parameters and this here is the deepest singularity then in some sense well any true parameter is true of but in some sense the actual numbers in that true parameter well they compute a model of the truth right so those numbers that's the knowledge you could say right the numbers or you could say that the knowledge is the model because you can use that to make predictions um we know that if you're making predictions for the Bayesian posterior don't just need that number you need a lot more but this is the kind of knowledge that we're talking about the coordinates of a true parameter um okay but that doesn't necessarily imply that so knowledge about data or about the generating process behind that data more precisely we imagine often that the generating processes that make the world the way it is are highly structured right the laws of physics are highly structured many of the phenomena we're studying are highly structured and that's part of the reason why we prefer certain kinds of models and we have Occam's razor and it doesn't refer directly to structure but we can kind of see some correspondence between some of our principles for model selection and the preference for certain kinds of structure over others etc etc but it's not clear given the current state of SLT at least and I'm not aware of any other domain of knowledge which will make this explicit that there is some definition of structure for which structure in the truth whatever that would mean can be really lined up with the structure in this case of a singularity now singularities can have structure uh and Singularity Theory at least in its modern Incarnation is largely about trying to say what you mean by the structure of singularities and there's a few different ways we've talked about some of them for example you can talk about the configuration of divisors and the
exceptional divisor of a resolution you can use category Theory something called the category of singularities which uh the structure of the indecomposable objects of that will sort of correspond geometrically to the atomic pieces that are sort of inside the singularity you can use group representations uh if your Singularity comes from quotienting by a continuous group uh by a group so if your Singularity comes from symmetries then the indecomposable representations of that group have a lot to do with the structure of the singularity and it's uh it's called the McKay correspondence okay so there's lots of different ways a mathematician might answer the question how do we think about the internal structure of a singularity and the the first kind of assumption here is that something in there is useful in connecting with perhaps a less formal idea of what the structure of the data is or the generating process so that's the first assumption that we can draw some parallels between these two kinds of structure I don't assert that I know it's true okay in some examples it clearly is sensible in general I don't know um okay so that's point one any queries or critiques on that so um maybe just a comment a lot of the the ways that we study singularities in mathematics that you mentioned they tend to be applicable for very simple singularities right you know the McKay correspondence are you know probably the nicest of the nice right and like the derived category thing again right if if your singularity is complicated do people know what the structure of that thing looks like I my feeling is that the singularities we'll see here are quite a bit worse right oh yeah for sure in particular in particular mathematicians basically only understand these structural points of view well at least the categorical ones for isolated singularities which right most definitely uh w0 does not have yeah I think that's absolutely true um so I would not expect that we can really classify um so in much higher generality uh you know you might be able to say something structural about the category for example that it has indecomposables right that everything is built out of indecomposables but you won't be able to classify them no yeah I mean I mean classification program is really just we decide to look at these things that are finite or countable or whatever right like if we're forced to study a certain class of singularities I think most of us well I guess I don't speak for everybody
but I probably would be super stuck right yeah that's right so I'm putting these categories up not so much I mean right now I don't really see a plausible path towards applying those things directly to understanding any non-trivial neural network um but uh it's more in the vein of indicating that mathematically we have some idea what it means to talk about internal structure of singularities and there is often something kind of discrete and combinatorial about it whether that's the exceptional divisor or these categorical examples although the rest of the problem kind of looks like it's continuous there's a hidden discreteness sometimes when we look at these internal structures and I think that's an intuition that you and I have that I think is kind of worthwhile here although I would put an asterisk on it for the same reasons you are yeah statement about knowledge to be gained is related to singularity this this is because the set of w naught contains Singularity it's just my understanding but what if like you have to say a ball of radius Epsilon around the point w naught I guess all the points in that are still into my interpretation still getting some knowledge because they're getting closer to the true parameter but does that set actually contain like honest singularities well if you just have a parameter that's close to a true parameter I mean it depends what you mean by Singularity right so you if you took a point that's not on w0 you would know pretty much the true distribution I guess uh you suppose you took it like within the you know extremely close within the floating Point uh resolution or something like you know you essentially know the true parameter but you don't have a singularity that's true but we know that the Singularity is both a point like a highly localized object but also not right so Singularity is something that has an effect far away on like the level sets of the function or the behavior of trajectories or the topology of the loss surface and all those effects can be felt kind of far away in some sense sort of globally uh so Singularity is not only a local object it also has a global effect so uh in that sense you know the the behavior of the system near The Singularity will will also contain a lot of information about The Singularity and that's ultimately why it's plausible to actually infer something right I mean when we estimate RLCTs it's not necessarily our Markov chains aren't sitting at the singularity right they're
bouncing around somewhere nearby but there's enough information about the singularity in its neighborhood to infer for example the rlct which is part of the structure of the singularity and part of the idea of this program I'm outlining is that we can hope to learn more than just things like rlcts and there's like other kinds of structure we might hope to infer but we haven't really thought about yeah that makes sense especially like having the area around the singularity affecting it what place to make yourself have intuition about how it could relate to knowledge as you approach laying something I have a comment on uh on this which is that um uh sorry then say if we have something zero on a true parameter then we have the true distribution but I guess one interpretation is that we don't care about the true distribution we we want to write that the the knowledge part of that sentence is not about being able to just um compute the posterior predictive or something like that or just having the the true model um it's somehow about the discrete structure of of of the model so the knowledge is more about the model than it is about the truth that's kind of my interpretation and second perhaps more important point is that when we talk about uh and and perhaps more illustrative of the fact that knowledge is about more about the model than it is about the truth or more about what the model know about the truth is that we could have a model that is not realizable and we often do have that which means that we are not talking about the set of true parameters we are talking about the optimum or some some other level sets um in in the model uh all right so number two structure is constructed over training we assume that the structure of the final trained weight parameter is a function of the structures present in a special intermediate set of special intermediate States which I'm going to call critical weights and Transformations between these structures so I want to say that when I say structure say in knowledge or in singularities I partly mean that in the sense of data and partly in the sense of program right I mean if you think about the final learned weight of a transformer is it is it data well part of it is data right it remembers Euclid that's data it's not like a program but part of the information in those ways is also clearly computational and algorithmic right how to determine given a sentence what the verb is or whatever all right so to a Transformer there isn't a
fundamental distinction between data and programs and that's sort of consistent with the way computer science has been thinking about this for a long time so it's not really a surprise if you come from a theoretical computer science background that these things would be the same thing uh but I think if you're not coming from that kind of background and have a more colloquial view of what information is you probably think of a fundamental distinction between the fifth word on the sixth page of Euclid and an algorithm for adding integers right those seem like different kinds of information but you should not think of them that way and that's the sense in which I mean knowledge it can have this kind of static form and also a dynamic form okay number two structure is constructed over training I do not believe that you're going to take the billions or trillions of parameters in a trained model and somehow do anything to them and get a sufficient understanding of what knowledge is in there to actually be able to align it I find that kind of fantastic um foreign just the general principle of you can understand what you can build something like that you also wouldn't really think about understanding a program that way if it was large or at least you could try but it isn't usually how we think about say certifying something as safe just interacting with it and at some surface level okay so if that's not plausible how would we ever understand a very large trained neural network well you could hope that the structure that is present in the final weight w star is gradually accreting that's obviously a fact right it's trained by gradient descent so it's acquired gradually uh that doesn't mean that necessarily to understand the final structure the structure of the knowledge in the final weight that you can do it by somehow accumulating understanding over the course of training that's why it's called an assumption right so I'm assuming that that's the case at least to a meaningful degree that is that there are intermediate states of the system training weights W1 star through WN star such that if you were to let me do a bit of a cartoon so maybe I'll let W be a weight and I'm going to write SW for the structure whatever that is I'm not saying what what I mean by structure right that is an undefined term but it hopefully it acquires some meaning sometime soon and you'll sort of see the flavor of what I mean but anyway it's sort of undefined for now maybe it's one of the things on the previous
board in in some sense um okay and I'm going to draw sort of pictures that aren't meant to literally illustrate what kind of structures I mean but they're they're hopefully evocative okay so from left to right I'm proceeding from the beginning to the end of training and at the end of training and drawing some sort of complicated geometric form and the task is to understand that now we would be lucky if I mean at the beginning of training there's very little information in the weights maybe not so much structure and maybe it's easy to infer what is there uh Suppose there was some collection of Weights over the course of training that you could infer the structure of them and also that somehow the steps were close enough that the relationship between the structures was also something you could understand then you could understand the final constructed object as being constructed from the structures and intermediate steps that's what I mean by a function right I mean that this thing here is a function of these and you if you know those you don't so much have to know the actual weights right so you can make it just the function of the structures rather than off the weights so you don't have to go from here to here you can kind of go from here to here I don't literally believe that's completely true but uh maybe there's enough structure and this is playing an important role here as well so um it's going to be underemphasized maybe in this talk but I think it seems probably important not only to be able to understand the structure but also what well there will be phase transitions once I get to point three that's these blue arrows here how they change the structure so it's a story of structures and transformations of structures um if that were true then it seems like you might have a path to understanding the structure of knowledge in the final trained weight those of you coming from computer science will recognize that you know there's a there's a large dose of the kind of Curry Howard point of view here right what you think of programs as being constructed from simple bits and the construction itself is the program that's sort of the analogy here so three is about phase transitions but maybe I'll pause here yeah yeah um can't find the right analogy if you think of I'm like I find this quite compelling but I'm thinking of like a a failure mode um What If instead of W star constructing piece by piece layering on the um the structure which is like some I guess the analogy is to like
building a paper mache um sculpture or something where you kind of you 3D printing like you incrementally build the thing what if it's more like chipping away at a slab of stone um so you start with a mess and then you remove stuff and you end up with structure remaining but at no point was what you had um any simpler than what you ended up with um until it was you know literally there so that like the sculptor is you know they have the picture in their mind and they're kind of working towards that um so yeah I guess that would be that would be my my picture that is different from this for which this doesn't look so easy where you sort of go in the opposite direction if you like um of these diagrams yeah I think that's a good point I think I would have basically thought that was the case before seeing uh the nature of all these phase transitions that we see during actual training so um I mean if you're if you're sculpting a statue the image is in the sculpture's mind the sculptor's mind and before they're finished maybe you don't see anything right maybe you see an arm poking out but it's not like there's really like somehow the I think the difference there is that there's a division between the the stone the marble and the sculptor's mind right uh and the image is in the sculptor's mind at least in part right they have some idea of what they're doing and then they they acted out on the marble but there is no such division inside the network right it's it uh so I think the so what I'm saying is that under that view it seems hard to explain why if you measure the performance of a network on many different kinds of tasks if you're training a general model like GPT why it would undergo phase transitions for certain tasks at certain times and why it wouldn't I mean that that makes that suggests that this kind of structures that are coming online independently of other kinds of structures and not that everything is just kind of slowly becoming clear all at once you know what I mean yeah yeah okay but I agree that it's I mean it's impossible to rule out of course we don't we don't know um but it's I wouldn't be putting this whole story forward if there wasn't this uh this picture of these phase transitions that seem to indicate comprehension of various pieces which you might possibly call you know conceptual ingredients of of larger problems um if that picture was missing empirically I think I wouldn't find it very credible that there was something like the picture I'm putting up here
can phase transitions happen when I know there's like a there's a picture in my head about two um minima on a graph um and the graph slowly changes so that one of the minima they're local minima and one of the minima becomes global and then it sort of suddenly switches to the other one um I'm wondering about the lottery ticket hypothesis which might give you a picture something like the the structure is in there and it's kind of what training does is it kind of literally um chips away at everything that isn't the structure so that it all either cancels out or or is um taken to zero and and then maybe when like certain components are not created when you see those phase transitions but they they suddenly become the determining factor like they're unlocked right and but they were already there still and so does that make sense so you get these phase transitions yeah I follow um I think I don't I can say what's in my head I don't think this is like a coherent story but uh mathematically there's there's something here I think so we're used to talking about entropy of structures right um but you can talk about entropy of constructions so here's the final trained weight right and maybe here's a path that kind of goes by uh something like a construction something like the previous board and here's here's the the sculpture the uh the kind of approach we're talking about where you just chip away at bits here and there and somehow it all comes together at the end um yeah I wouldn't know how to prove it but it seems pretty plausible to me that the construction is something like the the lowest uh like the simplest simplest path um that right maybe some kind of principle of model construction selection or something where uh you prefer simpler explanations I mean the explanation I mean it kind of makes sense right I mean the the sculpting thing you have to somehow coordinate this activity across many many steps and so by getting it right it's like something you need a lot of information whereas the the idea of just getting individual pieces of the story and then somehow them coming together has a feeling of like an Occam's razor kind of principle I'm imagining something like a homing missile versus a ballistic missile um so with the sculpting you have to have very good aim from the very beginning but if you have a missile that sort of steers itself it gets multiple chances to course correct along the way yeah maybe but I would agree that it's very unclear I mean what I'm outlining on this board
is this one step away from speculation I don't have any strong reason to believe that other questions or comments about this one okay then I'll move on to three critical weights okay so the question left open by the previous thing is well okay even if that's true how you're supposed to come up with these critical weights now of course it's not precisely defined what a critical weight is but the sequence of critical weights is subject to a sufficiency condition which is a sequence of weights will be sufficient if the structures associated to them are enough that you can construct from them and the Transformations between them the structure of the final trained weight um do you follow what I mean so it's not like you can tell what is a critical weight but the sequence of critical weights should be enough to infer the final structure and this is fundamentally an inference problem right this is not you get the answer precisely right but it's sort of can you infer enough about the structure to for example be you know really confident that paper clips are not on its mind we assume that the important changes I thought there was a paper I posted in the Discord a few days ago which I thought had a very good way of putting it that I wish I'd thought of which is sort of assuming that the capabilities that are required by the network are quantized that they come in chunks that's of course basically what I'm talking about right uh we assume that the important changes in structure are associated with phase transitions for example of the Bayesian posterior but I don't care really that's the the thing which it seems for theoretical reasons you would expect to have uh phase transitions but whatever it could be some other probability distribution just something that you can tell that it happened okay so we assume that if they're associated with phase transitions uh just to say briefly so a phase transition in our situation so if we're talking about the Bayesian posterior you imagine some kind of parameter maybe it's the number of samples maybe it's some other parameter like sparsity which causes the Bayesian posterior at some value mu so let me call it mu C so for values of mu less than some critical value the probability distribution is over here and for values greater than mu C it's over there so there's a sort of discontinuous jump of some probability distribution which is parameterized by some parameter mu that's an example of a phase transition in Bayesian learning uh well you know
it's not the time to talk in detail about what phases and phase transitions necessarily are but I hope that helps the point is that these things are detectable cheaply by looking for something like generalized susceptibilities so this is a general term in thermodynamics for quantities derived from the fundamental quantities like energy and entropy by taking derivatives uh the point is that these are quantities that you can measure in the lab which diverge at Phase Transitions and that's their purpose right there's some like thing you can go build a machine to measure and then when the machine goes beep beep because the number is you know getting very large then you are pretty confident that there's a phase transition happening in your system even if you can't really see what's going on and that's it's important to understand that this is why phase transitions are so visible in nature and why we organize physical theories around them and why we use measurements to do with phase transitions to infer microscopic structure that we can't see directly this is a sort of very broad principle in physics across condensed matter physics and many other areas of how you probe the subatomic not subatomic but how you probe the microscopic world okay so if these changes in structure which we should accumulate to understand the final weight if these are associated with phase transitions then we're sort of extremely lucky because we expect those changes to be detectable even at large scale with very large models and with lots of data Etc we expect to be able to notice when this happens for sort of General reasons okay so I'm drawing a bit of a plot of training I haven't really stressed that enough I guess um so the idea is that you would be looking during training that's the x-axis on the previous plot as well right um you would be looking during training you'd be measuring all sorts of things now this observable quantity here could be the overall loss but for example with GPT you do not see phase transitions in the overall loss I guess I should explain them in a phase transition I just said there's a rapid change in something like the Bayesian posterior that isn't really relevant to the loss but in this case uh you would think about the neighborhood in the kind of model that is being represented by the parameter changing in a very small number of steps relative to the usual rate of change so that is the there's a kind of General picture of learning with gradient descent where
often you're kind of reorganizing the way the network is Computing but then over a short period it sort of comes online uh with some new capability and referring back to Matt's earlier comment often that new capability and actually this is spelled out pretty clearly in the induction heads paper is sort of coming online during the plateau and there's sort of an idea that it just sort of gets hooked up at this point and then really starts being active and that drops the loss so I don't know how solidly to take that but that's sort of an idea that you might have in your head anyway it's relevant what do you mean by coming online here well it might be Computing something but actually not contributing to the prediction at the end right so you might have a bit of the network that's Computing the relevant thing but it's just not used whereas it wasn't even Computing that before that's right yeah yeah that's weird um and then it kind of turns it on and that rapidly uh drops the loss sorry I'm just a little confused they're like gradient descent you need to contribute to the loss before you um can change right so how does it is it contributing to the loss like a little bit that's enough for the gradient to just sneak through and then it realizes that like it's a cascading you know fall down hill at that point but it's just like the ball is starting to roll very slowly like at the start there or well I think we shouldn't underestimate the the bias towards lowering entropy that might be present due to the to the noise right um so I I don't know that this is literally an explanation but if you if at gunpoint you force me to give an SLT explanation for what I just said I would say the following um okay so here's here's some level set of the loss and you are gradient descent and you start here so gradient step gradient step you're just sort of drifting around right um but actually not all points of the loss are created equally for the same reason that the Bayesian posterior prefers the deepest Singularity the sort of good reason we don't completely understand it but you can do kind of little calculations and see that it's pretty plausible that the drift will not treat these points equally it will actually kind of end up drifting around here preferentially now what does that look like from the point of view of the weights well it looks like I'm not actually decreasing the loss right I'm just reorganizing the way I compute that loss and in what way am I reorganizing well I'm minimizing the
entropy in some sense I am simplifying my solution so that's that's due to the noise though so that's it's subtle I don't think you can understand it there's a big difference I think between gradient descent and stochastic gradient descent when it comes to that story does that make sense yes that makes good sense thanks I don't know if that's true we should try and find out but that would be the explanation I would give okay so during training we're looking at observables um as I said the overall loss say for GPT is not a fine enough grained thing to see phase transitions um so you could look at kind of subsets of that training set I'll call that a sub loss you could look at evaluations so this is an emerging tool for the form of alignment that we have for example with GPT-4 uh OpenAI has open-sourced the library of evaluations that you can submit to where you you know you test for the presence of various behaviors so there's other papers and we've looked at in the AI safety seminar in similar kinds of veins where you're looking for certain behaviors but you can imagine a very large and growing collection of observables I'll just call them in general which might be targeted enough that they actually undergo phase transitions so you would expect that well for example maybe you have a sub loss testing arithmetic and that definitely does undergo some kind of phase transition at some scale I don't know about particular point during training so that's a question mark okay so if you have some observable and you look for phase transitions in those observables and then you kind of Mark the points where those phase transitions happen as being interesting and that's a candidate for one of these critical weights okay so next I'm going to talk about um actually inferring structure from these critical weights uh maybe I haven't said enough about cheap here I mean that seems like the obvious criticism right like okay yeah fancy Singularity Theory SLT you guys are struggling with you know 40 parameters um so it's not clear that you can make any of this scalable uh but that's the reason to organize this around phase transitions for the reason I said earlier around we have good reason to think that phase transitions first of all exist second of all seem to have to do with capabilities third we have good reason to think phase transitions are easy to detect uh and easy means cheaply that is you don't need to be estimating RLCTs to know that a phase transition happened okay so that's kind of the argument for
why this might not be prohibitively computationally expensive okay I'll pause here before I do four Rowan had a question in the chat any suggestions the way to read more about Dan's answered and that's question so that was about when you were with the picture on the on the right board ah yeah um no you should write a paper I don't I I don't think anybody has has noticed this uh potential explanation I should say this is uh I have no idea if this is my idea or not I've talked about this kind of thing a lot with Edmund and the others so I we've certainly talked about it for a while so just go on So You is it your hypothesis that that this is the behavior that occurs or you you know this Behavior occurs it's just not written down anywhere uh the bias towards on the level set due to SGD I mean yeah yeah yeah uh I could show you a one-page calculation that kind of makes me feel like it's pretty obvious uh I wouldn't say that's a proof because it's got you know it's it's still just a kind of vague argument um so I wouldn't I wouldn't say that we're certain of that no but I think it's pretty clearly true okay um there's enough concrete calculations floating around for what these level sets look like in toy models that you might be able to get a more concrete sketch of it for a toy model yeah that's right yeah in uh expenses in my thesis for example calculating the various things also um I think people obviously there's a lot of literature on what the what the gradient noise does but um that's not necessarily specifically arriving at this finding or hypothesis yeah that's just that's that's well studied in the literature what would you say Edmund I don't think there's anything that goes in this direction in the existing literature is there yeah I was about to say that that's not a good characterization because all existing literature basically assume that um uh well you assume that things are differentiable and the derivative of the second derivative definition structure reflects their local curvature and things like that and we know that is not true for even for um well we know that it is not true for the true parameters set or the optimal parallel set we kind of also know that it is not true for the objective function itself at its argument at the objective function being a data company loss function um sorry they are um papers that goes that sort of acknowledge that the optimum uh is well not a point or and it's a it's a manifold and study them near a new point of the manifold
Yep, thanks. All right, moving on. Okay, so suppose all of that is true. You've found your critical weights, and if you could figure out their structure, and the earlier hypothesis is true, then you could understand the structure of the final network; but you have to be able to actually infer the structure of those critical weights. Isn't that just begging the question? Why would it be easier to understand w_i* than w*? Okay, so let me assume that, given a critical weight w_i*, the relevant change in structure to w_{i+1}* can be inferred by analysis of the model near the two of them. This harks back to the ideas in the two talks on solid-state physics; we call such a device a spectroscopic probe, by analogy with the devices that, in solid-state physics, infer the band structure of materials. Now, it's not a perfect analogy, because you don't typically construct materials by varying some parameter through a process like training. But sometimes you do: think about steel, or the magic angle in twisted bilayer graphene. Actually, that example of twisted bilayer graphene is doing a lot of work for me here, because we know that as we tune the control parameter, the twist angle, towards its critical value, the nature of the divergences there really does have a lot to do with the structure of the singularity, and even with the group-representation-theoretic story as well. So that's a pretty good analogy, I think.

Okay, so here's the kind of picture, building off the earlier ones. I'm drawing pictures where stuff is added; I don't mean to suggest that's the only kind of structural change, and it's a very poor analogy in some sense. What I actually have in mind is more like deformation of singularities, to be exact. So suppose the change was attaching this piece; what I'm suggesting is to learn that change. This is w_i*, and this gadget here is its structure; this is w_{i+1}*, and this gadget here is its structure. It would be begging the question if I just said "learn the structure of w_i*" at every step along the way: if you can do that, you can just look at w* and do it there, and at some point actually inferring this structure becomes roughly as hard as solving the final problem. So it can't be that you're just going to directly infer it; that's not a sensible suggestion.
It has to be that, at least at some point, and to a significant degree which grows over the course of training, it suffices to learn the change, the transformation between these two structures. I really have no idea why that should be true; I hope to actually show it is the case in the toy-models paper. I think it's not crazy, but for me this part is not far from wishful thinking: I'm pretty unsure about it relative to the rest. The rest seems pretty plausible to me; this I'm more uncertain about. It's working backwards: if this problem is to be solved in some deep fashion along these lines, this has to be true somehow, right? So I think we should hope it's true and we should look for it; I really don't know.

Regarding scalability, this step probably depends on something I'm calling the locality hypothesis. This is the step where I'm talking about having some sort of device which looks at a weight: the training process spits out a checkpoint, and then you run the machine on the network near that checkpoint, doing some kind of Markov chain Monte Carlo, or a high-performance scalable approximation of it, or whatever; various members of this audience are working on various ideas in this direction. You do something, and you infer from those measurements the structure, or the change in structure. Okay, why should you believe that such calculations can be done in less time than it takes to just finish training the whole network? Well, if the change involved only affects a relatively small proportion of the directions in weight space, then it doesn't seem so crazy to imagine we can actually do this. I don't know that that's true either; I think there's some evidence for it that we could talk about some other time, but this is the hypothesis: it's possible to choose the critical weights such that each phase transition affects a relatively small number of directions in weight space, not billions. This is a kind of modularity hypothesis, though not necessarily modularity in the landscape of the network, like "this head does this and that head does that", or "this layer does this and that layer does that". It doesn't have to decompose the way the architecture does; the sense of locality implicit in the architecture, in the way the neurons are connected, is not necessarily the sense in which this is local. But it is a hypothesis that these individual phase transitions involve a relatively bounded amount of change.
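One hedged sketch of such a probe, assuming an SGLD-style approximation to the MCMC just mentioned (the estimator form, the localising term, and all hyperparameters are illustrative assumptions, not a method stated in the talk):

```python
import numpy as np

def sgld_probe(loss, grad, w_star, n, gamma=100.0, eps=1e-5, steps=5_000, seed=0):
    """Sample near a checkpoint w_star and summarise the local geometry.

    Runs SGLD on the localised, tempered potential
        U(w) = n * beta * loss(w) + (gamma / 2) * ||w - w_star||^2
    and returns hat_lambda = n * beta * (E[loss(w)] - loss(w_star)).
    The absolute value depends heavily on gamma and eps; the intent here is
    only to compare checkpoints, smaller meaning more degenerate/singular.
    """
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)                       # a common inverse-temperature choice
    w = w_star.astype(float).copy()
    traj = []
    for _ in range(steps):
        dU = n * beta * grad(w) + gamma * (w - w_star)
        w = w - 0.5 * eps * dU + np.sqrt(eps) * rng.normal(size=w.shape)
        traj.append(loss(w))
    burn = steps // 2                            # discard the first half as burn-in
    return n * beta * (np.mean(traj[burn:]) - loss(w_star))

# Toy usage on L(w) = (w1 * w2)^2, whose zero set is most singular at the origin:
L = lambda w: (w[0] * w[1]) ** 2
dL = lambda w: np.array([2 * w[0] * w[1] ** 2, 2 * w[0] ** 2 * w[1]])
print(sgld_probe(L, dL, w_star=np.zeros(2), n=1_000))            # small: very singular
print(sgld_probe(L, dL, w_star=np.array([0.0, 2.0]), n=1_000))   # larger: less singular
```

If the locality hypothesis holds, one could hope to run such a chain in the handful of directions a given transition actually involves, which is where the cheapness would come from.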
Okay, I do want to draw one more diagram before I open up for comments and questions on this one. In my head, the picture that corresponds to that cartoon is two level sets. This is what deformation of singularities looks like: you deform a function, and its geometry might become more or less complicated, with components merging or splitting. This is one way in which the structure of a variety, or a singularity, can change in one place while elsewhere it is unchanged; that's the point I want you to take away. So this point here becomes quite a complex singularity, while this one here is unchanged, and when I say "locality hypothesis" I want you to have in mind that something like this is happening: potentially even at a single position, in some dimensions you pick up additional components or additional structure, while the other dimensions are not involved. Okay, questions or comments about this one?

In this example, it becomes important to track both of the structures that merge together beforehand; is that what you have in mind with the picture of the cube and the pyramid? Yeah, it's a good question. I haven't thought enough about it, but probably that's right: from the geometric point of view, the cube and that little tetrahedron would have to be pre-existing things that somehow merged; that would be a picture of a second-order phase transition, so you'd have to be tracking both. I think what you would probably do is find a critical weight, say w_{i+1}, and then play time backwards a little, or look in the neighbourhood for an explanatory structure that merged to form this one. That's actually one of the sections in Gilmore's book on catastrophe theory that I was proposing to talk about in June: when you find a complex singular germ, the standard way to try to understand it is to perturb it a little and see if it breaks into simpler pieces. So that's what I have in mind.
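That perturbation move is easy to illustrate symbolically. A minimal sketch with a standard toy germ (my choice of example, not one from the talk): x^4 has a single degenerate critical point at the origin, and a small deformation breaks it into three nondegenerate, Morse critical points:

```python
import sympy as sp

x = sp.symbols("x")
eps = sp.Rational(1, 10)       # a small, arbitrary perturbation strength

f = x**4                       # singular germ: one degenerate critical point at 0
g = x**4 - eps * x**2          # a small deformation of the same germ

print(sp.solve(sp.diff(f, x), x))    # [0], and f''(0) = 0: degenerate
crit = sp.solve(sp.diff(g, x), x)
print(crit)                          # [0, -sqrt(5)/10, sqrt(5)/10]: three critical points
for c in crit:
    # the second derivative is nonzero at each: the complex critical point
    # has broken into simple (Morse) pieces under the deformation
    print(c, sp.diff(g, x, 2).subs(x, c))
```

Reading the deformation backwards is the "play time backwards a little" step: the simple pieces are candidate explanatory structures for the merged singularity.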
So if you track that back through the progression you showed a couple of boards ago, all the way from w_1* to w*, and you imagine fragmenting the final piece into its constituent parts, do you need to track them all from the start? As complex as that singularity, or that structure, is, it's going to have a lot of components in the early stages of training. No, I'm thinking more like chemical synthesis, where if you're building a huge molecule, you have the big molecule and you're putting bits on it, but the bits are much simpler than the thing you're constructing. They're like bricks: you have a house and you're adding bricks, but a brick is a brick; you don't have to know how the brick was built. In chemical synthesis, and I'm not familiar with the field, is it like adding individual molecules or atoms? Well, not always; you're right, a chemical reaction could combine two very complex things. Yeah, that's right; in that case you probably would need to track it back, I guess. But I suppose that comes down to the earlier discussion about the entropy of paths. We're in completely unknown territory here, so I have no intuition about which operations are more or less likely. If indeed, as you go along in training, the training process has been slowly bringing together two very complicated singularities, and then they meet, then yeah, you might have to track back the other one. I don't know; good question.

So I wonder if it's like you want to have not just one probe tracking one structure, but a bunch of probes, right from the start, that track any emerging interesting structures at the point where they can still parse them; and then, as those structures start to grow, you have a massive team of little probes watching these things, and then watching them merge. Yeah, I would guess so. This is actually the next point, which is about universality. The space of configurations of a large language model is astounding, but you might expect that the number of possible ingredients is not as exponentially large as the number of weights. I'm not sure; let me think about it, and I'll come back to that question after the next point, I think.
Okay, all right, I'm going to move to the next set of boards. Okay: structure is interpretable. I haven't said at all what "structure" means; I've drawn some pictures and vaguely said that it might be something to do with the structure of singularities, and the honest answer is, I don't know. But it doesn't do us much good if that structure is just as incomprehensible as the original weights. So we assume that the structure thereby inferred is such that these things are useful signs, in the sense of Leibniz, I guess, for reasoning about and interpreting the knowledge and computations in the network. That has a few parts, but the real point is that I don't really believe in studying a neural network and reading off a circuit, or reading off a program, or even inferring a circuit or a program. I'm proposing the structure we're talking about as something maybe underneath that. It isn't directly interpretable, perhaps, but it's the structure that's there, if all the assumptions are true, and it's a starting point for building up an understanding of the knowledge.

Okay, so it doesn't seem implausible that some of the structures can be put in correspondence with human-comprehensible things immediately: human-recognisable concepts, logical rules. I don't know the relevant paper well enough to know how much I should lean on this example, but this would be an example: we know some basic tricks that heads are doing. I talked about one in the precursor to this talk: syntax trees in large language models, end-of-sentence detectors, all these sorts of things. These are examples of, well, not structures exactly in the sense I've been defining, but if you were to look underneath those phase transitions and see what geometric structures are involved, then look for phase transitions in the performance which match phase transitions in the observables, and put the structure that emerges there in correspondence with the capability that emerges, you would have some sort of dictionary. That's what I'm suggesting. So, just to write that out: by tracking observables or evaluations, which should be human-comprehensible, you track, say, a sub-loss about arithmetic, or the ability to do in-context learning; then you associate that with induction heads, and then with some geometric structure.
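The bookkeeping for such a dictionary could start out as simply as the following sketch, in which every name and checkpoint number is invented for illustration:

```python
# All names and checkpoint numbers below are hypothetical.
capability_transitions = {
    "sub_loss/arithmetic": [12_000, 47_500],      # from tracked evaluations
    "eval/in_context_learning": [31_000],
}
probe_events = [11_800, 30_900, 47_200]           # checkpoints flagged by the probe
TOL = 500                                         # how close counts as "the same event"

dictionary = {
    (name, t): [p for p in probe_events if abs(p - t) <= TOL]
    for name, ts in capability_transitions.items()
    for t in ts
}
print(dictionary)
# Each human-readable capability transition is paired with the structural
# change the geometric probe detected at (roughly) the same checkpoint.
```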
Okay. A different line of investigation would be the relation to circuits, or to network structure in the sense of the structure of the architecture: for example, head one does this, head two does that. It's worth noting that, for the kind of structure I'm talking about, it's not obvious how it relates to the kind of structure we mean when we say "heads". The division into heads is a particular kind of structure which doesn't relate in an obvious way to the kind of structure I'm talking about; that would be interesting to understand, and it's what I'm getting at with the second bullet point. So you try to relate the structure you're inferring from these earlier steps to structure you understand either from evaluations or from other sources.

I want to note that both of these depend to some degree on something you might call the universality hypothesis. It's not a super-strong dependence, but I'll say it out loud. The universality hypothesis asserts that networks trained independently on the same data, and maybe even on different datasets at sufficient scale and so on, will discover similar structures. So in-context learning emerges on many different kinds of data; maybe induction heads too, I'm not sure. You can imagine that there may be a kind of basic soup of primitive forms that emerges, under some conditions, more or less agnostic to the precise form of the data or the precise model. There are stronger and weaker forms of that hypothesis. I don't think it has to be true in order for what I said earlier to be useful, but in the connection to the human-comprehension part it seems to be doing some heavy lifting. Because, okay, suppose all of it works: you get these structures, and you can correlate them to performance in some way, but the structures aren't organised according to human concepts in any recognisable way. There's something going on there, and it's clearly structured, maybe even in some sense symbolic, with some discreteness to it, but it's just not a kind of pattern that we can understand. It's less stuff than the original weights, but it doesn't map onto a division of the world into abstractions that make sense to us. Now, why would you expect anything else? Well, that's kind of what the universality hypothesis says: if you have a big neural network and a human being, they're in the same universality class, so they tend to acquire a scaffolding of concepts that's roughly analogous, and therefore, with some effort, you can map between them.
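One crude, hedged way to test the network-to-network part of this hypothesis (my own illustration; linear CKA is a standard representational-similarity measure, not something proposed in the talk) is to compare activations of independently trained networks on the same inputs:

```python
import numpy as np

def linear_cka(A, B):
    """Linear CKA (Kornblith et al., 2019) between two activation matrices.

    Rows are the same inputs fed to two networks; columns are features.
    Values near 1 indicate the representations match up to rotation/scale.
    """
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    num = np.linalg.norm(B.T @ A, "fro") ** 2
    return num / (np.linalg.norm(A.T @ A, "fro") * np.linalg.norm(B.T @ B, "fro"))

rng = np.random.default_rng(0)
# Pretend both "seeds" learned the same 16 underlying features, expressed in
# different bases, as a universality hypothesis would suggest:
shared = rng.normal(size=(256, 16))
acts_seed0 = shared @ rng.normal(size=(16, 64))
acts_seed1 = shared @ rng.normal(size=(16, 64))
unrelated = rng.normal(size=(256, 64))

print(linear_cka(acts_seed0, acts_seed1))  # high: shared underlying structure
print(linear_cka(acts_seed0, unrelated))   # much lower baseline
```

Of course this says nothing about the network-to-human half of the hypothesis, which is the part doing the heavy lifting above.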
If that were true, we might expect to just directly understand the answers we get out of this process. That could be true; I think we shouldn't rely on it. So I think there's probably an interpretation process on top of this inference: you infer the structures, and then you need subsequent steps of inference and machine learning to transform them into a form that is human-comprehensible, and maybe there are several layers of that.

Okay, and that brings us back to the original picture, so if you'll follow me back to the first slide. That's a picture of the machine observing the world and learning about it; the knowledge is encoded in singularities; spectroscopic probes and this process of accumulating changes discover a construction, the construction being the kind of encoded program in that final trained weight. You infer from that some kind of symbolic structure: which structures there are, how they fit together, and the transformations that assemble them. Maybe that's still too complicated to understand, so you run some other kind of process to understand it, and maybe you even need subsequent interpretability steps on that machine. I don't know; that's getting past the point where I can say anything useful, I think, but that's where I've gotten to. Okay, so I'll open up for questions; I just want to make a few quick-fire remarks first.

I think there are big problems with this. For example, there's no reason to think that your interpretability story at a given scale and capability persists: you learn some structures, you have this beautiful language of how they work, it's like sigma phi mu kappa alpha tau, great; and then you go to the next scale of capability, the one where you all die, and now the internal language is not sigma phi mu kappa but omega theta beta delta. Maybe it just doesn't stay true across training in some reliable way, or maybe when you scale, all of this breaks. That's true of many attempts at alignment, and it seems to me the key reason to stick as close as possible to things like phase transitions and universality classes, things that are by definition robust to scale; that's what critical phenomena are about: what is true in the limit. So that's one reason to stick to this story, I think:
you're building on things that are invariant to scale in some deep sense. But maybe that still doesn't work. The reason to be a bit sceptical about the correspondence with human concepts is that it's going to rely on human supervision to a large degree: you've got these evaluations, and they're written by humans. This is the key flaw in many of the current paradigms for alignment: where you're relying on human supervision, it's likely to break at exactly the moment these systems start exceeding our capabilities. I also want to point out that the notion of structure here is unsettled, but is meant to be not necessarily the same as current notions of computational programs or algorithms; it's not presuming that you're supposed to cash out your interpretation in terms of circuits or Turing machines or anything like that. And there's an analogy to materials engineering: the kind of processes we're talking about here are not fantastically different from the kind of inference about the properties of materials that is used to build semiconductors and works at scale, although one shouldn't push that analogy too far either. Okay, I think I'll stop there and take questions. Thank you, everybody.

Thanks, Dan. Have you thought about iterating the structure-finding process? It may not be necessary, if one layer of this is enough, but you can imagine expanding the third board with even lower rows of structure inference, and maybe, if there is a simple human-interpretable structure, you could chip away at it that way, refining and refining the inferences. Yeah, I think, given the computational complexity of all these tasks, it will always be a matter of degree here: the structures you infer from these singularities will always be low-resolution takes on the actual structure, and there's a trade-off between how much time it takes to infer the structure and how much resolution you get. But I don't quite understand what you mean by going beyond that; could you explain the distinction between what I just said and what you meant? Yeah, I'm not sure my idea is so fully thought out. One idea is that if you believe these models are, in practice, highly compressible, if only we had the power to compress them:
for example, I saw a paper where they took BERT, which was one of the first Transformer models, I think a little before GPT-2, and some other people made a version of it that was a fraction of the size in terms of the number of parameters but kept most of the performance. So you could imagine that the smaller model might be easier to run interpretability techniques on, and so maybe there's an intermediate layer between what you've got here as the weights and the structures: one with the compressed weights. And I guess if the compressed weights have only the relevant structures, then that might make it easier to infer them. Yeah, I think that's a good suggestion. Maybe that's to do with the locality hypothesis; locality is maybe also a step where you try to figure out which weights actually matter and which you can ignore. You could view it as training a student network or something; you'd want to do it fairly cheaply, I guess, but there could be some process of not just throwing away some directions in weight space but, in a more intelligent way, trying to extract the parts of the model that you actually want to probe. Yeah, that makes sense. So with that, what I'm thinking about in terms of iterating the structure inference itself is something like: what if, in the end, distillation, or this compression technique of training a student network, is actually not all that different from what you're talking about as structure inference? Then you could have inferring a little bit of structure, and from that you infer a little bit more structure, and you turn this into a grid, where you've got the most complex representation of the weights in the top row and the human-interpretable representation in the bottom row, but in between there's a more smoothly varying version. I don't really know that distillation and this structure inference are anything alike, but maybe they're a little bit alike. It makes me think a bit about the very first interpretability papers on ImageNet, which were about running gradient descent against a feature activation to figure out what source image most activates that node.
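That technique, feature visualisation by gradient ascent on the input, is easy to sketch. A minimal, hedged version follows; the untrained toy model, the target channel, and the hyperparameters are stand-ins (the original ImageNet work used trained networks and extra regularisation):

```python
import torch

torch.manual_seed(0)

# Stand-in for a trained network; in practice you would load a real model.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=5),
    torch.nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32, requires_grad=True)  # start from noise
opt = torch.optim.Adam([x], lr=0.1)

for step in range(200):
    opt.zero_grad()
    act = model(x)[0, 3].mean()   # mean activation of channel 3 (arbitrary target)
    (-act).backward()             # minimising -act = gradient ascent on act
    opt.step()

print(float(model(x)[0, 3].mean()))  # the activation is now far above its initial value
```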