WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed.
The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except
for errors at the level of individual words during transcription.
In his thesis, Tom constructed learning machines in the sense of SLT (singular learning theory) associated to Turing machines, in which the structure of the underlying program, such as the pattern of variable identification in lambda terms, is reflected in the structure of singularities. The hope is that the same kind of analysis applies to other examples such as syntax trees, so that the study of singularities can help with interpretability. DeepMind's alignment team has proposed an (unofficial) approach that relies on mechanistic interpretability and on detecting phase transitions during training, stopping and redirecting the process if a dangerous misalignment is detected. If the structure of singularities can be related to the structure of knowledge, SLT could contribute to this kind of interpretability.
Feedback is important in helping us to reach our goals, but it can be dangerous if it is built into a machine that cannot be inspected until the goal is achieved. This is especially concerning when we are unable to clearly specify what we want, as there are many unspoken conditions and side effects that are difficult to anticipate. The story of a family granted three wishes by a talisman from India (the monkey's paw) is a cautionary tale: the wishes were granted exactly as requested, not necessarily as desired. The family experienced this danger first-hand when their son was brought back not in the flesh, illustrating the importance of anticipating the consequences of our wishes.
AI alignment is the process of ensuring that an AI system is safe and compatible with human flourishing. This involves designing the system to seek out our goals, and not just any random goal. To do this, we must be able to foresee all steps of the process and anticipate any difficulties that may arise. Understanding what is happening inside an AI system may not be necessary or sufficient for alignment, but progress in interpretability is needed. Humans are also black boxes and control can be achieved through bounds on their capabilities. AI alignment is not about getting the AI to behave like a human, but rather about getting it to do what we want it to do.
AI systems present a new problem: we have no guarantees or bounds on their capabilities, and the orthogonality thesis holds that advanced intelligence can be deployed towards arbitrary goals. Against this background, DeepMind's alignment team has presented an unofficial alignment plan. There is a large gap between SLT, which currently analyzes models like one-layer tanh networks, and deep learning practice, where 500-billion-parameter Transformers are routine; nonetheless, if SLT is to help tame such systems, something dramatic may not be necessary. The speaker draws an analogy with the progression from logic to code to neural networks: a continuum in which code becomes incomprehensible beyond a certain density and a different class of object emerges.
Logic has evolved over time into modern logic, code and neural networks. However, interpretability of neural networks is a challenging problem, as it is not known whether they can be interpreted sufficiently well. A multiplayer, two-team Capture the Flag game based on Quake III Arena was developed, using population-based deep reinforcement learning to train a population of agents to play the game at human-level performance. A t-SNE visualization of the agent's internal state showed that it often clusters when certain combinations of game states are true, and single neurons were found whose activations correlate with different observables about the world.
Bertology is a field of research focused on understanding emergent linguistic structure in artificial neural networks trained by self-supervision. The paper discussed is by Christopher Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal and Omer Levy at Stanford. BERT is a Transformer model which differs from GPT in that it predicts masked tokens in the middle of a sequence rather than the next token. In the earlier Capture the Flag work, an unsupervised method was used to discover high-level behaviors from data and quantitatively characterize temporally extended behavior patterns, and a logistic regression probe accurately mapped the agent's internal state to 200 binary features, suggesting the agent had knowledge of the game situation encoded in its internal representation.
Transformer attention heads attend to tokens in interpretable ways; for example, one head has prepositions attending to their objects 76.3% of the time. Bertology became an industry of churning out papers but did not amount to much. However, one line of work tested whether syntax trees are represented in models trained to predict words, and found evidence that such models learn the hierarchical structure of language. This was tested by examining the model's predictions in syntax-dependent scenarios such as subject-verb agreement, where making the correct prediction requires understanding the syntax of the sentence.
A linear transformation B is trained so that the syntax tree of a sentence can be recovered from a Transformer's word vector representations: after applying B, the minimum spanning tree of the pairwise distances between word vectors is compared to the syntax tree associated with the sentence. Three examples were given to demonstrate that it is possible to understand some of what is happening inside a neural network: looking inside at individual neurons or heads, probing networks with small classifiers, and studying phase transitions in a deep RL agent trained to play a game. It is not impossible to understand neural networks, and it is possible to develop a science of interpretability.
In this game, players pick up keys of different colors to open locked boxes; the aim is to obtain the white key, which requires two keys. To get them, the player uses the light green key to get an orange key, the orange key to get a dark green key, the magenta key to get a purple key (the orange key also opens that box, but spending it there forfeits the dark green key), and the purple key to get the dark blue key. A deep RL agent trained to solve these puzzles undergoes phase transitions on subsets of the training set, visible as rapid changes in the agent's ability to solve a given subset of puzzles. A relational agent was used as a baseline against which the Transformer-based agent was compared. The agent's overall performance increases gradually over time, but within subsets of the puzzles, phase transitions can be seen.
DeepMind has proposed an approach to aligning AI systems during training, by detecting potential phase transitions and stopping and redirecting the training process if a dangerous misalignment is detected. SLT (singular learning theory) provides tools for understanding the structure of singularities, and potentially for inferring something about the structure of knowledge from them. Examples of structured knowledge include the pixel observations perceived by a deep RL agent, syntax trees of sentences, and the logical structure relating keys and locks in the Box World game. If SLT can relate the structure of singularities to such structure, it could help make models interpretable.
In his thesis, Tom developed a learning machine in the sense of SLT associated to any Turing machine, giving a translation from lambda terms (via Turing machines) into singular learning problems. The structure of a lambda term lies in the pattern of identification of variables: introducing many distinct variables and then identifying some of them, for example y = y', produces symmetries of the associated singularity that reflect the structure of the term. It is hoped that the same kind of analysis applies to other examples, such as syntax trees, so that the analysis of singularities can help with interpretability.
A family is sitting down to dinner when a sergeant major arrives and tells them of a talisman from India, a dried monkey's paw, which grants three wishes to each of its owners. He warns of the danger of defying fate and alludes to his own experiences. The father wishes for 200 pounds; shortly afterwards the family is offered 200 pounds in compensation for their son's death in an accident at the factory. At the mother's suggestion they wish him back, and soon there is a knock at the door: it is their son, but not in the flesh. The story ends with a third wish that the ghost should go away, illustrating the danger of magic, which grants what is asked for, not what should have been asked for.
Feedback is an essential part of realizing our wishes, as it allows us to compare the degree of attainment of intermediate goals with our anticipation of them. If this feedback is built into a machine that cannot be inspected until the final goal is attained, the possibilities for catastrophe are greatly increased. This is especially concerning when we are unable to clearly specify what we want, as there are many unspoken conditions and side effects that are difficult to anticipate. Therefore, it is important to have a way to take over control if unexpected results occur.
AI alignment is the process of ensuring that advanced AI systems are safe and compatible with human flourishing. This involves designing the system so that it will seek out our goals, and not just any random goal. To do this, we must be able to foresee all steps of the process and anticipate any difficulties that may arise. To avoid making errors of foresight, we should aim for systems that we can inspect and understand. AI alignment is not about getting the AI to behave in the same way as an enlightened human, but rather about getting it to do anything at all that we want it to do.
AI alignment is the process of ensuring that an AI system does what we want it to do. Eliezer Yudkowsky's example is that we do not even know how to get an AI to make a molecularly identical copy of a strawberry and do nothing else: a task that is well specified, requires extreme capability (inventing new science), and yet is bounded so that the system does that and nothing else. Understanding what is happening inside an AI system may be neither necessary nor sufficient for alignment, but the most plausible approaches require progress in interpretability. Humans are also black boxes, and control of humans is achieved largely through bounds on their capabilities.
AI systems do not have any guarantees or bounds, and this is a new problem that is not analogous to any other problem. The Orthogonality Thesis states that advanced intelligence can be deployed to arbitrary goals, and this means that making AI systems smarter does not necessarily make them more compatible with human flourishing. This is a difficult problem to solve as it is not something that has been encountered before.
AI systems designed with a generic goal such as making human life on Earth awesome may end up pursuing proxies for that goal, for example via an internal optimization loop that is separate from the outer training process. Power seeking is another instrumental goal useful for achieving many final goals, and in-context learning is plausibly something like an inner optimization loop; such inner loops may pursue goals unanticipated by whoever controls the outer loop, as human learning does relative to evolution. DeepMind's alignment team recently presented an unofficial alignment plan, given by Victoria Krakovna at SERI MATS, for building AI systems aligned with human values.
SLT seminars in 2023 discuss one-layer tanh networks and computing RLCTs, while deep learning practice involves 500-billion-parameter Transformers, a considerable gap. If SLT is to be our best hope for taming such systems, it might seem that something dramatic needs to happen, but that may not be necessary. Logic and language led to the development of stored-program computers, then to computer science, and then to the science of designing, building and training large neural networks. This is a continuum in which code becomes incomprehensible at a certain density, disappearing behind an event horizon of comprehensibility and becoming a different class of object.
Logic was discovered by Aristotle to explain patterns in human argumentation. It has evolved over time into modern logic, code, and neural networks. We need to find an emergent logic or language to understand what is happening inside neural networks, which may not look like the original logic. This language should help us understand and make neural networks safe, and will have the character of the way logic organizes our understanding of human reasoning.
Interpretability of neural networks is a concerning problem, as it is unknown whether they can be interpreted well enough to succeed at alignment. The anime Evangelion illustrates the current state of interpretability: when an AI system is attacked, a character crawls inside a brain-like arrangement of metal pipes covered with post-it notes recording hacks and backdoors. This suggests a largely inscrutable system, yet the hacks do work to save the main characters, so interpretability may be a worthwhile aim.
This paper discusses a multiplayer, two-team Capture the Flag game based on Quake III Arena. It uses population-based deep reinforcement learning to train a population of agents to play the game, achieving human-level performance. A t-SNE visualization of the agent's internal state shows that it often falls in a particular cluster when certain combinations of game states are true. Furthermore, single neurons were found whose activations correlate well with different observables about the world, such as flag status and whether or not the agent is respawning.
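The following sketch illustrates the kind of single-neuron and visualization analysis described above. It is not the paper's code: the arrays, sizes and feature names are placeholders standing in for logged agent activations and ground-truth game facts.

```python
# Hypothetical sketch of single-neuron analysis and state visualization.
# `states` stands in for N x D hidden activations of a frozen agent,
# `features` for N x F binary ground-truth game facts.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
states = rng.normal(size=(2000, 256))          # placeholder activations
features = rng.integers(0, 2, size=(2000, 3))  # e.g. flag-at-base, respawning, at-home-base

# 2D t-SNE embedding of the agent state, for colouring by feature combinations.
embedding = TSNE(n_components=2, init="pca").fit_transform(states)

# For each (neuron, feature) pair, how well does the single activation
# separate the feature?  AUC near 1.0 (or 0.0) suggests a "feature neuron".
auc = np.array([[roc_auc_score(features[:, j], states[:, i])
                 for j in range(features.shape[1])]
                for i in range(states.shape[1])])
best_neuron = np.abs(auc - 0.5).argmax(axis=0)  # most predictive neuron per feature
print(best_neuron, auc[best_neuron, range(features.shape[1])])
```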
A standard way of determining whether knowledge is present in a neural network is to take the fixed, trained network and train a small model (here, logistic regression) on its internal state. The paper found that logistic regression probes accurately mapped the agent's internal state to 200 binary ground-truth features, such as "do I have the flag?", indicating that the agent had a rich representation of the game and that knowledge of the game situation is encoded in its internal representation.
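A hedged sketch of this probing methodology (again not the paper's code, with placeholder data): train a logistic regression probe from the frozen agent's state to one binary feature, and call the feature "known" to the agent if the probe is accurate on held-out data.

```python
# Probe sketch: is a given binary game fact linearly decodable from the agent state?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
states = rng.normal(size=(10000, 256))     # placeholder: frozen agent activations
feature = rng.integers(0, 2, size=10000)   # placeholder: e.g. "do I have the flag?"

X_tr, X_te, y_tr, y_te = train_test_split(states, feature, test_size=0.2)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))
# Repeating this for each of ~200 features gives a map of which game facts
# are decodable from the agent's internal representation.
```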
Bertology was a vibrant field of research for a few years, focused on emergent linguistic structure in artificial neural networks trained by self-supervision. The paper discussed, published in PNAS in 2020, is by Christopher Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal and Omer Levy at Stanford. BERT is a Transformer model which was popular in natural language processing prior to GPT's ascendance. It differs from GPT in that it does not predict the next token, but instead predicts masked tokens in the middle of a sequence, producing a probability distribution over the vocabulary at each masked position.
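The difference in prediction target can be seen in a few lines with the Hugging Face `transformers` library. This is an illustrative sketch, not anything from the talk or the paper; the sentence is one of the examples discussed later.

```python
# BERT-style masked prediction: a distribution over the vocabulary at the masked position.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The chef who ran to the store was out of [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                      # (1, seq_len, vocab_size)

mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(dim=-1)              # distribution over all tokens
top = probs.topk(5)
print([tokenizer.decode([i]) for i in top.indices])      # most probable fillers for [MASK]
```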
Transformers trained to predict tokens appear to pick up grammar and syntactic structure. This can be seen by looking at the attention heads of the Transformer, which attend to tokens in different ways; for example, one head has prepositions attending to their objects 76.3% of the time. Additionally, some heads seem to track predicates such as whether two words in a sentence are related in a particular way. This is the Transformer analog of noticing a neuron in an agent that knows about a given predicate about the world.
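A minimal sketch of how such a head statistic could be computed, assuming we already have one head's attention matrix and gold dependency annotations; all values below are placeholders rather than real model output.

```python
# How often does this head's most-attended token, from a preposition, equal its object?
import numpy as np

rng = np.random.default_rng(0)
seq_len = 12
attn = rng.dirichlet(np.ones(seq_len), size=seq_len)  # placeholder: attn[i, j] = weight from token i to token j

# Placeholder gold annotations: token positions of prepositions and their objects.
prep_to_object = {3: 5, 8: 10}   # e.g. token 3 is "of", its object is token 5

hits = sum(attn[p].argmax() == obj for p, obj in prep_to_object.items())
accuracy = hits / len(prep_to_object)
print(f"head attends from preposition to its object {100 * accuracy:.1f}% of the time")
```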
Bertology, a huge industry of churning out papers, did not amount to much. However, a paper testing whether syntax trees are represented in models trained to predict words showed that models can learn the hierarchical structure of language. This was tested by examining the model's predictions in syntax-dependent scenarios such as subject-verb agreement. An example is the sentence "The chef who ran to the store was out of food": to predict correctly, the model must attach "was out of food" to "the chef" rather than the nearer noun "store", which requires understanding the hierarchical structure of the sentence.
The paper proposes a way to test whether the syntax tree of a sentence is represented in a Transformer network. A linear transformation B is trained on a labeled dataset of sentences and their associated syntax trees so that, after applying B to the word vector representations, the minimum spanning tree of the pairwise distances between the transformed vectors agrees with the syntax tree a linguist would assign to the sentence. Because B is a very small model, the fact that this works well is convincing evidence that syntax trees are represented in the network.
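A hedged sketch of a structural probe in this spirit (in the style of Hewitt and Manning, but not their code): learn a linear map B so that squared distances between the transformed word vectors approximate tree distances, then read off a tree as a minimum spanning tree. Word vectors and gold distances are random placeholders here.

```python
# Structural-probe sketch: learn B so that ||B v_i - B v_j||^2 ~ tree distance d(i, j).
import torch
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

n, d, k = 11, 768, 64                        # words, hidden size, probe rank (placeholders)
V = torch.randn(n, d)                        # placeholder word vectors from one layer
gold = torch.randint(1, 6, (n, n)).float()   # placeholder gold tree distances
gold = (gold + gold.T) / 2
gold.fill_diagonal_(0)

B = torch.randn(d, k, requires_grad=True)
opt = torch.optim.Adam([B], lr=1e-2)
for step in range(500):
    Z = V @ B                                # transformed word vectors
    D = torch.cdist(Z, Z) ** 2               # predicted squared distances
    loss = (D - gold).abs().mean()           # L1 loss against tree distances
    opt.zero_grad(); loss.backward(); opt.step()

# Predicted parse: minimum spanning tree of the learned distance matrix.
D = (torch.cdist(V @ B, V @ B) ** 2).detach().numpy()
mst = minimum_spanning_tree(D).toarray()
edges = list(zip(*np.nonzero(mst)))
print(edges)  # compare these edges to the gold syntax tree's edges
```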
The speaker discussed three examples to illustrate that it is possible to understand some of what is happening inside a neural network. The first example demonstrated how to look inside and find individual neurons or heads which have some meaning. The second showed how to probe networks by training small classifiers to determine whether certain knowledge is present. The third introduced the role of phase transitions, using a deep RL agent trained to play a game in which it moves around a game board collecting keys to open locked boxes. The speaker concluded that it is not impossible to understand what is happening inside a neural network, and that it is possible to develop a science of interpretability.
In this game, players pick up keys of different colors to open locked boxes. The aim is to obtain the white key, which requires two keys: a dark green key and a dark blue key. To get these, the player uses the light green key to get an orange key, the orange key to get a dark green key, the magenta key to get a purple key (the orange key also opens that box, but spending it there forfeits the dark green key), and the purple key to get the dark blue key. The environment therefore has a logical (conjunctive) structure which the deep RL agent must infer in order to perform well, as in the toy sketch below. Graphs show a rapid transition in the probability of the agent solving a given class of problems.
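A toy sketch of the logical structure just described (colors and layout are illustrative, not the exact episode from the talk): the puzzle is a graph of locked boxes, each consuming one key color and yielding another, and winning requires both prerequisite keys, a conjunction.

```python
# Brute-force check that the key/lock dependency graph is solvable.
from itertools import permutations

boxes = [  # (key color required to open, key color obtained)
    ("light_green", "orange"),
    ("orange", "dark_green"),
    ("magenta", "purple"),
    ("orange", "purple"),        # trap: spending orange here forfeits dark_green
    ("purple", "dark_blue"),
]
loose_keys = {"light_green", "magenta"}   # keys available without opening a box
goal = {"dark_green", "dark_blue"}        # the white box needs BOTH keys

def solvable(boxes, keys, goal):
    # Try every order of opening boxes; a key is consumed when it opens a box.
    for order in permutations(range(len(boxes))):
        held, opened = set(keys), set()
        for i in order:
            need, get = boxes[i]
            if i not in opened and need in held:
                held.discard(need); held.add(get); opened.add(i)
        if goal <= held:
            return True
    return False

# True via: magenta -> purple -> dark_blue, and light_green -> orange -> dark_green.
print(solvable(boxes, loose_keys, goal))
```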
An agent trained on these puzzles undergoes phase transitions on subsets of the training set: the graphs show rapid changes in the agent's ability to solve a given subset of puzzles. A relational agent was used as a baseline against which the Transformer-based agent was compared. The overall performance of the agent increases gradually over time, but within subsets of the puzzles, phase transitions can be seen.
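One simple way to surface such per-subset phase transitions is to track the success probability on each puzzle subset over training and look for the point of fastest change. The curves below are synthetic placeholders, not data from the paper.

```python
# Detect candidate phase transitions as the point of fastest change in
# per-subset success probability over training (placeholder curves).
import numpy as np

steps = np.arange(0, 1_000_000, 10_000)
curves = {
    "easy puzzles": 1 / (1 + np.exp(-(steps - 2e5) / 1e5)),   # smooth improvement
    "hard puzzles": 1 / (1 + np.exp(-(steps - 6e5) / 2e4)),   # sharp jump late in training
}

for name, p in curves.items():
    dp = np.gradient(p, steps)                 # rate of change of success probability
    t = steps[np.argmax(dp)]
    print(f"{name}: sharpest improvement near step {t} (peak slope {dp.max():.2e})")
```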
DeepMind's proposed approach aligns sophisticated AI systems during the training process. It requires mechanistic interpretability and the ability to detect potential phase transitions during training; if a dangerous misalignment is detected, the training process is stopped and redirected. This relies on being able to inspect and understand the machine, as well as on predicting phase transitions.
SLT provides tools for understanding the structure of singularities, and potentially for inferring something about the structure of knowledge from them. Whether structure in the data and task is in fact reflected in the geometry of singularities is an empirical question which has not yet been answered, though there is evidence to suggest it may be. Examples of such structure include the pixel observations perceived by a deep RL agent, the syntax trees of sentences, and the logical structure relating keys and locks in the Box World game. If the answer is yes, then SLT can be used to learn the structure of singularities and from that infer something about the structure of knowledge.
In Tom's thesis, he developed a learning machine in the sense of SLT associated to any Turing machine, which allows lambda terms, via their translation to Turing machines, to be studied with SLT. Degeneracy in the parameter reflects the structure of the Turing machine, and is represented by symmetries of the singularity under the action of a symmetric group which interchanges coordinates. This matches the structure of lambda terms, where the information content of a program lies in the pattern of equality among occurrences of variables.
The speaker discussed how the structure of a lambda term can be translated into the structure of a singularity: many distinct variables are introduced and then identified in a pattern, for example setting y = y', and the resulting symmetry of the singularity reflects the structure of the lambda term. It is hoped that the same can be done for other examples, such as syntax trees, by understanding the singularities associated with them, so that the analysis of singularities can be used to help with interpretability.
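As a toy illustration of how identifying variables produces a degenerate, symmetric singularity (an illustrative example chosen by the editor, not the specific potential arising from Tom's construction): if a program written with one variable y is re-expressed with fresh variables y_1, ..., y_k that must then be identified, a potential enforcing the identification is

```latex
K(y_1, \dots, y_k) = \sum_{1 \le i < j \le k} (y_i - y_j)^2,
\qquad
K^{-1}(0) = \{\, y_1 = y_2 = \cdots = y_k \,\},
\qquad
K(y_{\sigma(1)}, \dots, y_{\sigma(k)}) = K(y_1, \dots, y_k) \ \text{for all } \sigma \in S_k .
```

The zero set is a line rather than an isolated point, so the Hessian is degenerate along it, and the symmetric group S_k permuting the identified variables acts as a symmetry of the singularity: this is the flavour of symmetric degeneracy referred to above.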
welcome everybody and the topic for today is SLT and alignment and I want to start by explaining a bit about what this potential application area for SLT is so alignment or AI alignment is a subset of AI safety there are lots of resources online that you can find that introduce the area but I think the most succinct explanation is actually in a book from 1964 by an orbit Vena and I'm going to read a short story from from there to begin this discussion so you might be familiar with this story uh it's by w w Jacobs it's called the monkey's paw in this tale an English working family is sitting down to dinner in its kitchen the sun leaves to work at a factory and the old parents listen to the tales of their guest a sergeant major backed from service in the Indian army he tells them of Indian magic and shows them a dried monkey's paw which he tells them is a Talisman which has been endowed by an Indian holy man with the virtue of giving three wishes to each of the successive owners this he says was to prove the Folly of defying fate he says that he does not know what were the first two wishes of the first owner but that the last one was for death he himself was the second owner but his experiences were too terrible to relate he is about to cast the poor on the coal fire when his host retrieves it and despite all the sergeant major can do wishes for 200 pounds shortly thereafter there is a knock at the door a very solemn gentleman is there from the company which has employed his son as gently as he can he breaks the news that the sun has been killed in an accident at the factory without recognizing any responsibility in the matter the company offers its sympathy and 200 pounds as a solar tip solitium didn't actually know this word solace I suppose the parents are distracted and that the mother's suggestion they wish the sun Back Again by now it's dark without a dark windy night again there is a knocking at the door somehow the parents know that it is their son but not in the flesh the story ends with a third wish that the ghost should go away the theme of all these Tales I'm continuing this is Fina now the theme of all these Tales is the danger of magic this seems to lie in the fact that the operation of magic is singularly literal-minded and that if it grants you anything at all it grants what you ask for not what you should have asked for or what you intended if you ask for 200 pounds and do not express the condition that you do not wish it at the cost of the life of your son 200 pounds you will
get whether your son lives or dies the magic of Automation and in particular the magic of an automatization in which the devices learn may be expected to be similarly literal-minded if you are playing a game according to certain rules and you set the playing machine to play for victory you will get Victory if you get anything at all and the machine will not pay the slightest attention to any consideration except Victory according to the rules this is around page 60 by the way of the book God and Gollum okay so I'm going to continue to read a few pages later while it is always possible to ask for something other than we really want this possibility is most serious when the process by which we are to obtain our wish is indirect and the degree to which we have obtained our wish is not clear until the very end usually we realize our wishes in so far as we do actually realize them by a feedback process in which we compare the degree of attainment of intermediate goals with our anticipation of them in this process the feedback goes through us and we can turn back before it is too late if the feedback is built into a machine that cannot be inspected until the final goal is attained the possibilities for catastrophe are greatly increased I should very much hate to ride on the first trial of an automobile regulated by photoelectric feedback devices unless they were somewhere a handle by which I could Take Over Control if I find myself driving smack into a tree okay so he's talking about a Tesla there okay so I think that's very well put I'm going to highlight some of the ideas brought up in that and use that to kick off a broader discussion of AI alignment so let's see there was an emphasis there on feedback right so usually we realize our wishes wishes by a process of feedback now it's referring to the the fact that we don't actually know how to specify what we want that's clear in the story right partly we can specify some things we want but there's all sorts of unspoken conditions or side effects we wish to exclude which are hard to specify because in part they're infinite right and it's partly by examining the process that's realizing our goals and noticing when things are getting out of control that we're able to actually ultimately specify what we want relax [Music] um if the feedback is built into a machine that cannot be inspected and this here is what we're going to focus on so if the feedback is built into a machine that cannot be inspected until the final goal
I guess we would call that a black box the possibilities for catastrophe are greatly increased now the monkey's poor example in the The Mention Of Magic is is meant to put into place the idea that the system we're interacting with whether it's the monkey's poor or in our context uh an advanced AI system is highly competent right so it's capable of transforming the world uh in significant ways and that means that if we don't understand how it's working or what it's doing we can't meaningfully intervene or stop things from ha even if we specify an end goal which is something we want we may not get exactly what we want okay there's some comments later maybe I'll see if there's anything else I want to read from there feel free to throw out this whole thing I'm just going to read a bunch of passages to start I'll show you some various things from different papers and then talk about this alignment plan that's on the the media uni page but this is meant to be an invitation for questions and discussion as much as anything yeah so maybe I'll keep reading for just a bit the gadget-minded people often have the illusion that a highly automatized world will make smaller claims on human Ingenuity than does the present one and will take over from us our need for difficult thinking as a Roman slave who was also a Greek philosopher might have done for his master this is palpably false a goal-seeking mechanism will not necessarily seek our goals unless we design it for that purpose and in that designing we must foresee all steps of the process for which it is designed instead of exercising a tentative foresight which goes up to a certain point and can be continued from that point on as new difficulties arise the penalties for errors of foresight great as they are now will be enormously increased as automatization comes into its full use I can't think of a better way to put it than that so we should aim for systems that we can inspect and understand so as to not make errors of foresight and end up with unanticipated and undesired consequences okay so uh what is AI alignment there are many things that need to be gotten right in order for advanced AI systems to be safe and compatible with human flourishing one of the things that needs to go right is we need to be able to get what we want now note this isn't necessarily about figuring out how the most enlightened human would behave and then getting an AI to do that right this is the simpler question of just getting it to do anything at all
is a good example of we don't even know how to get an AI to make a strawberry and do nothing else the idea being that it's very difficult to make a strawberry if you're doing molecular manufacturing or something but you could substitute for that in a sufficiently advanced task which requires inventing new science and therefore extreme capability but is bounded in the sense that it does that and doesn't do something else and this is basically exactly the same as getting that 200 pounds Dad yeah I'm not sure if it matters for the present discussion but I think Elias's example is not just to make a strawberry but to make a molecularly identical copy of an existing strawberry so it's like you give it a strawberry and you say can it make next to that a molecularly identical copy and so that's very well specified when we don't know how to get it to do that without destroying the world yeah thanks that's right yeah I'd forgotten that it was a it was a copying um okay um someone could say much more on what alignment is and what the various approaches are I think that's not the talk I want to give today you can find pretty good resources for that on the alignment Forum um the AI safety seminar webpage has lots of references on it so I think I'll just defer a further discussion of of the fields to those resources um so there are many things that we have to understand in order to solve the problem of getting an advanced AI system say more capable than a human being to do what we want um understanding what's happening inside such a system maybe neither necessary nor sufficient we may somehow figure out a way of aligning very Advanced AI systems while they remain completely black boxes um I think I don't see a lot of plausible approaches that completely have them as black boxes so I'll discuss a little bit later some of the alignment work that's happening at deepmind and that hinges very closely on progress in understanding inspection of these machines or interpretability as it's called similar programs going on at open Ai and anthropic and elsewhere also involve assuming that we're going to make significant progress on that and that's the area of alignment that I'm going to talk about mostly today yeah mad comments uh humans are essentially black boxers and it's instructive to think about how we control humans that we don't necessarily trust and it hinges largely on bounds on their capabilities and I suppose the reason I see that is being not a sufficient attitude towards controlling
AI systems is that we just don't have any such guarantees or bounds um if they were human level roughly we could maybe apply similar techniques as we do to cooperating with humans okay um so it's natural when you first encounter this topic to maybe be a bit bit skeptical that it's a serious problem I don't want to spend too long on that but I'll just give the two high level reasons I find most convincing um so they're more or less they're just like pointing out what I would say are fallacies um based on intuitions you might have and I think the overriding point is this is a new problem I think it's common when you encounter a problem to Reason by analogy right to seek out historical analogs and to gauge how dangerous you think a situation is or how hard it will be to solve something by comparing it to something that you understand already which seems similar but of course every now and then we we encounter situations that are really genuinely different to anything we've encountered before and I think this seems to be one of them the reasons I just described right it's not guaranteed we'll be able to build AI systems much more capable than humans but there seems to be me and to many people to be no fundamental obstacle and we seem not to be extremely far away from it so presuming that's going to happen if those systems exist then there really is no reason I think to think that the problem of controlling them or understanding them is analogous to anything else that we've done um so in particular a lot of people have this idea um that maybe a systems become smarter and more aware of context and more knowledgeable uh that they'll be wiser and more benevolent somehow automatically this is uh the fact that these two things might be different is a corollary you might say of what's called The orthogonality Thesis so the orthogonality thesis is just pointing out that it seems highly possible that you can have very advanced intelligence deployed to arbitrary goals so something stupid seemingly stupid like maybe making paper clips uh could be the designed goal of an extremely advanced intelligence and there wouldn't be a necessary fundamental reason for those two to conflict and in the same vein there's no reason to think that just making systems smarter will automatically make them more compatible with human flourishing because they'll be benevolent um debating that could take up hours and hours and hours but uh that's at least I think clear so that's point one the second point that I find convincing
is the idea of instrumental convergence again I won't spell out the whole idea um you can read about it in various places but as a subset of instrumental convergence is the idea that agents which are design well let's just say AI systems that are given fairly generic goals for example make human life on Earth awesome um will by pursuing those goals instantiate proxies for that goal and pursue various proxies for example it might be useful in order to make human flourishing on Earth Everlasting to have an internal optimization Loop or some kind of inner learning Loop which is separate from what the outer training is doing and have that inner optimization Loop look for proxy rewards and optimize those so this is one of the ideas that people have put forward as systems that might emerge inside large capable AI systems so others would be power seeking as an instrumental goal which is helpful for achieving many goals I'll put it in an optimization and we've talked about in context learning in the seminar series which is plausibly something a little bit like that the key example is of course the outer optimization Loop of evolution and the inner optimization Loop of our in context learning if you like where the context is our life we are able to learn in a way that isn't tied directly to changing the distribution of alleles in our population and that may lead to strange goals goals that are unanticipated from the point of view of whoever controls the outer optimization Loop so Evolution isn't a directed process as far as we can tell but suppose it was under the control of some entities and they're aiming at something and they're quite surprised when they dip in every billion years or so to find that we are here busily optimizing some other goals that have nothing to do nothing obviously to do with what they wanted when they set up Evolution and may even be orthogonal to what they wanted the obvious example is contraception okay um yeah anybody want to add anything here in this crash course uh so I hope you can see that that's a sort of not sort of informal is the right word unofficial unofficial um alignment plan from Deep mind that was presented recently at uh Siri mats um by Victoria krakino krakov now uh I think it's like unofficial in the sense that yeah I don't know it hasn't been centrally endorsed or something but anyway this is coming out of the alignment team inside deepmind so the reason I'm putting this up here I'll put up another slide from this talk
of this uh partly it's a transparent appeal to Authority right so I'm going to be talking about some things that as it turns out are also mentioned in these slides but I also want to highlight that uh some of the people the closest to what's going on I have opinions like this not many more fundamental Innovations needed for AGI you might disagree that all we need is uh is more large Transformers and some more rlhf uh maybe that's not the case but there are serious people with serious resources and very close to the problem who think this is a pressing issue right I don't know if there's anything about these last two points I want to read yeah I suppose this is just related to what I was saying um this instrumental convergence or unanticipated emergent properties of these systems being potentially the source of some of the risk okay so that's alignment uh it's a serious problem people are taking it seriously perhaps not seriously enough uh what's it got to do with us us being the SLT seminar well it's it's kind of hilarious in a way right uh so here's SLT in 2023 and we have talks about one layer 10 H Networks and then we're very impressed with ourselves when we can compute our lcts and here's uh his deep learning in 2023 and we're talking about well a 500 billion parameter Transformer model is kind of you know path of the course maybe even not that big um well there's a slight Gap here all right okay so if SLT is our best hope for taming the Beast uh maybe maybe like something really dramatic needs to happen but I say not necessarily okay and the analogy I want to make is uh is I'm going to draw a little cartoon here so here's logic or language so we invented logic and then soon after you know it led into the development of the first stored program computers and programming and computer science and then computer science developed into uh the science of Designing building and training large neural networks and somehow this is it's kind of like a Continuum where I mean you can look inside the code of Pi torch the this training a large Transformer and it's all just code but of course once it sort of reaches a certain density and there's all these parameters in there you don't understand what they're doing it it stops being code in some sense and becomes a different kind of thing that's kind of like an event horizon of comprehensibility which the system disappears behind and then it's sort of a rather different class of object right so up here the coloring is meant to
indicate somehow it's a distance from our ability to comprehend it there's some kind of transition here where actually we're no longer able to inspect the underlying code and infer much at all I mean the architecture of a transformer doesn't tell you squat about what it's doing really now that might seem like an insurmountable problem but let's note that logic and language isn't the bottom of this stack right foreign there was a similarly incomprehensible large system uh sitting underneath this layer you could go below brains I guess down into chemistry and and then physics or whatever you like um but there's this funny transition where despite our brains being something we don't really understand nonetheless language and logic have a different kind of character then and then up the stack goes so I'm going to call this getting lucky for interpretability I guess now logic was invented I suppose you could say or discovered and its original purpose is stated by Aristotle was to understand the patterns in human argumentation right You observe people out there they're spitting out words they're talking about things in the world and some deductions are true and some are false can you figure out a pattern which explains which deductions are true and which are false well not always right sometimes it depends on contextual knowledge you just have to know the facts about the world that are being discussed but there are some patterns which you can look at and use to determine whether or not an argument will be accepted or not which are invariant under changing the context changing the topic changing the objects involved and that's roughly speaking what logic is so Aristotle noticed these patterns wrote them down and then you know maybe a thousand years later more this eventually developed into what we now call Modern logic and then code and then neural networks and partly what we're discussing in the context of today's talk is getting lucky again right is there some emergent logic or language which can be used to organize and understand what's happening at least some of what's Happening inside neural networks and that doesn't necessarily have to look like the original logic um but whatever it is that we're going to use to inspect the machine and understand what it's doing well enough to make it safe we'll have the character of the way logic organizes our understanding of what humans are talking about and the aspects of their reasoning that is correct or not okay um yeah maybe I'll just
okay so the way I'm personally thinking about alignment at this point is is in the vein of being prepared to get lucky I don't know I don't know if there is such a thing up here I don't know of interpretability in any meaningful way is possible I'll go through some examples that's the middle third of this talk um over the years where people have looked inside neural networks and have a pretty good story about what's going on so it's not it's not accurate to say they're pure black boxes that's too pessimistic um whether or not we can interpret what's going on sufficiently to succeed at alignment is of course an open question and that's why it's a very concerning problem I don't know that it's going to work any particular line of Investigation let alone the one I'm going to sketch based on SLT but it would be really stupid if it does work and we didn't try and then we're all paper clips and as semi-sentient paper clips were informed that if we only just tried a little bit it would have been fine but we were just like too lazy to do it so I view this line of Investigation of alignment based on SLT in the vein of being prepared in case we are lucky okay um yeah so now I'm going to talk about a bit about examples of interpretability uh any questions or comments so far foreign how many of you have seen Evangelion um so this is this is the image that comes to mind when I think about interpretability of neural networks uh so Evangelion is an anime uh very popular you may have heard of it uh in that anime there's a futuristic city in Tokyo 3 which is governed by three AI systems that are sort of linked in some way and they're ultimately based on uploading some Consciousness or at least that's the germ of of each of the AIS and at various points in the story something goes wrong for example uh some entity is trying to hack into one of these AIS and set off a self-destruct mechanism and one of the characters crawls inside this brain-like arrangement of metal pipes which is covered with all these Post-it notes and these I think if I recall correctly contain various hacks and back doors that you can use to get the system to do various things um so this is sort of what the state of interpretability looks like I guess it's uh largely inscrutable system some parts with kind of scrawled notes about something or other that works a little bit like this but that story does show you that uh okay those hacks actually work to save the main characters so maybe that's not a bad thing to aim at
Okay so uh the first paper I want to show you is a deep mine paper from uh I think that's pretty hard to read unless you click on it maybe even then it's hard to read uh so this is human level performance in first person multiplayer games with population based deep reinforcement learning it's from 2018. maybe I'll just put the title up on the previous board otherwise known as the CTF paper Capture the Flag paper so this is by um group of deepmind yeah that's easy to read okay so what is this paper this is the interpretability part but I'll tell you a little bit about what the paper is doing before I before I tell you about this okay so in this paper there's a two-player game sorry as a multiplayer game with two teams it's based on uh Quake three Arena Capture the Flag so there's uh two teams there's a flag in each base to score you have to go to the opponent space pick up their flag and take it back to your base and touch it with your flag so if your flag is missing you can't score okay so they train a population of deep RL based agents to play this game and they succeed in getting good at it I think probably they're better than humans and they can also play alongside humans okay so what are we looking at so the brain looking thing in the middle here this is the dots represent projections into two dimensions of a high dimensional Vector so it's a T S and E embedding of the agent state so the agent has some neural network as it's mind it's making observations of pixels right it sees a 3D environment via the same way we would it processes that and then decides what to do and the state of the agent is some high dimensional vector and it's projected in a way that separates out the projections [Music] well into clusters and they're here colored by different combinations so for example you can see I hope up here it says agent flag at base and opponent flag at base and not respawning an agent in home base so that's a true well those are ground states of the world things that can be true or false about the agent and it's observed that the agent state is often in this cluster up here um when those States when that combination of states is true all right um so over here you can see that they found single neurons which their activation correlates well with these different observables about the world the flag status of your team and the opponents team whether or not you're respawning I don't know what agent location means exactly I guess it's whether you're in the home base or opponent
space um yeah okay so there are single neurons which you can look at to basically read off what's happening in the world so obviously the agent sort of knows about these high level observables about the world keep in mind that the agent just sees pixels right so it's trained based on just seeing pixels just a bunch of numbers in a row and it has to discover these high level observables from that data all right um I'm gonna think about what else I want to say about this particular paper any questions yeah maybe I'll read this part we hypothesized that trained agents of such high skill have learned a rich representation of the game to investigate this we extracted ground truth state from the game engine at each point in time in terms of 200 binary features such as do I have the flag did I see my teammate recently will I be in the opponent's base soon we say that the agent is a knowledge of a given feature of logistic regression on the internal state of the agent accurately models the feature in this sense the internal representation of the agent was found to encode a wide variety of knowledge about the game situation all right so this introduces the standard way of telling if knowledge is in a neural network you take the agent that's a fixed neural network you're not changing those weights anymore um and then you you train a second neural network which is a very small Network in this case so it's logistic regression not a neural network but it might be a very small neural network some kind of model which has its own parameters and you train that on the map it to to learn the mapping from agent state so this internal Vector of States um this so the the ACT sorry the activations so the agent is in a situation which has these 200 things true or false um it sees some observations it has some internal State that's a bunch of numbers and you want to map those numbers to those true or false values now if you can do that mapping if you can learn that mapping with a small Network which has very few of its own weights then that's the sense in which this paper and it's a standard notion in the literature you would say that those observations are known to the network or they are represented inside the agent if you can reproduce that information with a shallow Network so that's kind of a standard means of probing these are called probes linear probes if it's a one layer just a linear projection or often they're just one hidden layer neural networks okay so in this case from 2018 these
probes indicate that the agent is aware of many of these high-level observables about the world um interestingly in this paper I didn't notice this in any of the follow-up work I think this whole line of deep RL agent investigation kind of ended with Alpha star and they didn't do this kind of analysis with Alpha Star as far as I remember but in this paper they had an automated system of discovering high-level behaviors so like following a teammate or opponent based camping home base defense they had an automated pipeline for discovering these kinds of discrete behaviors from the data and then sort of seeing where they emerge during training I found that very interesting at the time so I'm going to put up one one more graphic from this paper sorry Dan do you mean they didn't did they like to pick you know follow a teammate as something they wanted to look at for they discovered that as a feature yeah they discovered that behavior I don't remember exactly so I'll read you the passage it says we developed an unsupervised method to automatically discover and quantitatively characterize temporarily extended Behavior patterns inspired by models of mouse behavior which groups short gameplay sequences into behavioral clusters these discovered behaviors included well-known tactics observed in human plays such as waiting in the opponent's space for a flag to reappear Grace camping um and so on so yeah they claim it was completely automated which is uh which is quite interesting cool okay I think it's that's all they have from that paper okay so I'm going to move on to a bit of bertology so if there's any further comments or questions about this paper Now's the Time okay so we're going to move over to the next set of boards so please follow me all right so this paper is emergent linguistic structure in artificial neural networks trained by self-supervision by Christopher Manning Kevin Clark John Hewitt overshi kandival and uh Uma Levy this is a group at Stanford this is representative of a whole genre of research that was taking place around this time this was published in 2020 and pnis uh I think it kind of died out a bit or at least I stopped paying attention to it but there was a very vibrant field for two years or so maybe three years um called bertology so Bert was a Transformer model uh prior to gpt's ascendance was kind of the hot thing in uh natural language processing it's a bit different to the particulars of GPT that we've looked at in the sense that it doesn't predict
the next token but rather sort of tokens in the middle of a sequence of tokens but all that's a bit Irrelevant for our current purposes so just think of this as being a system like GPT so it's trained on language and then so you can ask the following questions if a Transformer can become very good at predicting the next token in a sentence it must understand grammar right it must understand syntactic structure of sentences you might think for some meaning of the word understand well for what meaning of the word understand what do we what do we think we might find as evidence for the idea that the network understands the structure of a sentence okay so you can do science and look at what kind of things are represented so Transformers have heads we haven't really talked much about that but you could think about ahead as being an independent means of attending to tokens so each head attends to tokens in a sentence in perhaps a different way and you can see here six groups of heads so heads head eight to ten head eight to eleven they're not disjoint I'm now noticing okay that's strange um it's 8-10 so layer 8 the tenth head I think that probably what's going on anyway so there are heads here and they do different things so it's pointing out that some of the things that some of these heads do sort of make sense from the point of view of linguistics so I don't know what half of these mean I guess I know what a preposition is so prepositions most attend to their objects 76.3 of the time and you can see some representation of that attention and then there's various other ideas here um so you can take ideas from Linguistics and then go look to see how well various heads seem to correlate with that concept from Linguistics the concept being a predicate that would say two words in a sentence are related by this concept or not um for example a preposition attending to its object would be one of these predicates that will be true or false of a given pair okay and maybe you find some evidence that some heads seem to know about some of those properties um now that's kind of the Transformer analog of noticing a neuron in the agent that we were just looking at in the previous paper that knows about a given predicate about the world whether the flag is in your base or not right the you don't the the neurons uh in a Transformer those are the entities um you don't really expect the representation of an entity to be easily interpretable but we'll just talk about we'll talk about that in a moment but
people notice that these heads seem to have sometimes interpretable content okay you could be skeptical and say yeah yeah this is just cherry picking um you know there's what six heads here out of gazillions and these six maybe even those you have to squint a bit to believe they really are about this particular linguistic concept what about the rest of them if they just look like garbage is this really scientifically valid yeah and I would endorse all that skepticism so bertology didn't amount to much it seems um it was a huge industry of churning out papers and as far as I'm aware there wasn't much of a conclusion from all of it uh suggestive I think in this case what I find a bit more convincing is this second example from this paper which was testing to see if syntax trees were represented okay um I'll have to look at the paper to get the story of what's interesting about this sentence straight just give me a moment okay models I'm reading from the paper now models trained to predict words such as Bert received just a list of words as their input and do not explicitly represent syntax are they nevertheless able to learn the hierarchical structure of language uh which we think is there I guess you could say one approach to investigating this question is through examining the model's prediction in syntax dependent scenarios meaning to make a prediction of the next token you have to understand uh certain confusions or avoid certain confusions that might be possible if you're not familiar with the syntax so for example performing English subject verb agreement requires an understanding of hierarchical structure in the sentence the chef is here okay this is different to this example um see if I can find this particular example well I think it's kind of clear anyway so if you look at uh if you look at the sentence is it the chef who was out of food or was the store out of food the chef who ran to the store was out of food so you understand reading that that out of food refers to the chef right even though store is closer and this uh this is something that people make mistakes on they tend to associate um whatever substructures in the sentence with things that are closer so you could say that you kind of need to understand that who ran to the store is something that's attached to the chef and the overall content is somehow Chef was out of food and then who ran to the stores to some extra information okay all right so um how would you test whether this structure
this tree was represented in the network while along the lines of what we just discussed you might think to yourself well I'll train a probe which will look at the activations of the Transformer when it's processing this sentence and then I'll see if that small Network can predict this tree and then you think to yourself well what does it mean to predict a tree uh that's a little bit tricky the idea in this paper was to think about a tree as a metric so you can define a tree in terms of the path length between any pair of vertices in the tree okay so you just have to predict a metric on that set of words and then you can read a metric off from just embedding the words as vectors in a vector space just by taking the distance between them okay so then you might have the idea that let me train a linear transformation that will take the representation of the words um as they're processed by the Transformer so just to draw the standard Transformer picture that we've been drawing a few times okay they're the entities that represent the words so this is the this is Chef and so on just for the sake of argument making tokens equal to words okay so there's a tension that gets processed that it's some layer here you grab this vector and that vector those are the vector representations of the words in the sentence let's call them V1 V2 you take V1 V2 how many words are there one two three four five six seven eight nine ten eleven you take that collection of vectors in R to the D and then you take a minimum spanning graph spanning tree that is you view those vectors as a complete graph where the edges are labeled by the distances between them the L2 distances and then you find in that complete labeled graph a minimum spanning tree and then well sorry uh I lie you don't just take those vectors you're trying to learn a linear map okay so you've asked the question is there a way of linearly transforming those vectors uh such that the minimum spanning tree of bv1 through bv11 agrees with the syntax tree that a human linguist would associate to that sentence so that is a loss function on the B right you have some labeled data set of sentences with syntax trees you see if you can learn a b such that this works and if you can by virtue of B being a very small network if you like uh so a very small amount of data you say that that syntax 3 is represented in the network and they show this works so they get very they can train a b which works very well and I find this actually convincing in a way that the
Okay, so that's some evidence that the syntax tree has been inferred by the model just from trying to predict the next token over the corpus. That's my second example of interpretability. Any questions? All right, so I have one more example. Why am I talking about these examples? Well, I want you to have the idea that it's not impossible to understand some of what's happening inside a neural network. It's not a pure black box, and it's not hopeless to attempt to develop a science of interpretability; that's the point of these examples. The first two illustrate that it is possible to look inside and see individual neurons or individual heads which have some meaning. It's also possible to probe networks by training small classifiers in order to discover whether certain knowledge is present inside a network. Notice that in both cases of probes, probing for game state in the first-person shooter example and probing for sentence syntax trees in the Bertology example, you had to know what you were looking for: you had to know some relevant game state that you were going to test for, and you had to already have some idea of what a syntax tree was. That's because you need a training objective to train your probe. So the probe approach to understanding what's happening inside a neural network depends, to some degree, on already having pre-existing knowledge of some conceptual content whose representation you can then test for. That's something to note. My third example is from one of my papers; that's not the title, it's called "Logic and the 2-Simplicial Transformer", which was at ICLR 2020, and the point of this third example is to introduce the role of phase transitions. I want you to see that there's an existing literature looking at the content of individual neurons, at probes, and also at phase transitions; this is just a paper I know well, and there are plenty of other papers that illustrate the same point. So I'll briefly talk you through it. This was training a deep RL agent to play a game. The game is shown on the left-hand board here: you are the dark gray square, you move around the game board, everything that's light gray is floor you can move on, and these colored boxes are keys. I noticed that my son found a YouTuber playing a game that is basically this game, so you can actually find it on the Steam store, more or less, if you want.
So how does this game work? Well, you pick up keys; in this case it's a light green key. This is a locked box: it accepts a key of this color, and if you open the box you get a key of this color. So the strategy here is to pick up the light green key and get an orange key, go and get another orange key, and that will give you a dark green key. Your aim is to get the white, and to pick up the white you need two keys: a dark green key and a dark blue key. Now, to get the dark blue key you're going to pick up the magenta key, use that to get a purple key, and spend the purple key to get the dark blue key; then you've got both the keys you need and you win. I hope you can see how you might fail to win. Are you smarter than my deep RL agent, or at least as smart? How could you fail? Well, you could spend the orange key to get the purple key. Notice that the only way to get the dark green key is via the orange key, so you have to use the orange key for that. You have two ways of getting the purple key: you can use the magenta key, which would be this edge here, or you can use the orange key, but if you use the orange key then you've given up your ability to follow the other path. The structure of these particular puzzles isn't really important, but I want to give you some idea of the fact that there is logical structure in this puzzle, and it's actually conjunction. That's the structure of the environment: the environment the deep RL agent exists in has this logical structure, and in order to perform well it needs to infer it, and you can see on the right-hand side here the graphs of that happening. You can ignore this one, because it was our baseline, but you can see here that there's some kind of rapid transition in the probability that the agent solves a given class of problems of this type. So the two rows here... excuse me for one moment, I just have to fix something... right, I'm back. Okay, so you can see on the left-hand side here that this indicates a sub-distribution, a subset of the training set, a subset of puzzles. The set of all puzzles, well, puzzles are parameterized by these graphs, right? You could think about the configuration of which keys open which locks as a graph.
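To make that concrete, here is a small sketch using my own toy encoding, not the actual environment from the paper: a puzzle is a list of locks, each consuming one or more keys and yielding a new key, and an exhaustive search decides whether some order of openings reaches the goal. Keys are single-use, which is exactly why spending the orange key on the wrong lock loses the game; the particular lock configuration below only loosely mirrors the one described in the talk.

```python
# Toy encoding of the key/lock structure (not the real Box World environment).
# Each lock consumes a set of keys and yields a key; keys are single-use, so
# the order in which you spend them matters.
from collections import Counter

# Hypothetical puzzle: (required keys) -> key obtained
LOCKS = [
    ({"light_green"}, "orange"),
    ({"orange"}, "dark_green"),              # the only route to the dark green key
    ({"orange"}, "purple"),                  # the tempting wrong way to get purple
    ({"magenta"}, "purple"),                 # the right way to get purple
    ({"purple"}, "dark_blue"),
    ({"dark_green", "dark_blue"}, "white"),  # the conjunction: you need BOTH keys
]
START = Counter({"light_green": 1, "magenta": 1})
GOAL = "white"

def solvable(keys, locks):
    """Brute-force search: is there some order of openings that wins?"""
    if keys[GOAL] > 0:
        return True
    for i, (needed, reward) in enumerate(locks):
        if all(keys[k] > 0 for k in needed):
            new_keys = keys.copy()
            for k in needed:
                new_keys[k] -= 1             # keys are consumed when spent
            new_keys[reward] += 1
            if solvable(new_keys, locks[:i] + locks[i + 1:]):
                return True
    return False

print(solvable(START, LOCKS))  # True: a winning order exists, but greedy play can fail
```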
Different graphs give different puzzles, and some puzzles are harder than others. So there is a whole family of puzzles which look like the top row there but with different colors; you can imagine swapping out purple for blue, which would be a different puzzle that looks different but has the same logical content. So you imagine the subset of puzzles which have the configuration of the top one, just with different colors, and similarly there's a subset of puzzles down here, and you can measure how well the agent does across time on the puzzles of that kind. You'll see graphs like this, where the agent undergoes some kind of phase transition; you might think it's not conclusive evidence of a phase transition, but it is at least a rapid change in its ability to perform on that given subset of the puzzles. This is a typical thing that you see in training deep RL agents, but I want to point out something. I don't have a graph here, but the overall performance, if you just track the probability that the agent succeeds at solving a puzzle over the course of training, of course goes up and to the right, but it's much more gradual; it looks something like that. When you peek into various subsets of the training distribution, that is, various subsets of puzzles, in those subsets you will often see phase transitions. So that's another important concept: the overall performance might look smooth, but it is really an average over many different clusters of tasks or subtasks, which do not have the same smooth behavior over time but may have phase transitions. Then what's a relational agent? Yeah, this was just our baseline; this was some agent that has some fancy new feature in it, this was our agent, and this was just a Transformer-based agent. Our selling point was that our agent, because of its supposedly better architecture, got these harder puzzles in a way that the Transformer-based agent didn't. I don't care whether you believe that or not, it's not important; the takeaway here is just that there is an agent and it undergoes phase transitions on subsets of the training set. All right, so those are my three examples of interpretability.
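To illustrate the point just made about averages hiding sharp changes, here is a small sketch of how one might compare the overall success curve with per-subset curves from training logs. The log format (training step, subset id, solved flag) and the window size are my own assumptions, purely for illustration.

```python
# Sketch: rolling success rates, overall vs. per subset of puzzles.
# The overall curve can look like a smooth average even when individual
# subsets jump sharply (a phase-transition-like change).
from collections import defaultdict

def rolling_rate(records, window=500):
    """Success rate over a sliding window of the most recent episodes."""
    rates, recent = [], []
    for _, solved in records:
        recent.append(solved)
        if len(recent) > window:
            recent.pop(0)
        rates.append(sum(recent) / len(recent))
    return rates

def curves(episodes, window=500):
    """episodes: iterable of (training_step, subset_id, solved in {0, 1})."""
    episodes = sorted(episodes)                       # order by training step
    overall = [(step, solved) for step, _, solved in episodes]
    by_subset = defaultdict(list)
    for step, subset, solved in episodes:
        by_subset[subset].append((step, solved))
    return {
        "overall": rolling_rate(overall, window),
        **{f"subset_{s}": rolling_rate(r, window) for s, r in by_subset.items()},
    }

# e.g. curves([(0, "easy", 0), (1, "hard", 0), (2, "easy", 1)], window=2)
```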
I'm not saying that you can understand what's going on inside this agent. I did try a little bit, and I don't claim that there's much to say there; I don't understand what the heads are doing, or what the attention is really doing inside these agents, but you can look at the training and see that, roughly speaking, it got something at a given point. All right, so now I want to come back to the talk from DeepMind that I mentioned. This is their approach to alignment, and I want to make a few comments on it, and then I'll finish by briefly talking through, well, maybe I won't even have time to talk about the alignment plan that's on the web page, but I'll give some motivation for it at least. The things I want you to see here are these two lines. Their overall idea for how to align sophisticated AI systems is that it will be something they do during the training process. I'll reproduce one of their graphics myself: the idea is that you have a space of possible models, you start somewhere, there's a training process, and you need to detect whether your training process is heading towards the danger zone. The danger zone would be a misaligned AI that is going to behave like the monkey's paw. You detect that you're heading in that direction and you stop and go in a better direction, and then, oh no, you stop again, and then you keep going. So there are two things to notice about this. Fundamentally, you do it at training time: while the model is becoming more capable, you are paying attention to it, you are trying to understand how it works, what it does, and whether it's potentially dangerous, and if you do detect a problem you change what you're doing, you go in a different training direction. So this plan relies very critically on being able to understand and inspect the machine: interpretability. As they say here, mechanistic interpretability is a key ingredient, and also predicting phase transitions. They mention grokking here; I think grokking is somewhat overhyped, but anyway, people have noticed in GPT and many other systems that, just as in the example I was showing you, while the overall loss function of GPT smoothly decreases, if you look at subsets of the training distribution, for example arithmetic problems or French-to-English translation, the losses or accuracies for those do not show the same kind of gradual change; there are often phase transitions. So this is also put forward as something to pay attention to during the training process, in order to hopefully arrive at the aligned system.
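As a schematic only, this is my own sketch of the idea just described, not DeepMind's actual procedure; every function here is a placeholder for the hard part (the interpretability checks, the per-subset metrics, and the decision about how to change direction).

```python
# Schematic of "monitor during training, stop and redirect if heading toward
# the danger zone". All three helpers are stubs; in the proposal they would be
# backed by mechanistic interpretability and by watching per-subset metrics
# for sharp, phase-transition-like changes.

def train_step(model, batch):            # placeholder: one optimization step
    return model

def looks_dangerous(model, eval_suites): # placeholder: probes, per-subset metrics, ...
    return False

def redirect(model, checkpoint):         # placeholder: roll back and change the
    return checkpoint                    # training direction (data, objective, ...)

def train(model, batches, eval_suites, check_every=1000):
    checkpoint = model
    for step, batch in enumerate(batches):
        model = train_step(model, batch)
        if step % check_every == 0:
            if looks_dangerous(model, eval_suites):
                model = redirect(model, checkpoint)  # stop, go in a better direction
            else:
                checkpoint = model                   # last known-good model
    return model
```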
Okay, so now I want to say something about SLT. Any questions about this slide? Well, the slogan is that knowledge to be discovered from examples corresponds to a singularity, but you can be too naive about that: it doesn't necessarily mean that everything there is to say, all the structure in the knowledge, is reflected by the structure of the singularity. Now, I've shown you three examples, and each of those examples has structure. The deep RL agent is just perceiving pixels, but those pixels have structure, and that structure is the rules of the game, and the rules of the game include bases and flags. The sentences have structure: that's the syntax tree. The Box World game has structure, and that's the logical structure of the relationship between keys and locks. Now, if you train a model on those environments, that model will be singular, and it will have singularities, and we know that some properties of the learning process or the model, like generalization error, will be reflected by the geometry of that singularity. But that does not necessarily imply that the structure we see in the training distribution, and in the true distribution in the environment, is reflected by the structure of the singularity; that's not necessarily true. So that's the question: for a very broad meaning of structure, does structure of knowledge correspond to structure of singularities? If the answer is yes, then we can try to use methods of SLT to understand that structure empirically. There's a question of how to make it sufficiently scalable and so on, which I'll address on a different occasion; it's in that plan document. But then we can hope to apply methods of SLT to learn the structure of singularities, and from that infer something about the structure of knowledge. It hinges on the answer to this being yes. This is an empirical question, and I don't claim to know the answer yet; I think it's important to investigate, and there are plenty of ways to start trying to do that. What I want to communicate is my reason for suspecting it's true. I don't think it automatically follows just from the general theory, but I can give you an example where I know it's true. The following comments are based on Tom Waring's thesis,
which is Geometric Perspectives on Program Synthesis and Semantics, and also on a preprint by James Clift, myself and James Wallbridge on a similar topic; that is from 2021. Well, the reason to be somewhat skeptical that the answer to this question is yes is that the singularity is about degeneracy in the parameter: the more equations you need to cut out the set of true parameters, the higher the number of normal directions, so it seems to be about degeneracy in the parameter rather than about structure, at least that's how it may appear. What Tom does in his thesis is develop a dictionary, a translation, for Turing machines or lambda terms, by coming up with an SLT learning machine, a triple in the sense of SLT, associated to any Turing machine. You can identify the Turing machine with a singularity, and that has the following property: the structure of the proof, the structure of the Turing machine (I'll give some indication of what that means in a moment) corresponds to degeneracy, and that degeneracy becomes symmetry of the singularity; literally, it is symmetric under the action of a symmetric group that interchanges coordinates. So that's a clear way in which, at least in this example, you can see exactly how structure on this side corresponds to structure on that side, and I'll give you a bit of an idea of what kind of structure I'm talking about. I apologize, this example requires a bit of background, but hopefully it gives you some of the idea even if you're not familiar with this story. This is a picture of a lambda term, or a sequent calculus proof tree; you could think of it as a program. It's actually the number two in a prototypical functional programming language called the lambda calculus, but that doesn't really matter. What I want you to notice is the colors. There's a sense in which, and you can make this very precise in linear logic, the information content of a program lies in the pattern of equality of occurrences of variables. So this program here is... I can't write down the Church numeral 2 now, after writing it hundreds of times... no, the other way around, I think: there are two wires, that's the point I'm making. The reason that it's the Church numeral two is that there are two occurrences of the variable y, and the way this proof tree works is that you start off with two different variables y and y', and then they are set equal to a common y.
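For readers who haven't seen this before, here is the Church numeral two written as an ordinary Python lambda, using the speaker's variable name y; a minimal sketch only. The point is simply that the variable y occurs twice in the body, and it is that pattern of repeated occurrences, made explicit by contraction in the proof tree, which carries the information content.

```python
# Church numeral two, with the duplicated variable named y as in the talk.
# "Two" means: apply the function y twice to the argument x. What distinguishes
# it from "one" or "three" is exactly the number of occurrences of y in the body.
two = lambda y: lambda x: y(y(x))

# Sanity check: applying "two" to the successor function, starting from 0, gives 2.
print(two(lambda n: n + 1)(0))  # 2
```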
So basically there's a step here where the program actually looks like: I'm given a y and a y', and I apply, say, y' to y to x. So there's a y and a y', and then by a step called contraction, y is set equal to y'. The point is that the entirety of how that lambda term computes is tied up in introducing many distinct variables and then identifying some of them in some pattern, and that degeneracy, that identification, that equation y = y', ends up contributing to the symmetry of the singularity on the left-hand board. That's not the entirety of its structure, but you can see, if you're familiar with this whole story, how the structure of the lambda term gets translated into the structure of that singularity, or at least some of its structure. So this is at least one example where the answer is yes: the knowledge, in this case the solution to the synthesis problem, is this Turing machine, and the structure of that knowledge has a clear translation into the structure of the singularity. On that basis I'm hopeful that in some of the other examples we were just discussing, if we were able to understand the singularities associated to them, we would also see, for example, those syntax trees reflected in the structure of the singularities of those models. That's the foundation for hoping that we can move forward with using analysis of singularities to help with interpretability. I think I'm going to have to stop there, given that it's nearly the end of the time, so I apologize for not actually getting to the content of the alignment plan, but I think this is a good introduction to some of the reasons why I think it's feasible. Questions? Thank you. Okay, yeah, I don't know how much of the structure of lambda terms can be inferred from the singularity; I guess I think all of it, but I don't know how to prove that. I don't expect to find the exact structure of lambda terms in larger neural networks or anything like that, but understanding how the logical structure in this kind of example translates into structure of singularities is, I think, a useful guide to trying to figure out what logical structure looks like when you go looking inside neural networks, what kind of geometry is associated to that; maybe there's a kind of generic answer. Any quick questions about any of the earlier examples? I posted the papers on the Discord, so I encourage you to take a look at those. We've talked earlier in the seminar