Chris Olah - Looking Inside Neural Networks with Mechanistic Interpretability

Transcript

[00:00:00] Presenter: Welcome, Chris.

[00:00:02] Chris Olah: Thank you. So, it's really a privilege to be here speaking to you today. And before I came, I spent some time sort of asking myself, what would it really be useful and valuable and interesting for me to say to you today? Because it... Why did it go back to... Wait, this is totally not what I am presenting. I mean, this is my slides, but it's not on the...

[00:00:37] Okay, there we go. We are on the right. No, no. There is a great delay between... So, yeah. So, I was really wondering, you know, what would it be valuable and useful for me to talk to you about? And it felt to me like one thing that was very natural is that I could talk a little bit about what mechanistic interpretability is about. Because, you know, there are a lot of different kinds of interpretability work going on, and for those of you who are less familiar, I could try to give you a bit of a flavor, and that might be useful.

[00:01:05] But it also seemed to me, and this is a little bit more awkward to talk about, that it might be useful to talk about, you know, the extent to which this work is bullshit. Because, you know, I was very struck a few years ago. I had a colleague who I'd worked with for many years, and one day, they just matter-of-factly told me that, you know, obviously, all interpretability research is bullshit. And that really stuck with me. Because it seemed to me that, you know, it was probably unusual for somebody to say such a thing so bluntly and directly, but it might be the case that quite a few people believe it. And to the extent that you believe that, I can't say anything useful in this talk, because you will not believe it.

[00:01:47] Unfortunately, I think this is going to be difficult for me to address in a talk format. But it turns out that I'm going to be doing some kind of sort of office hour-like thing later. And so, to the extent that you sort of are skeptical of this kind of work, I would not be insulted. I would be privileged and delighted if you were willing to come and sort of trust me to talk about your doubts and see if we can make some progress towards the truth together.

[00:02:10] The third thing I wanted to talk about is a challenge we call superposition, which increasingly seems to me like the question on which the impact of mechanistic interpretability for safety is going to rise or fall. And I want to highlight it because it seems like such a central and important question to me, but I also want to highlight it to you because I believe it is a question that might be worth your attention. Not only because of its importance, but because I believe it's a question that is very amenable to research without access to lots of compute, and which also has very rich mathematical structure.

[00:02:44] So, I will talk about some other things. I'll talk about some interesting fun results at some points. But these are really the three things that I felt like I could say that might be useful and important to you.

[00:02:55] Okay. So, we'll start with a little bit of an introduction to what mechanistic interpretability is.

[00:03:03] There's like a 30 second delay. So, in regular software engineering, you know, we have planning documents, or we might write some goals, or have some kind of design doc for our software, and then we write some source code, and then we compile that source code into a compiled binary.

[00:03:22] But for neural networks, you know, we have a training objective that we're optimizing. That's our goal. And then we sort of directly turn that into neural network parameters, into a trained neural network, and there was never anything like source code in the middle. And so, really the goal of mechanistic interpretability is to somehow take those neural network parameters and turn them into something like source code.

[00:03:44] And I think this is actually a pretty deep analogy. So, you know, we can think about this in lots of ways. Computer programs have variables. I think in a neural network, that's roughly analogous to something like a neuron or a linear combination of neurons that represent a feature. A computer program has a state. I think that's kind of analogous to the activations of a neural network layer. A computer program has a processor or a VM that it runs on. I think that's the neural network architecture. And then we have at the end this binary, and that's kind of analogous to the neural network parameters. But the thing that a computer program typically does have that a neural network does not is the source code. And so, that's what we would like to get.

[00:04:25] I think an interesting point here is that there's middle ground between trivial and impossible. So, I feel like when we're talking about interpretability, we often, I hear people either sort of treat it like a thing that should be trivial and easy, or a thing that's going to be impossible.

[00:04:40] But it seems to me that that's sort of a strange dichotomy. And it, in fact, feels to me, again, this is just it's like, oh, come on. Update slide, please. Please change. But it's on the right thing on the laptop. The laptop is going and switching. It's just not showing the right slide on the, yeah.

[00:05:06] Well, I guess I will try to talk without my slides. It seems to me that it might be the case that interpretability is merely very hard but not impossible, and that you can have something between trivial and impossible. So, I often imagine that any interesting real neural network is probably harder to understand or, here we go, or reverse engineer than, say, a very complicated program like the Linux kernel. Like if I was trying to reverse engineer the Linux kernel without knowing anything about it, I mean, I don't know much about reverse engineering normal software, but that seems like it would be a very hard challenge. And I suspect that when I talk about trying to reverse engineer a neural network, we're talking about a very significant challenge of that kind, and that it might be possible. Similarly, no one says, oh, biology, it's not easy, therefore it must be impossible for us to understand cells. There's room for something to be merely extremely difficult in between.

[00:06:05] And that being the case, we'd like something that could maybe make it a little bit easier. And so, I think the goal of trying to understand an entire neural network is very hard. But we might be able to understand just portions of neural networks and then gradually grow that little portion. So, we might be able to slowly and rigorously grow out from understanding tiny little portions to understanding larger chunks.

[00:06:27] So, we could try to make this a little bit more concrete by talking about a particular kind of neural network. And I think that ConvNets are maybe a good place to start. I was very fortunate for a number of years to work here with our hosts, leading the interpretability team here at the time, and we made a lot of progress on understanding ConvNets.

[00:06:51] And I think a way that I like to think about this is in terms of three kinds of objects. So, we have features, which are, say, the easiest way to think about this would be a neuron. It might be something like a car detector, or a curve detector, or an edge detector, or a floppy ear detector, some kind of articulable property of the input. Weights, which connect together features. And then circuits, which are the combinations of weights and features together.

[00:07:22] So, as an example, it turns out that in InceptionV1, there are all these curve detectors. In fact, these exist in every vision model we've looked at; they look for just curves, and it turns out these hold up to extremely rigorous investigation. In fact, we wrote two entire papers just on curve detectors. And you can do all kinds of things to test that these really are curve detectors, up to the point of going and rewriting the entire circuit from scratch, from understanding it and re-implementing it.

[00:07:52] And so, that's very interesting because that gives us a way to think about one unit inside things. Now, another thing we can ask about are weights. And remember that in a ConvNet, the weights are going to be a grid because we have to talk about relative positions. Does it excite something at this offset or that offset?

[00:08:06] But if we just look at the weights in isolation, they're not that interesting, right? We have the input neuron and the output neuron, but yeah, the weights by themselves, they don't say that much. But if we then go and connect them to features that we understand, suddenly that becomes very different.

[00:08:21] So, here we have a dog head facing to the right with a long snout. And on the other side, it connects to a dog head plus neck detector. And you can see that now all of a sudden the weights make a lot of sense. We're attaching, we're having the dog head detector go and excite the dog head plus neck detector, if it's on the right side of the dog where the head should be. So, that's very interesting. You can put it on the other slide.
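To make the idea of contextualizing weights concrete, here is a minimal sketch in PyTorch of what reading off such a connection looks like. The model is only a stand-in for InceptionV1, and the layer and channel indices are placeholders rather than the actual dog-head units discussed above.

```python
# A minimal sketch of contextualizing a weight: once two channels have been
# identified (say, a dog-head detector feeding a head-plus-neck detector), the
# spatial pattern connecting them is just a slice of the conv kernel. The model
# stands in for InceptionV1 and the indices are placeholders, not the real units.
import torchvision.models as models

model = models.googlenet(weights="DEFAULT").eval()

conv = model.inception4d.branch2[1].conv    # a 3x3 conv inside one Inception block
dog_head_in, head_plus_neck_out = 123, 45   # hypothetical channel indices

# (kH, kW): where, spatially, the "dog head" channel excites or inhibits the
# "head plus neck" channel -- positive weights where the head should sit.
kernel = conv.weight[head_plus_neck_out, dog_head_in].detach()
print(kernel)
```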

[00:08:46] And it turns out that once you contextualize weights like this, neural network weights are full of structure. So, we have dog heads being attached to necks. We have pose-invariant dog head detectors being constructed from dog heads facing in different orientations. They converge so that the snout is in the same place. We have a car detector that goes and looks for windows on the top and wheels on the bottom. Yes?

[00:09:20] Audience member: [Inaudible]

[00:09:23] Chris Olah: So, these are all feature visualizations that are created by optimizing the input. So, you take random noise and you optimize it to cause the neuron to fire. The claim here isn't that that is decisive evidence about what a neuron is doing; it's more like a variable name that we can put there. Because saying this is 4c:447 isn't very useful. I mean, I've spent so long with this model that I know what 4c:447 is, but you wouldn't. And so, having a variable name that's very suggestive of what it does, and a little bit of evidence, I think there are important ways in which feature visualization provides evidence for what a neuron is doing, is helpful for representing these circuits. Of course, if you really want to be confident, you want to look at all kinds of things and do the kind of detailed investigation that I described for the curve detectors.
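As a rough sketch of the input-optimization procedure described here, assuming PyTorch and torchvision: start from noise and ascend the gradient of one channel's activation. The published feature visualizations also rely on careful image parameterizations, transformations, and regularization, which are omitted, and the layer and channel below are only illustrative.

```python
# A minimal sketch of feature visualization by input optimization. The model is
# a stand-in for InceptionV1, and the layer/channel choice is a placeholder;
# real feature visualizations add image parameterizations, transformations,
# and regularization that are omitted here.
import torch
import torchvision.models as models

model = models.googlenet(weights="DEFAULT").eval()

activations = {}
model.inception4c.register_forward_hook(
    lambda module, inputs, output: activations.update(target=output)
)

img = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from random noise
opt = torch.optim.Adam([img], lr=0.05)
channel = 447  # placeholder unit index

for step in range(256):
    opt.zero_grad()
    model(img)
    # Ascend the gradient of the channel's mean activation (negated for Adam).
    loss = -activations["target"][0, channel].mean()
    loss.backward()
    opt.step()
```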

[00:10:12] We have high-low frequency detectors. And here, these units look for high frequency patterns on one side of the receptive field and low frequency patterns on the other. And they do that by piecing together a bunch of neurons in the previous layer that represent high frequency, and a bunch that represent low frequency, and as the high-low frequency detectors rotate, so do the weights. So, there's actually just this beautiful structure. Here, we have a bunch of color contrast detectors being assembled into center-surround detectors. We have a more interesting circuit that does black and white versus color detection, constructing black and white versus color detectors and assembling them into, again, these kind of center-surround units.

[00:10:58] Curve detectors are built from earlier curve detectors. And you can see how they're assembled: the earlier curve detectors excite the later curve detectors when they match the orientation and inhibit them when they have the opposite orientation. And again, the weights rotate as the features rotate.

[00:11:13] Here, we have a triangle detector being assembled from edge detectors. We have, wait for it, a small circle detector being assembled from very early curve detectors. We have a whole bunch of color contrast detectors being used to create lines.

[00:11:31] And the thing that I want you to take away from all of this is these are just a handful of examples. Neural network weights, once you start to contextualize them, are full of structure. Also, lots of things we don't understand, but lots of structure. Sophisticated boundary detectors being constructed from all kinds of different cues. So, all kinds of things.

[00:11:52] Okay, so that is the basic picture. Now, unfortunately, yes. Sure.

[00:12:05] Audience member: [Inaudible]

[00:12:21] Chris Olah: Sure. Yeah, yeah. This is great. So, my response would be that these visualizations are just variable names. The actual thing that's analogous to i=1, or something like this, is the weights. The weights are the assembly instructions or the code of the computer program. And so, here, we're defining a car detector in terms of a window detector and a wheel detector and the car body detector.

[00:13:07] And then you could ask, okay, how do we trust those? Then we need to go back another step and so on. Then it starts to look a lot like a computer program where we have an understanding where we can base it on our understanding of previous variables. And if we want to go and understand those and really carefully understand those, we have to go back further and further and further, and we can trace things back all the way to the input if we want. It's a lot of work. But, yeah, I think that's the answer.

[00:13:27] So, these are just functioning as variable names. They're useful variable names. They do provide certain kinds of causal evidence.

[00:13:34] But I see that we're getting a lot of raised hands. I'm kind of tempted to ask to hold questions maybe until the end, because I do want to cover a lot of ground and I'm a little nervous that we won't get through it. And again, I'll also be available for sort of office hour type things to go and answer any questions you have that I don't address during this talk.

[00:13:50] So, I want to now talk about superposition. If the slides will update, which they might not. And you see, the picture that I painted for you so far is a bit overly optimistic in an important way, which is that neural networks are also full of what we call polysemantic neurons, neurons that respond to many unrelated features.

[00:14:12] So, in some ways it's miraculous that when neural networks have a privileged basis, when they have neurons with an activation function, you get lots of neurons that just seem to correspond to meaningful features. But you also get many that don't. And one hypothesis for why this is, is that the features are in superposition.

[00:14:30] The model wants to represent more features than it has neurons, possibly many more features than it has neurons. And as soon as you have that, of course, you can't align all the features with neurons because there's only n neurons and you want to represent more things. And so, this is a very frustrating situation.

[00:14:46] And in fact, the picture gets kind of crazier, where if you take that really seriously, what it starts to suggest is that the model is actually sort of simulating a larger neural network. There is some larger, sparser neural network, and then it gets projected down and folded on top of itself to go and create the network that we actually observe. And then, of course, the neurons don't correspond to features, they correspond to weird linear combinations of features.

[00:15:11] And so, it's hard to really directly study this in real models, but it turns out we can show that this is exactly what happens in toy models. And then there's suggestive evidence that this happens in large models.

[00:15:24] So, it turns out that the essential thing is the sparsity of the features. If you have a bunch of features of varying importance (you can see a detailed description of this experiment in the toy models paper), and you start at zero percent sparsity, where the features are just completely dense, you only get as many features as you have dimensions.

[00:15:43] But as you make the features sparser, then because they're probably not going to go and co-occur and interfere with each other, the model starts to be able to go and pack more things in. And eventually, you end up with five features in a two-dimensional space.
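A compact sketch of the kind of toy-model experiment being described, with placeholder hyperparameters: a handful of sparse features of decaying importance are squeezed through a two-dimensional bottleneck and reconstructed through a ReLU. At low sparsity the model keeps only the most important two features; at high sparsity it packs in more.

```python
# A tiny sketch of the toy superposition setup: n_features sparse features of
# decaying importance, an n_hidden-dimensional bottleneck, and a ReLU
# reconstruction. Hyperparameters are placeholders, not the paper's exact ones.
import torch

n_features, n_hidden = 5, 2
sparsity = 0.9                                        # P(feature is zero)
importance = 0.9 ** torch.arange(n_features).float()  # decaying importance

W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    x = torch.rand(1024, n_features)                   # features in [0, 1]
    x = x * (torch.rand(1024, n_features) > sparsity)  # most are switched off
    h = x @ W.T                                        # project down to 2 dims
    x_hat = torch.relu(h @ W + b)                      # reconstruct all 5
    loss = (importance * (x - x_hat) ** 2).mean()      # importance-weighted error
    opt.zero_grad()
    loss.backward()
    opt.step()

# At high sparsity, the columns of W typically spread out (e.g. into a pentagon),
# i.e. five features represented in a two-dimensional space.
```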

[00:15:56] Turns out you can actually do computation as well in superposition. So, you can actually, in some sense, have a computational graph that's higher dimensional, project it down, and actually do useful computation while holding everything in superposition.

[00:16:07] So, this makes it very hard, when this is true, for us to go and understand things. And then what you start to think when you observe these neurons that are sort of monosemantic, is that those are particularly important and common features. And then the things that we don't observe, the less important or sparser things, are in superposition.

[00:16:26] So, I think this is a really fundamental challenge to this kind of work. Unless you understand the superposition structure, there will always be the potential for unknown behaviors to just sort of suddenly occur. And this is something that I'm very worried about. You won't be able to understand what the weights are really doing.

[00:16:43] So, this seems to me like sort of the challenge for mechanistic interpretability right now. And I think that a lot of our success for doing useful work is going to rise and fall on whether we can make progress on that problem.

[00:16:57] And I think it probably holds for a lot of other approaches as well. I don't mean to say that nothing can be done without solving it; I think there are other valuable things one can do within mechanistic interpretability without this. But I think that we will surrender a great deal if it turns out that we can't go and address this somehow.

[00:17:14] It also turns out though that superposition is full of beautiful mathematical structure. It is deeply connected to compressed sensing. It turns out that in toy models at least, these features organize themselves into polyhedra, into regular polyhedra, which is kind of wild, or uniform polyhedra. It's very not obvious that they should do this. It turns out that the learning dynamics involve these weird electron jumping-like behaviors that are very strange. It's very mysterious.

[00:17:39] I see that there is a question. If it... Okay.

[00:17:53] Audience member: [Inaudible]

[00:17:57] Chris Olah: Yeah. So I actually think that defining what we mean by a feature is extremely hard, and I don't have a super compelling definition of what a feature is. But intuitively what we mean is something like a curve detector or a car detector or things like this. And sometimes those end up corresponding to neurons, but sometimes we can show that they correspond to linear combinations of neurons. So we can construct toy problems where we do know what the features are and where that's exactly defined, and then the model will go and represent them in this way.

[00:18:27] So I think the toy model is probably the best explanation I can give of this in a lot of ways. Because in that case, we know exactly what a feature is and it does just go and play out this way. I think what a feature is sort of in a completely general model is harder to go in and say. Especially, like one definition you could give is that a feature is sort of a human-understandable property of the input that the network represents. But I think that that kind of... I don't want humans to be sort of involved in the definition.

[00:18:52] Another... The thing that I sort of in my heart believe a feature is, but I don't know how to properly define this, is something like a feature is like a fundamental unit of the neural network's computation. And those don't necessarily correspond to neurons. They sometimes do, and they often appear not to. Often it seems like they instead are represented by these linear combinations of neurons.

[00:19:10] But I think the toy models paper where we just set up the problem such that there is an obvious thing that is a feature, and then you get this behavior, is the sort of the best demonstration I can give of how those things can decouple.

[00:19:24] Okay. So I wanted to talk briefly about how transformers are different. And I'm only going to talk about this for a little bit because this actually becomes quite messy and ends up with a lot of detailed mathematics. Which I think it's actually very cool mathematics, but it is a little bit harder to go through in a talk.

[00:19:37] But I'd highlight two things in particular that make transformers very different. Well, maybe there's a third one, which is just that you see a lot of superposition in transformers. You see a lot more than you do in vision models. But I think there's two very deep architectural differences.

[00:19:51] So one is that we have a residual stream. And it's even a little bit different from a ResNet, where in a trans, well, I'll talk about this more in a second. And the other is attention heads. Oh, no, apparently I'm not going to talk about that more in a second.

[00:20:03] So the residual stream, just because you directly add and sort of project in and out of it for all of your layers, there's sort of these implicit latent weights connecting neurons across layers, or attention heads across layers, or all these things. And so that creates a lot of interesting structure that doesn't exist in quite the same way in any ConvNets really. Because even ResNets aren't exactly like this type of thing.
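A minimal sketch of the implicit-latent-weights idea, with made-up dimensions and ignoring layer norm and everything the intervening layers add or remove: because each component writes to and reads from the residual stream linearly, composing a later layer's read matrix with an earlier layer's write matrix gives effective weights connecting units across layers.

```python
# A sketch of "virtual weights" across the residual stream, with made-up shapes.
# Each layer writes to the stream with a linear map and later layers read from
# it with a linear map, so composing the two gives effective cross-layer weights
# (ignoring layer norm and whatever the intermediate layers add or remove).
import torch

d_model, d_mlp = 512, 2048

W_out_layer1 = torch.randn(d_model, d_mlp) / d_mlp ** 0.5   # layer-1 MLP write: neurons -> stream
W_in_layer5 = torch.randn(d_mlp, d_model) / d_model ** 0.5  # layer-5 MLP read: stream -> neurons

# How strongly each layer-1 neuron excites or inhibits each layer-5 neuron
# through the residual stream.
virtual_weights = W_in_layer5 @ W_out_layer1
print(virtual_weights.shape)  # torch.Size([2048, 2048])
```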

[00:20:32] There's also attention heads, which we can talk about more. And attention heads kind of create a new sort of fundamental unit of mechanistic interpretability, similar to features, weights, or activations. You might think of these as attentional features. And I won't talk about this too much, but there's lots of very rich and interesting attentional features. And this is one that I find particularly interesting. If you look at [inaudible], these weird off-diagonals, they're very striking if you look at attention patterns.

[00:21:05] And it turns out those correspond to a particular type of attention head called an induction head, which searches through for previous cases where something happens, and then looks forward one step and then goes and copies that and goes and outputs that. And this allows the model to sort of do in-context learning of a kind.
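A purely algorithmic caricature of that behavior (real induction heads implement it softly, with attention, inside the model) looks roughly like this:

```python
# A hard-coded caricature of an induction head: find a previous occurrence of
# the current token, look one step forward, and copy that token as the prediction.
def induction_predict(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards through the context
        if tokens[i] == current:
            return tokens[i + 1]              # copy whatever followed it last time
    return None                               # no earlier occurrence to copy from

# "... Mr D urs ley ... Mr D" -> the head pushes "urs" as the next token.
print(induction_predict(["Mr", "D", "urs", "ley", "said", "Mr", "D"]))  # urs
```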

[00:21:23] And in fact, it turns out that sort of a generalized version of this seems to be the driver of in-context learning, or at least there's a non-trivial amount of evidence for that hypothesis.

[00:21:32] In fact, these are so important that they create a bump in the loss curve when you train transformers. There's just a visible bump in every transformer loss curve that I've seen, if you have more than two layers, because that's what's necessary for induction heads to form. They also cause a deviation in the scaling laws. So if you look at the original scaling laws paper, you'll see that there's a point where there's sort of a divergence from the original trend, and that corresponds to the models not having induction heads and then forming them.

[00:21:57] So, and this is kind of interesting, these induction heads are so important that they appear in the macroscopic picture. We have this microscopic picture that we're developing here, and it bridges all the way to the macroscopic.

[00:22:09] Okay. Now, so far I've just been talking about interpretability sort of broadly, but it's worth maybe briefly painting a bit of a picture for how this might connect to safety. Yes, Joshua?

[00:22:20] Joshua: [Inaudible]

[00:22:21] Chris Olah: Oh yeah.

[00:22:22] Joshua: [Inaudible]

[00:22:28] Chris Olah: It does. Yeah. A massive number of induction heads simultaneously occur. There seems to be a deep reason for this. Basically, there's a number of pieces that have to exist for an induction head circuit to form, and once you start to have those ingredients, you then get an extremely fast feedback loop, and then all of them form simultaneously because the ingredients are in place.

[00:22:53] And they perhaps evolve a bit further from there, but there's this sudden place where, if you just track the existence of induction heads in a transformer over time, they will just sort of all form in a very small window. And maybe they evolve further, or maybe you'll still get a couple that form afterwards, but the vast majority of them form at once. And so-

[00:23:11] Audience member: [Inaudible]

[00:23:13] Chris Olah: Yeah, we have a slightly similar experiment, where you can go and set up an architecture that very easily learns induction heads, and then induction heads form right at the start. So you can, I haven't quite visualized the loss curve, but it's like right at the beginning. So it'd be in the extremely steep regime and you won't be able to see it as clearly, but it's like, it would happen like right here.

[00:23:32] And then you don't see an analogous, I guess we can, we can check this, but I, yeah, you shouldn't see an analogous bump in the loss curve.

[00:23:42] Okay. So I want to briefly talk about how... so, there are lots of ways in which interpretability might contribute to safety. I just want to talk about one that seems to me like a particularly important one, and one that I find very motivating.

[00:23:59] And so it seems to me like the thing that we most want out of safety is something like the ability to make statements of a kind, you know, something vaguely like, for all the situations the model could be in, the model will never deliberately do X. We want to be able to say something like that.

[00:24:15] And my present best guess as to how you could achieve that is that you want to be able to say something like, okay, there don't exist features, let's say, that will participate in the model deliberately doing X. And that's going to then be some kind of claim about the circuits that feature participates in.

[00:24:30] So this is a very ambitious and wild and kind of crazy thing to be aiming for. I don't mean to say that we are anywhere near being able to go and make this kind of statement, but this is kind of a spiritual North Star, I think, for the most ambitious way that mechanistic interpretability could help with safety.

[00:24:47] And I think there are two major challenges to this working. The first is superposition. How can we actually access the features? How can we know that we've got everything? How can we rule out the possibility that the model has a feature that activates very rarely, and it's just some direction in activation space where it behaves totally differently? How can we address that?

[00:25:04] And then separately, scalability. These models are so large, and the successes of mechanistic interpretability so far are so small, these tiny little circuits that we're understanding. How can we hope to get to these large models?

[00:25:19] So I am very worried about superposition. I am more optimistic about scalability.

[00:25:27] I think that there are a bunch of really good ideas for approaching scalability. These include trying to use AI to automate things, trying to exploit modularity or graph structure in models, and I think there might also be sort of interesting motifs, we've seen this in vision models, that can massively simplify models once you understand them, if you start to understand certain symmetries.

[00:25:49] Unfortunately, I think this is a really challenging problem to work on right now, because we don't actually have access to the units that we want to understand. You know, if we think the features are there, and they're all in superposition, they're a giant mess, and it's very difficult to start working on how you can actually really effectively scale.

[00:26:06] But I do feel pretty optimistic about this. And my real fear is this challenge of superposition.

[00:26:14] Okay. I guess I'm getting to the end of my allotted time, so I'm right on schedule. I wanted to briefly just talk about one more brief thing, which is so far, I've talked about interpretability, and I sort of talked about why you might care about it for safety.

[00:26:28] But I feel like at an emotional level, something I also find really motivating isn't just this goal of safety, but this belief that I have that neural networks, if we take them seriously as objects of investigation, are full of beautiful structure.

[00:26:43] And I sometimes think that actually maybe, you know, the elegance of machine learning is in some ways more like the elegance of biology, or at least as much like the elegance of biology as like that of math or physics.

[00:26:55] So, in the case of biology, evolution creates awe-inspiring complexity in nature. You know, we have this system that goes and generates all this beautiful structure.

[00:27:13] And in a kind of similar way, gradient descent, it seems to me, creates beautiful, mind-boggling structure. It goes and it creates all these beautiful circuits that have beautiful symmetries in them. And in toy models, it arranges features into regular polyhedra. And there's all of this, just like, it's just sort of too good to be true in some ways, it's full of all this amazing stuff. And then there's messy stuff, but then you discover sometimes that messy stuff actually was just a really beautiful thing that you didn't understand.

[00:27:42] And so a belief that I have, that is only perhaps semi-rational, is that these models are just full of beautiful structure if we're just willing to put in the effort to go and find it. And I think that's the thing that I find actually most emotionally motivating about this work.

[00:28:02] Yeah. So thank you all for your time. I want to emphasize that all this work was done in collaboration with many others, including others at different institutions. I'm at Anthropic. And if you're interested in this work, you should look at the original Circuits thread, which made a really serious attempt to reverse engineer one vision model, InceptionV1; the Transformer Circuits thread, which has been trying to do sort of analogous work on transformers; and Neel Nanda has this really nice annotated reading list, which actually might be the best jumping-off point.

[00:28:36] So thank you. Great. Yeah. Yeah.

[00:28:55] Audience member: [Inaudible]

[00:29:09] Chris Olah: Yeah. So my answer would be, my thinking on this mostly comes from this paper that I was lucky to work on with a number of people at OpenAI. So you might be familiar that the original [inaudible] 2012 paper had this really striking phenomenon, where you look at the first layer of the model and you see that there are all of these black and white Gabor filters on one side and color contrast detectors on the other. The model is set up such that you have these sort of two GPUs that only talk to each other every other step. And it turns out if you look at the next layer, you see similar things.

[00:29:57] And it turns out this is actually a very general phenomenon. Lots of vision models have branches, and InceptionV1 has branches. And it turns out that this branch over here has all of the black and white versus color and color contrast units, this one's full of curve detectors and some sort of fur and eye related detectors and some boundary detectors, and this one's full of 3D geometry detectors.

[00:30:20] And it's just, like, definitely organizing things like this. So one thing you could say is, okay, well, we could create branches like that. But then it actually turns out that if you look at other models and you just visualize which features, say, are in the first layer- oh, I think this is my timer to remind me for the, I didn't even know that it made a sound when I ran out- it turns out that the structure is implicitly there. You can just take the magnitude of the connections between features in one layer and features in another layer, and do a singular value decomposition. And you discover that actually all the same structure was implicitly there. And so I think this kind of thing is just there if we look for it, and we can probably just pull it out.
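A rough sketch of that analysis, with a random tensor standing in for a real conv layer's weights: collapse the spatial dimensions into a feature-to-feature connection-strength matrix, then look at its singular value decomposition.

```python
# A rough sketch of the analysis: measure connection strength between features
# in adjacent layers, then look for low-rank (branch-like) structure with an SVD.
# The weight tensor here is random, standing in for a real conv layer's weights.
import numpy as np

W = np.random.randn(192, 64, 3, 3)          # (out_channels, in_channels, kH, kW)

# Connection strength between each input and output feature, collapsing space.
strength = np.linalg.norm(W, axis=(2, 3))   # (out_channels, in_channels)

# Large leading singular values / vectors suggest groups of features that tend
# to connect to the same things, i.e. implicit "branches".
U, S, Vt = np.linalg.svd(strength, full_matrices=False)
print(S[:5])
```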

[00:31:04] In practice, this becomes hard in the later layers of vision models, where you don't see this as much. And in language models, we don't see it. But I think that's because of superposition. I think the best case for superposition is to go and fold two completely unrelated computational graphs on top of each other. And so my guess, and in fact one of the things that I'm sort of most optimistic about, is that if you can solve superposition, you'll discover that there's just lots of implicit modularity and other kinds of organizational structure to these graphs that we can just see. Because I think the early layers of vision models are kind of the best proxy I have for a model that has very little superposition. And that seems to be the case there.

[00:31:45] Yes?

[00:32:43] Audience member: [inaudible]

[00:32:55] Chris Olah: That's a great question. So earlier you might have heard me sort of dissembling on, or being confused about, what is a feature. And I think this is kind of the question here, where I feel like there's a thing that I'm trying to point at, and I don't know exactly what it means.

[00:33:11] And actually, you know, you might be like, oh, that's kind of wild, right? Like Chris is spending his whole career on this thing, and he doesn't even know what the definition he's talking about is. But actually, I think that's often very productive. There's this book by Lakatos, Proofs and Refutations. It's just this play about a bunch of math students going and being confused about definitions and progressively finding more and more useful definitions for things.

[00:33:32] But, yeah, like I don't want to define features in terms of human-understandable concepts, because I want to allow for the possibility that these models have concepts that we don't understand. So what I'm really looking for, in terms of sort of reverse engineering these things, is in some sense pulling apart the fundamental units of computation, whatever they are.

[00:33:53] So far, in the cases where we've done this, they have been things that we've understood, and we've understood the algorithms that drive them. And when I say turning them into code, I don't so much mean turning them into like Python code or something like this. The weights are the code.

[00:34:06] Like for the car detector that I showed you earlier, there's no simpler explanation than just showing you, you know, there's a window detector, there's a wheel detector. It wants to see the windows at the top. It wants to see the wheels at the bottom. That is the most understandable depiction of it that there is.

[00:34:24] But you couldn't have understood it if I didn't contextualize it, right? So there's this difficult work of going and contextualizing what these things are, going and undoing this problem of superposition, to the extent that it's there, and then very painstakingly trying to go through it. And when we find concepts that we don't understand initially, like the high-low frequency detectors in vision models, we initially didn't understand them. In retrospect, they're super simple. All they do is look for something that's high frequency on one side and low frequency on the other. And that occurs all the time at boundaries of objects, because something is out of focus in the background, or, you know, the background is busy and I'm a solid color, or something like this. That's one answer.

[00:35:07] One other thing that I really want to say here is that I think we have an enormous advantage over neuroscience. In fact, I wrote a whole article about all the things that make my life so much better than neuroscientists'.

[00:35:32] Audience member: [Inaudible]

[00:35:34] Chris Olah: But just for people who haven't thought about this, I just want to enumerate some of the advantages we have. So we have access to the ground truth of what every neuron's computational structure is. We can get access to the activations of every neuron in a non-destructive way, and collect them for as many stimuli as we want. We have access to the ground truth connectome. We know how every neuron connects together. In fact, not only do we know the connectome, we know the weights of the connectome. It's not just which neurons are connected to each other, but we know which ones excite and inhibit each other. We can go and take gradients through the structure and optimize through it. We can go and work with the same model again and again, rather than having to go and switch between models which might be different. And when I study a model, I can give that model potentially to you or to one of my colleagues, and we can go and ask questions of exactly the same model and talk about the same neurons.

[00:36:31] The experiment cycle is so much faster. There are so many things. So it's just this truly enormous difference.

[00:36:51] Yeah. So I think the fundamental thing is, my present theory for thinking about this is, there's two things. There's the feature importance curve: imagine you order all the features that the model could learn in terms of their importance, which is something like how much each one reduces the loss, and look at that curve. The flatter that curve is, the more superposition you're going to have. The steeper it falls off, the less superposition you're going to have.

[00:37:17] And the second thing is the sparsity. And I think that on both of these, language models go the wrong direction. So on sparsity, vision models, when you start with something, especially with early vision, right, you're dealing with things like an edge detector. If you look at this thing behind me and you ask how many vertical edges are there in just this single image, you know, there are a lot of vertical edges.

[00:37:39] On the other hand, if you're like, you know, like a lot of these language models know who I am. How often do I come up in text? One in a hundred million tokens? One in a billion tokens? I don't know. Like, I don't think I'm very common. Those actually feel like they're probably too small a number, or something like this.

[00:37:57] So there's just a massive, massive discrepancy between the frequency and therefore the sparsity of features in early vision and the sort of linguistic features. And because you start with word embeddings and then you very quickly go to more sophisticated things, language models very rapidly go to denser features. You do find monosemantic neurons in language models and they all tend to be things that are very common. So you're like, oh, you know, like common constructions of multiple words or things like this - 'would not', for instance, or things like this.

[00:38:31] And then I think also just, the space of concepts you want a language model to represent is just so much bigger than like an ImageNet model or something like this. And so you just have this much flatter, longer feature importance curve.

[00:38:43] Audience member: [Inaudible]

[00:38:46] Chris Olah: I don't think so. So, at OpenAI, we built this tool, Microscope. We have a similar internal thing for language models, Lexiscope. It shows you all the data set examples. One really cool thing that you can do in language models, that you can't do in image models, is you can just edit the text and see how the neuron activations change in real time. That's a thing that I'm very grateful for. And so actually, when you have interpretable language neurons, I don't think that they're harder to understand or recognize. Maybe it takes a little bit more time, because you can maybe more quickly visually go through a bunch of visual data set examples than language ones, but I think not in a very significant way.
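The internal tool isn't public, but the same kind of interaction can be sketched with an open-source model via the Hugging Face transformers library; the model, layer, and neuron index below are arbitrary choices for illustration.

```python
# A minimal sketch of "edit the text and watch a neuron": hook one MLP neuron
# in GPT-2 via Hugging Face transformers and compare its activations on two
# variants of a prompt. The layer and neuron index are arbitrary.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

acts = {}
layer, neuron = 5, 1234
model.transformer.h[layer].mlp.c_fc.register_forward_hook(
    lambda module, inputs, output: acts.update(out=output.detach())
)

def neuron_trace(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        model(**ids)
    return acts["out"][0, :, neuron]  # the neuron's activation at every position

print(neuron_trace("The cat would not sit down."))
print(neuron_trace("The cat did not sit down."))  # edit the text, re-inspect
```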

[00:39:26] So I think that's not the primary thing going on.

[00:39:30] I believe I'm maybe out of time. One more, one more. Okay. Remember, I will be available at this kind of office hours thing and also just like very happy to chat with people all day. Yes.

[00:39:42] Audience member: [Inaudible]

[00:39:46] Chris Olah: Yeah, I really don't know. So one thing that's interesting is that superposition sort of depends on the gap between the sparsity of the neurons and the sparsity of the fundamental underlying features. And so if you think that those features are extremely sparse, like they occur, you know, one in a million tokens, one in 10 million tokens, then there's still an enormous, enormous sparsity gap.

[00:40:27] I think the more interesting or fundamental thing is something like how do you think the feature importance curve changes? And I don't know, like, well, yeah, perhaps this is something to talk about more offline. I think the basic answer is, I have no idea. And I can, like, give you a bunch of conceptual models for trying to reason about this, but I still then have no idea.

[00:40:49] Thank you so much for all of your time.

[00:40:51] Presenter: Thank you, Chris. Amazing.