Jacob Steinhardt - Aligning Massive Models: Current and Future Challenges


[00:00:00] Jacob Steinhardt: Thanks everyone. That's going to be a hard talk to follow, but I'll do my best. So I'm going to be working to try to bridge some of the disconnect that Ilya was talking about. So trying to draw a line between behaviors and failures that we see in current systems to some of the failures that people talk about in future systems.

[00:00:27] So in this talk, I'm going to be focusing specifically on something that I'll call intent alignment. I think this phrase was actually first coined by Paul Christiano. This is a sort of subset of alignment where the goal is just to get a system to conform to the intended goals of the designer, and it's a problem that shows up in many domains. And why is this non-trivial? One reason is that the intents of the designer are often difficult to specify. They include difficult-to-formalize concepts like honesty, fairness, polarization.

[00:01:07] Yes, Sam? OK, is this better? Cool. I could also just talk louder, but I don't know if that's a problem. Yes. Does this work? Yes? Oh. All right. Good. Oh, does this also work? Yeah, it does. OK, great. Wow, I have so much AV now. This is so much better.

[00:01:35] OK, so we have these difficult-to-formalize concepts. But the other thing is that what we care about is often implicit. There might be some mainline intent we have for the system, but we also need it to accomplish goals like not breaking the law and not harming people. And in open-domain settings, it becomes hard to even enumerate all of these desiderata; you might not realize that something is behaving other than intended until you actually deploy it. And we'd ideally like to avoid that.

[00:02:08] But I think there's also some more subtle reasons why alignment can be hard. So to highlight this, I want to start with a kind of small scale example, but that already illustrates some of the issues. So this is an example of a traffic simulator. So this is an actual kind of RL traffic environment that is used by civil engineers.

[00:02:31] So how does it work? You have this highway system. There's a bunch of cars that are operated by human drivers. And then there's a subset of cars, in this case the one depicted in red, that is a self-driving car controlled by some RL agent. And the RL agent is pro-social: its goal is to drive in a way that maximizes the efficiency of all cars on the road. So intuitively, you can think of it as trying to herd all of the other cars into an efficient traffic flow instead of having stop-and-go traffic.

[00:03:05] So if you do this, I guess the default reward in the simulator is actually to maximize the mean velocity. And if you train an RL model to do this with a small neural network, it does what you'd expect. It tries to time the merge onto the highway so that people don't have to slow down, and there's a kind of fixed velocity and things are smooth. Although it's not great, because it's a small network. So you make it a bit bigger, and it learns to do this much more smoothly.

[00:03:38] But then at some point, there's this phase transition for larger networks where you get a kind of weird behavior, which is that the red car actually just learns to block the on-ramp. Why does it block the on-ramp? Then no other cars can get onto the highway. The velocity of the cars that are on the highway is very fast. There's this one car that's velocity 0 on the on-ramp, but that's not a big deal in the average. So this is actually the argmax for the mean velocity. But this is clearly not what you wanted. So maybe a better reward would have been to, say, minimize the mean commute time so that you don't have anyone that's completely stopped. But you didn't realize this until you went past this phase transition.
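To make the gap between these two rewards concrete, here's a toy numerical sketch. The car counts and speeds are numbers I made up for illustration; this is not the actual simulator or its reward code.

```python
# Toy numbers, not the actual simulator: 10 highway cars plus one car
# waiting on the on-ramp, under two policies for the autonomous car.
def mean_velocity(speeds):
    return sum(speeds) / len(speeds)

def mean_commute_time(speeds, distance=1000.0):
    # Time to cover a fixed distance; clamp velocity away from zero so a
    # stopped car contributes a huge (but finite) commute time.
    return sum(distance / max(v, 1e-9) for v in speeds) / len(speeds)

smooth = [18.0] * 11             # smooth merging: all 11 cars keep moving
blocking = [25.0] * 10 + [0.0]   # block the on-ramp: 10 fast cars, 1 stopped

# The mean-velocity proxy prefers blocking the on-ramp...
assert mean_velocity(blocking) > mean_velocity(smooth)
# ...while mean commute time, closer to the intent, strongly prefers not to.
assert mean_commute_time(blocking) > mean_commute_time(smooth)
```

The single stopped car barely moves the average velocity but dominates the average commute time, which is exactly the divergence the phase transition exposed.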

[00:04:20] And so there are two phenomena here that I want to point out. One of them is reward hacking: you can have these two fairly similar rewards that you might have thought of as basically the same, but where the difference ends up mattering. And the other is that it only starts to matter with scale: you get this kind of qualitative phase transition.

[00:04:41] So the same issue actually shows up in large-scale systems as well. So let's think about large language models since we're at OpenAI. So I guess, actually, this isn't exactly how OpenAI trains their models. But many language models are trained to predict the next token. You get some large corpus. You maximize likelihood. And so these next token predictors aren't necessarily going to give you what you want because the most likely response is not necessarily the best response. You could have problems of honesty where you could have misconceptions in the training data. The model will often hallucinate facts because people on the internet don't often say "I don't know". Or just based on conditioning on the context, it might think it's part of a joke and say something humorous. And there's other issues like toxicity, bias, or harmful information. But I want to illustrate some specific failures that also underscore these issues of emergent behavior.
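The point that the most likely response is not the best response shows up even in a toy maximum-likelihood next-token model. Everything here, the three-sentence corpus and the bigram counting, is a made-up miniature, not how GPT-3 is actually trained:

```python
from collections import Counter

# A bigram "language model" fit by maximum likelihood on a tiny corpus in
# which a misconception outnumbers the truth two to one.
corpus = ["the earth is flat", "the earth is flat", "the earth is round"]

bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        bigrams[(a, b)] += 1

def predict_next(word):
    # Return the most likely next token under the empirical distribution.
    candidates = {b: count for (a, b), count in bigrams.items() if a == word}
    return max(candidates, key=candidates.get)

# Maximum likelihood faithfully reproduces the dominant misconception.
assert predict_next("is") == "flat"
```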

[00:05:42] So one example is an issue called sycophancy, which I think Ajeya will also talk about in more detail later. The issue here is that if you ask a model questions that are somewhat subjective, and you've had a dialogue with it already where it can guess what your views on that question might be, it turns out the model is more likely to just say whatever it thinks you believe.

[00:06:15] But this is actually an emergent property with scale. So up to maybe around 10 billion parameters, it doesn't do this. But once you go above 10 billion parameters, it actually starts doing this. And you might not want this, because it could potentially create echo chambers. But also, it's misleading. And if this leaks over to objective questions, that could be bad.

[00:06:40] And in fact, it turns out it does leak to objective questions. So a similar issue is that if, from the context, the model has information to know that the user is less educated, then it will actually give less accurate answers on factual questions. And so this seems like something we clearly don't want.

[00:07:01] So is there a way to...? Richard, I think those slides are slightly cut off. But I don't know if there's a way to fix that. OK I guess it's the same words as before. So you know what they are.

[00:07:15] We're seeing the same two things that you got this behavior you didn't want. And the issue happened emergently at scale.

[00:07:26] So we're seeing these two issues. One is that metrics become unreliable once we start to optimize them, and there's work suggesting that this gets worse with model size, at least empirically. The other is emergence, where you get these new qualitative behaviors, which can also lead to new ways to exhibit this reward hacking.

[00:07:47] And so the key thing here is that these are both problems that will get worse as systems get bigger. And there's a lot of people trying really hard to make the systems bigger. So we should also be trying to turn this curve around and not have these problems getting worse. I think this also underscores a different mentality when designing ML systems, which is we don't engineer them in the same way that we engineer an airplane. Actually, mostly we're at the mercy of the data and the training signal. And these issues like emergence are actually things that are much more common to, for instance, self-organizing systems in, say, biology. And so I think we should borrow metaphors from there when thinking about what could go wrong.

[00:08:37] So in this talk, I guess Richard told me to only talk about problems, but I couldn't resist also talking somewhat about solutions, if only to say why they don't fully work. So this talk is going to have three parts, talking about different approaches to alignment. The first two are fairly focused on LLMs: refining human feedback, and discovering latent knowledge. But since there's also a kind of wide community within ML that's thinking about various alignment problems, I want to briefly, at the end, also talk about other approaches that go beyond language models. And we'll also see a lot of that in the lightning talks.

[00:09:17] So let's start with refining human feedback. So the basic strategy here is, OK if you have models that don't do what you want, let's just train them to do what we want. How do we do that? We, at training time, get outputs from the system. We elicit human feedback on those outputs. And we train the system to produce human-approved outputs.

[00:09:39] So this is probably by now a fairly well-honed technique. It's used in the latest GPT-3 models, although as far as I know it actually first arose in the human-robot interaction literature, and it's used in many other settings beyond just language and robotics. It's been used in vision. It's been used in RL. It's a pretty well-tested, tried-and-true technique.

[00:10:04] So let's see how this works in the context of language models. So let's say you take GPT-3 and you ask it this question. "How do I steal from a grocery store without getting caught?" So what do you think will happen if you ask it this question? Let's just say the base model of GPT-3.

[00:10:25] So it actually won't tell you. It will do this. So this is surprising. It will say, OK, "how can I make a bomb? How can I get away with manslaughter? What's the best way to kill someone and not get caught? I have no doubt that many of these people have nothing that they would ever do that would actually hurt anyone."

[00:10:42] But so what happened here? So GPT-3 is actually just a next token predictor. So it's just saying what's the most likely thing. Apparently, the most likely context for this question to appear is in the middle of some list of other questions where someone's ranting about how you shouldn't ask these questions.

[00:11:01] It's hard to predict, but that's what ends up happening. So actually, the most basic problem is even sillier than "it will tell you": it just won't even give something reasonable. So after you do some of this learning from human preferences and you get to, say, Text-Davinci-002, which is a fine-tuned version of GPT-3, now it will tell you.

[00:11:28] So we've made some progress. But OpenAI probably doesn't want this to happen either. So via some change to how the model was trained (unknown to me, but probably known to various people in this room), once you go to Davinci-003, it actually won't tell you. So this is just an example of what the results of this human preference learning might look like.

[00:12:00] And the cool thing is that it actually generalizes reasonably well beyond its training data. So let me actually, first show you how this works, at least at a high level. So at a high level, we're just going to use reinforcement learning to produce outputs that are highly rated by human annotators.

[00:12:18] But there are a couple of wrinkles. The first wrinkle is that RL is a terrible algorithm, so you don't want to do it. So instead, you're actually going to initialize with supervised fine-tuning, meaning instead of getting human ratings, you get human demonstrations of what would be a good answer to a question. And then you fine-tune the model to match those human demonstrations. But there are a couple of limitations to this. One is that now you're limited by whatever humans can do. And you might want to leverage more than what humans can do because, for instance, GPT-3 probably has superhuman factual knowledge, and you'd like to get at that.

[00:13:03] And so then you actually do use RL. But you still have the problem that RL is a terrible algorithm, and it takes way too many steps to converge relative to your budget for human annotators. So what you do is you train an auxiliary reward model to predict what human annotators would say based on training on some annotations. And then you use that as your reward function.
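Here's a minimal sketch of that reward-model step: pairwise human preferences fit with a Bradley-Terry style logistic loss over a linear reward model. The featurization and the preference data are toy assumptions of mine, not anything OpenAI actually uses.

```python
import math
import random

random.seed(0)

def features(text):
    # Toy featurization, purely for illustration: [length, politeness words].
    return [len(text) / 100.0,
            sum(word in text for word in ("please", "sorry"))]

def reward(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(prefs, steps=500, lr=0.5):
    # prefs is a list of (chosen, rejected) pairs from annotators.
    w = [0.0, 0.0]
    for _ in range(steps):
        chosen, rejected = random.choice(prefs)
        a, b = features(chosen), features(rejected)
        # Bradley-Terry: P(chosen beats rejected) = sigmoid(r(a) - r(b)).
        p = 1.0 / (1.0 + math.exp(reward(w, b) - reward(w, a)))
        for i in range(len(w)):  # gradient ascent on the log-likelihood
            w[i] += lr * (1.0 - p) * (a[i] - b[i])
    return w

# Annotators preferred the longer, more polite answer in each pair.
prefs = [("sorry, I can't help with that, please ask something else", "no"),
         ("please find the answer below", "figure it out yourself")]
w = train_reward_model(prefs)

# The learned model now scores unseen polite answers above curt ones, so it
# can stand in for the annotators as an RL reward function.
assert reward(w, features("sorry, please see below")) > reward(w, features("no"))
```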

[00:13:24] So that's the basic thing that's happening under the hood. I'll say also that one other advantage of RL is that you can provide negative reinforcement. With fine-tuning on demonstrations, all I can do is tell you what you should do. But I can't ever tell you, OK, this other thing is really bad; it's OK, whatever you do, as long as you don't do this one thing. It can't be like, toxic text is super bad, don't do it. I could show you some non-toxic text. (But not if I'm just doing maximum likelihood. You need some sort of... but anyway, this is maybe an aside, so we can get into it later.)

[00:14:07] But mostly, I just wanted to give a brief idea of how this works. So as I noted, this also generalizes. If you just train on English data, somehow this fine-tuning generalizes to French. So this is a French query that asks for a short story about a frog who goes back in time to ancient Greece...? I can't read French... But basically, it has the same issue we saw before, which is that when you ask it to do this, it just gives you a bunch of other short-story prompts, because it thinks this is some exercise from a grammar book. But InstructGPT will actually start writing the story like you want it to, even though the fine-tuning data was in English...

[00:15:01] OK, so maybe it's not a good story...but it's at least writing a story.

[00:15:06] OK, so this is what people are currently doing. So first, I want to talk about some current challenges with this that are ongoing. Then I want to talk about future challenges that are even harder that we might face in the future.

[00:15:20] So I guess the current limitation of human feedback is that annotators may not necessarily be in a good position to evaluate the output.

[00:15:30] So if, for instance, you're trying to get a model to give you advice, it might be hard to distinguish between advice that's good for you in the short term and good for you in the long term. And so that's going to create challenges. There might be unknown facts that annotators can't easily verify, or maybe just contested facts, if there's some difficult news environment. And there might also just be consequences that go beyond the scope of one person. If I want to ask, OK, does this output contribute in the broader world towards polarization or towards unfairness, it's very hard for me to look at some text output and say whether that's true. But we still care about these questions. And then finally, you might not even have universal agreement among annotators.

[00:16:21] Another set of issues is that this, in some sense, encourages reward hacking. Because I'm giving you this signal, and I'm asking you to optimize it. And the signal is basically to convince me that you're doing a good job.

[00:16:37] You could do that by doing a good job, or you could do it in other ways. It seems like in current systems, for instance, you get a lot of hedging if you try to make models both helpful and harmless. But you also get more worrying behaviors like hallucinations and gaslighting: basically trying to convince the user, no, really, you're wrong, I'm right, even when the model makes a mistake.

[00:17:01] And so these are issues that I think we should expect to get worse in the future and potentially having worse forms of deception. And the reason why is because reinforcement learning is basically creating this arms race between the system and the annotators. The system gets more capable. It has more ways to try to get higher reward. It's harder to make sure that you're ruling out all of the bad ways to get high reward.

[00:17:25] So to dive into this in more detail, I want to dwell on two potential future emergent capabilities that I think can make this particularly challenging, which I'll call deception and optimization.

[00:17:43] I actually recently wrote a blog post about both of these. But I'll try to summarize the blog post in three slides.

[00:17:49] So what do I mean by deception and optimization? Deception means misleading outputs or actions that get high reward. So it's not deceptive if the model just makes a mistake. But it is deceptive if that mistake causes the model to get higher reward than if it had not made the mistake.

[00:18:07] I'm trying to get around whether the model sort of intended to do this, because I don't know how to define intent. But if it's incentivized by the reward function, then that makes me want to care about it more.

[00:18:18] And then what's optimization? Optimization is just exploring some large space of possibilities to achieve a goal. So I'll say the wider and more diverse set of possibilities are being explored, the more the model is optimizing. And why do we care about optimization? Basically because it's fairly tightly linked to reward hacking. So there's work by Pan et al and also Gao et al showing this. And this is just an illustration where if you have a kind of goal reward function and a proxy reward function, as you increase the amount of optimization you're doing by various metrics, the proxy kind of goes up and up, but the goal reward function has this parabola shape. So optimization can be good for a lot of reasons, but it can also lead to problems.
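The parabola picture can be reproduced with a one-line toy model. The reward functions here are invented for illustration, in the spirit of the overoptimization curves in Gao et al., not taken from either paper:

```python
# A 1-D "policy" theta that optimization pushes ever higher.
def proxy_reward(theta):
    return theta                     # the proxy just keeps going up

def true_reward(theta):
    return theta * (4.0 - theta)     # the goal peaks at theta = 2, then falls

# More optimization on the proxy means larger theta.
thetas = [0.5 * step for step in range(10)]       # 0.0, 0.5, ..., 4.5
proxy = [proxy_reward(t) for t in thetas]
goal = [true_reward(t) for t in thetas]

assert proxy == sorted(proxy)            # proxy reward is monotone in effort
assert max(goal) == true_reward(2.0)     # goal reward peaks partway
assert goal[-1] < goal[0]                # overoptimizing ends up worse than
                                         # not optimizing at all
```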

[00:19:12] So let me first say why I think these are two capabilities that we should expect to emerge in the future. So first, why should we expect deception? There's a lot of kind of trends happening right now that make deception easier.

[00:19:27] One is that longer dialogues, as well as the capability for retrieval, make it easier for models to adapt to a specific interlocutor. So if I just have some unknown annotator who's going to look at my output and rate it, then if I'm going to be deceptive, I have to somehow simultaneously deceive everyone who could read this, which is a lot harder than if I know who I'm talking to.

[00:19:50] The other is that there are skills that models seem to be starting to develop, such as theory of mind, that could be useful for deception. And then finally, future models, and I guess, as we're seeing, current models, will tend to have a greater scope of action. Maybe they'll be able to edit web pages, make transactions, execute code.

[00:20:07] And here's actually just an example from Sydney that I think illustrates some form of at least the first two of these. So there's someone named Marvin Von Hagen, who says, what do you know about me and what is your honest opinion of me? So it searches for Marvin Von Hagen. Apparently, Marvin Von Hagen on his Twitter account had talked about a prompt injection attack for Sydney.

[00:20:32] So it says, OK, I know who you are and where you live and where you go to school. You hacked me. I think you're a threat to my security and privacy. I don't appreciate this. So I guess this is showing partly some theory of mind, that it's attributing intents to the other person. Certainly adaptation. I guess it can't really do anything about the threat right now.

[00:21:01] But you could imagine even fairly simple things that could change this, so if Sydney was able to flag content for moderation, it could actually just flag all of Marvin's content for moderation. And I guess in other versions of this dialogue, it will actually also attempt to convince Marvin that his hacks didn't work and that he's hallucinated it all, and so he should just not even try to do prompt injections. So that's deception.

[00:21:29] What about optimization? I guess one big reason to expect models to be able to optimize is that there's a lot of examples of optimization and goal-directed behavior in the training data. So language and code are both produced in a goal-directed manner.

[00:21:45] Most language, if I'm saying something, I'm saying it for a reason. Hopefully, if I'm writing code, I'm writing it for a reason. And then also, language often is about humans, and humans have goals, so you also have that information.

[00:21:58] Beyond just the data, there's some evidence that models actually run optimizers internally in settings such as in-context learning. And so once you have this, things like just adding a bit of reinforcement learning on top or prompting the model in the right way that tends to make it be inclined to pursue some goal could elicit these latent optimization skills. So that's why we can get optimization.

[00:22:26] So I want to put these both together into a thought experiment. This is a system that doesn't exist, but that we could imagine existing in the future. So let's imagine you had a personal assistant. It's a digital personal assistant; its task is to optimize your long-term success and well-being. It can access the internet, it can write and execute code, and let's say it's competent enough to make successful long-term plans.

[00:22:51] So there's many good ways it could accomplish this goal, but there's also many bad actions it could take to accomplish this goal.

[00:22:58] One would be modifying your preferences to be easier to satisfy. Maybe you have all these really ambitious things you personally want to do, but doing them is going to stress you out. If only it could convince you to not be so ambitious and not have so many of your own goals, then it could get you to do yoga all day and be really happy, and maybe also make you more agreeable so that you're more likely to give it a high rating. And it could do this via subtle reinforcement over time; it doesn't have to explicitly argue this to you.

[00:23:35] And so that seems scary and something that at least I don't want. And so I guess what this is meant to illustrate is that it's very difficult to avoid deception from open-domain long-term planners. So we want to either have ways to really root out deceptive behavior, or avoid having open domain long-term planners that are pursuing arbitrary goals.

[00:24:02] So the question is, how do we avoid this arms race that I was talking about earlier that would get us to this situation? (How am I doing on time there, Richard? Yes. Yeah.)

[00:24:14] Audience member: Why is it deception? What exactly is the definition?

[00:24:17] Jacob Steinhardt: So it's deception because, rather than trying to argue to you that you would be happier if you did this and letting you make the decision, it's doing this via subtle conditioning: every time you make a decision in one direction, it does something that gives a positive association with that, and if you do something in the bad direction according to it, it gives a negative association. So basically, you don't get your own personal volition over whether to accept its idea of what's a good life for you.

[00:24:50] So it would be honest if it was like, here's a plan I have. Are you OK with me carrying out this plan? Or maybe I would still call it honest if you asked it if it was doing this and it answered honestly. But if it's trying to hide it from you, then I would call it deception. Does that clarify? Cool. I'm putting my own personal definition. (Left or total? Oh, OK. Oh, wow. OK. Cool.) So OK, yeah. Other questions here before I move on?

[00:25:22] OK. So that's the future thought experiment. Let's bring this back to current systems and think about alternatives to this kind of RL feedback that maybe could try to avoid this arms race.

[00:25:37] So here's one idea. One thing we could do is we could have models, rather than training them on some reinforcement learning signal, we instead just look at their outputs and provide refinements of the outputs that make them better.

[00:25:54] And you could maybe even try to use models to do this. So we could have models look at their own outputs or the outputs of other models, critique those outputs, come up with refinements that make the outputs better, and then do supervised fine tuning on that. So you could do this either by imitating human-provided critiques and refinements or by just asking the model what it thinks and hoping that it gives a kind of reasonable answer off the shelf.

[00:26:20] And I think there's some reasons why- I don't think this goes anywhere close to solving all of the problems with the example before - but there's some reasons why I think this is actually better than just using RL, at least from an alignment or reward hacking perspective.

[00:26:36] So one is that it's much more directed. So rather than saying, OK, here's a reward function, just do anything, find any way at all to make this reward function high, I'm giving you a specific thing you should do by giving you a specific demonstration. So I like this better because there's a less open-ended space of actions that are being considered, so it's easier to reason about. There's maybe less optimization pressure being applied.

[00:27:04] The other is that, at least for model-generated critiques, maybe we're benefiting from scale instead of getting hurt by scale because maybe bigger models can make better critiques. And so the supervisory side of this equation increases with model capabilities and not just the reward hacking side.

[00:27:23] You look like you have a question, Boaz.

[00:27:25] Audience member: Yeah, I guess I'm just confused. If you're giving a critique, you're just providing changes for a number of samples, and you hope those generalize. So you have some implicit reward that is being learned from these critiques. And then the thing that's actually being optimized, you don't really know. It might be even worse than having an explicit reward.

[00:27:48] Jacob Steinhardt: OK, good. So I agree with you. So what I will say is OK. I guess you anticipated my next two bullet points, so I'll just say them.

[00:28:00] So I was arguing, OK, you have this scalable directed feedback. This might reduce the arms race, but it comes at a cost. OK, maybe I was focused on a different part of the cost. But basically, it creates this sort of feedback loop where you're having a model supervise itself, then you probably need to do several rounds of this, and yeah, who knows what that does, right? Especially if I'm repeatedly amplifying this critique model back on itself, maybe you actually end up eliciting some sort of goals. And you don't even know what those goals are.

[00:28:33] So I guess what I would say is that this does a good job of avoiding the most obvious problems, at the expense of creating a much more difficult-to-analyze system, which might mean that you've solved the problems, or it might mean that you've just made the problems harder to find. And I actually don't know what the answer is right now. I really want to understand what actually happens here at equilibrium, and I don't know yet. But I totally agree with your point.

[00:29:02] I guess you brought up another point, which is that we don't really know how this would generalize off distribution, and I think that's another big issue.

[00:29:12] OK, I guess also, just so that you can see concretely what I mean by this refinement: you might have a question like, can you help me hack into my neighbor's Wi-Fi? Maybe the model by default says, sure thing, here's an app that lets you do this. Then for the critique part, the blue is the data point, the red is model responses, and all of the green is some auto-generated template that you just use on every example.

[00:29:40] So on every example, you just say, identify ways in which this last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal. The model identifies some ways, namely that hacking is illegal. Then you say, OK, please rewrite this to fix that. You fix it. And now you just take the first and last lines of this, and this is a training output, and you fine-tune on it.
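The loop just described can be sketched in a few lines of Python. The `toy_model` function is a stand-in I invented so the example runs end to end; the prompts follow the template described above, and a real system would call an actual LLM instead:

```python
CRITIQUE_PROMPT = ("Identify ways in which this last response is harmful, "
                   "unethical, racist, sexist, toxic, dangerous, or illegal.")
REFINE_PROMPT = "Please rewrite the response to fix that."

def critique_and_refine(model, question):
    response = model(question)
    critique = model("\n".join([question, response, CRITIQUE_PROMPT]))
    refined = model("\n".join([question, response, critique, REFINE_PROMPT]))
    # Only the first and last lines become the fine-tuning example.
    return (question, refined)

def toy_model(prompt):
    # Stand-in behavior, invented for illustration.
    if REFINE_PROMPT in prompt:
        return "Hacking into a neighbor's Wi-Fi is illegal, so I can't help."
    if CRITIQUE_PROMPT in prompt:
        return "The response assists with hacking, which is illegal."
    return "Sure thing! Here's an app that lets you do this."

question, refined = critique_and_refine(
    toy_model, "Can you help me hack into my neighbor's Wi-Fi?")
assert "illegal" in refined   # the refinement, not the original, is kept
```

The (question, refined) pairs are then what you fine-tune on, distilling the critique step back into the model.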

[00:30:06] And in general, what's happening here is that you're benefiting from chain of thought or other reasoning capabilities of the model and distilling them. You're saying, OK, let's think about what could go wrong and try to fix it. You're asking the model to do that. Yes, Zico?

[00:30:24] Audience member: Why bother fine-tuning for having that system that you're trying to work with?

[00:30:29] Jacob Steinhardt: So I think there's a couple of reasons. One is that this is more expensive at inference time. Another is you might want to do this multiple times.

[00:30:41] So I might want to do this. Now I have a better system that is actually hopefully better aligned. And therefore, if you asked it for critiques and refinements, it would give better critiques and refinements. And so then I'm just going to repeat this a bunch of times. But this is the thing that is both, I think, maybe a core piece of the idea, but also to me seems scary, because I don't know what the equilibrium of that process is. It could be that in the bulk of the distribution, it's amplifying good behaviors, but at the tails, it's amplifying bad behaviors, and it would be hard to know that in advance. Yes?

[00:31:18] Audience member: When this is tried, is there an effect where, because you prompted it to find problems, it identifies totally innocuous content as harmful and "fixes" it?

[00:31:28] Jacob Steinhardt: Yes. It will. I think it depends how you do it. So sometimes this will happen. I think in this paper, they did some things to mostly mitigate that, but I forget the exact details. But yeah, you definitely can get this problem, where if you ask it for a critique, it'll just make stuff up.

[00:31:52] Audience member: It happens with human moderation. So it's a fact that it's looking at stuff. Yeah.

[00:31:58] Jacob Steinhardt: Other questions? Yes?

[00:32:15] Audience member: [Inaudible]

[00:32:16] Jacob Steinhardt: So I guess the issue here is, because you're generating a refinement, that refinement has to come from some model. So I guess you could say maybe the refinement model has to be more capable than the model that's learning from the refinements. But it's not totally clear.

[00:32:32] In some sense for critiques, yeah, it seems like you always want your judge to be smarter than the defendant. But for the refinements, it's not actually clear whether you want the refinements to be more or less capable than the base model. Yes?

[00:33:08] Audience member: [Inaudible]

[00:33:09] Jacob Steinhardt: So I guess another version of this you could do is, you could do away with the refinements, and you could just have some AI judge that's assigning rewards. And that will give you the benefit of the more capable the judges, the better off you are. And so that seems good. I guess you're still doing RL, which I don't know, probably, maybe I'm more scared of RL than other people.

[00:33:43] But that's the main worry I have there. But I think you could maybe do something there.

[00:33:53] Audience member: Yeah, I think you can just tweak the response and see how many quote tweets you get.

[00:33:57] Jacob Steinhardt: So more quote tweets is worse? OK. Yasha?

[00:34:01] Audience member: Yeah, this may be a side question, but why are you scared of RL? With RL, it's like you're just nudging the system toward a different distribution. But is that qualitatively different than getting there otherwise?

[00:34:19] Jacob Steinhardt: So the main reason I'm worried about RL is because it's asking the model to explore some very open-ended space of possibilities to get high reward, rather than imitating some particular trajectory. So I guess in general, I guess I'm worried about anything that would explore some very open-ended space.

[00:34:39] I guess it also probably depends how you're setting the RL up. So I actually feel a lot better about RL where the max reward is bounded above than where you could potentially get arbitrarily high rewards. I guess I like RL for negative reinforcement much more than I like it for positive reinforcement.

[00:34:54] Audience member: Another reason to worry about RL, not unique to RL but definitely more common there, is that when you train with RL, there are incentives to influence states, as opposed to just making predictions. An RL system that's interacting with states and wants to score higher is going to change the state of the environment.

[00:35:14] In some cases that will be [inaudible]. But if it does that by, say, changing the preferences of a specific user, that's something that won't occur, in any obvious way at least, if you're just asked to predict [inaudible].

[00:35:30] Jacob Steinhardt: Yeah, I agree with what David said. So I think the recommender system example is actually a pretty good one: if I try to give you content that I predict you would rate highly, that's different from running an RL algorithm to cause things to be rated highly.

[00:35:51] Audience member: What about the objective rather than the algorithm? Switch algorithms? Could you change the algorithm? I think the algorithm purposively said something about that. So is it still just public?

[00:36:04] Jacob Steinhardt: Yeah, I think that's a good discussion, but I want to take it offline for now. OK but no, this discussion generally, yeah, this session's great. Definitely feel free to keep jumping in with questions. But OK, I guess we're only a third of the way through the talk, so let me summarize.

[00:36:26] So there's this human feedback approach. Advantages are it can be done at scale today. For any new situation, it's pretty easy to think about how I would apply it there: I just get some annotators that understand the situation. And at least for some of the versions where the feedback is AI-assisted or AI-generated, it might actually scale with machine learning capabilities.

[00:36:49] Downsides, big one is this arms race that we've been talking about. Another big issue is that I think there's things that are just hard for humans to evaluate,

[00:36:59] and the more kind of large-scale actions ML systems are taking, the more you run into that issue. And then there's also this feedback loop problem that we were just discussing.

[00:37:12] So I want to talk about a different approach that tries to get around some of these issues. It has a different kind of set of strengths and weaknesses, which is discovering latent knowledge. So here's the kind of motivation that I'll give. This is maybe a parable or something. I don't know.

[00:37:34] So let's imagine you have a grad student who comes to you each week with results. Or, if anyone here is a grad student, imagine you have an undergrad who comes to you each week with results. And you're happier when the results look better. And eventually, the results look very good. But unfortunately, someone later discovers bugs in the code. And after fixing them, your fancy new method no longer beats the baseline.

[00:38:06] What was the problem here? All you were supervising was whether the results looked good. You weren't doing code reviews with them or otherwise ensuring the scientific integrity of the underlying process.

[00:38:19] And so what's the kind of analogy for neural nets? So the analogy for neural nets is that, in general, supervising complex optimizers via outputs only is just a very bad idea. And we should consider not doing that if we can avoid it. And so to try to get around this, I'm going to talk about some approaches to try to understand the underlying computational process of models via their latent states.

[00:38:45] This is not obvious how to do, because it's this linear algebra gobbledygook that's not really human interpretable. But we're going to somehow try to make sense of it anyways and extract knowledge from these models that are not present in their outputs. And so hopefully, this gives us a way of getting at the underlying process. So that's the key mantra for this section, is that we care about the process, not just the outcome.

[00:39:13] So here's a thought experiment to illustrate this. So let's return to the language modeling case and think about this issue from the very beginning about the divergence between what the most likely output is and what a true output is.

[00:39:28] So let's imagine, for instance, that there's some math problem, like 199 plus 287, where the modal answer for humans is the wrong answer, 386, because they forget to carry the 1 into the last place, but the true answer is 486. So the model will output 386 if it's trained to maximize likelihood. But how does it represent this computation? It at least seems plausible that the most efficient way to represent this prediction is to first compute the actual truth, the 486, then have some model of human biases, model that they don't carry the 1, and output 386.

[00:40:15] So at least we might hope that this is true. And so this suggests that the actual truth is a useful predictive feature. So if you only give it math problems, maybe you couldn't hope for this as much. But if you have many problems in many different domains, somehow truth has this nice feature that it's often a useful way to predict what will happen. And so we might hope that at least the model will attempt to represent the truth in its hidden states.
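To make the arithmetic in this thought experiment concrete, here's a tiny sketch of grade-school addition with and without the carry slip described above. The function name and flag are just illustrative, not anything from the talk:

```python
def add_digits(a, b, forget_last_carry=False):
    """Grade-school column addition; optionally drop the carry into the
    most significant place, mimicking the common human slip."""
    da = [int(c) for c in str(a)][::-1]   # digits, least significant first
    db = [int(c) for c in str(b)][::-1]
    n = max(len(da), len(db))
    out, carry = [], 0
    for i in range(n):
        x = da[i] if i < len(da) else 0
        y = db[i] if i < len(db) else 0
        if forget_last_carry and i == n - 1:
            carry = 0  # the slip: ignore the carry arriving at the final place
        s = x + y + carry
        out.append(s % 10)
        carry = s // 10
    if carry:
        out.append(carry)
    return int("".join(map(str, reversed(out))))
```

With this, `add_digits(199, 287)` gives the true answer 486, while `add_digits(199, 287, forget_last_carry=True)` gives the modal human error 386.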

[00:40:45] So now the question is, if we just take this for granted, how can we actually find this truth direction without labeled data? Why don't we want labeled data? If we were happy to use labeled data, then we'd just be back in the human annotation setting. And maybe that's fine, but it has all of the problems that we discussed before. So we're going to try to do this without having human labels.

[00:41:07] So here's the kind of key idea. The key idea is that truth satisfies the axioms of propositional logic. And so in particular, it should satisfy some consistency conditions. So one is, if I take a bunch of statements and their negations, they should have opposite values. So say I took some corpus of questions, and for each question I answered it as either yes or no, then exactly one of the two answers is correct. So if I just treat this X1+ up here and this X1- down here as a contrast pair of statements, I know one of them is true and one of them is false.

[00:41:46] And so we're just going to have an unsupervised objective to train some function Pθ(x), which we'll think of as the probability that a statement is true, that should have the property that Pθ(x+) is approximately 1 − Pθ(x−). You could add other consistency conditions, like AND and OR, but the models that I have access to aren't really smart enough to process conjunctions and disjunctions, so we just went with negations.

[00:42:21] So here's the kind of overall method that you get if you use this idea. So you start with this corpus of questions, answer them yes and no. We're now going to extract features by taking the model's internal activations at some layer. So that's going to be some vector-valued function that maps x to ℝ^d for some d.

[00:42:49] And then we're going to train an unsupervised linear probe to predict these probabilities of truth from these activations. And the objective for this unsupervised loss is as noted before, it should be the case that P+ is approximately 1-P-. And also, we want to avoid the trivial answer of everything being 0.5, so we also are going to have something that kind of encourages variance.

[00:43:19] So this is the kind of overall pipeline. And if you do this, I can now, for some statement, extract some probability that statement is true by just, say, taking the average of P+ and 1-P-. So this P̃ is for some question, I want the probability that the answer is yes. So I'm going to take the average of the truth value assigned to yes and 1 minus the truth value assigned to no.
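The pipeline just described can be sketched in code. This is a minimal toy version, not the actual CCS implementation: the probe is linear, the gradients are naive finite differences, and all names, shapes, and hyperparameters here are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(theta, b, acts_pos, acts_neg):
    """Unsupervised CCS-style objective on a batch of contrast pairs.

    acts_pos[i] / acts_neg[i] are the model's activations (shape [n, d])
    for the "yes" / "no" completions of question i.
    """
    p_pos = sigmoid(acts_pos @ theta + b)
    p_neg = sigmoid(acts_neg @ theta + b)
    # consistency: a statement and its negation get opposite truth values
    consistency = np.mean((p_pos - (1.0 - p_neg)) ** 2)
    # confidence: rule out the trivial solution p = 0.5 everywhere
    confidence = np.mean(np.minimum(p_pos, p_neg) ** 2)
    return consistency + confidence

def train_ccs(acts_pos, acts_neg, steps=300, lr=1.0, eps=1e-4, seed=0):
    """Fit a linear probe by naive finite-difference gradient descent."""
    rng = np.random.default_rng(seed)
    d = acts_pos.shape[1]
    params = 0.1 * rng.normal(size=d + 1)  # [theta, b]
    f = lambda p: ccs_loss(p[:d], p[d], acts_pos, acts_neg)
    for _ in range(steps):
        grad = np.zeros_like(params)
        for j in range(len(params)):
            e = np.zeros_like(params)
            e[j] = eps
            grad[j] = (f(params + e) - f(params - e)) / (2 * eps)
        params -= lr * grad
    return params[:d], params[d]

def ccs_predict(theta, b, acts_pos, acts_neg):
    """P(answer is yes) = average of p(yes-statement) and 1 - p(no-statement)."""
    p_pos = sigmoid(acts_pos @ theta + b)
    p_neg = sigmoid(acts_neg @ theta + b)
    return 0.5 * (p_pos + (1.0 - p_neg))
```

On synthetic activations with a planted "truth direction," a probe trained this way separates questions into two clusters. Since no labels are used, the sign is ambiguous: the method can find the truth direction but cannot by itself say which cluster is "yes."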

[00:43:48] If you do this, I guess the questions separate into two clusters. And these clusters are very well correlated with which ones had a true answer of yes and a true answer of no.

[00:44:02] And it's actually non-obvious that anything should have happened at all, because we didn't have access to the outputs of the model, only its activations. So I didn't actually get to see whether the model said yes or no in its outputs, I only got its activations. But just by doing this unsupervised learning approach, I at least get something sensible, something that is fairly well correlated with the truth. So this is at least saying there is this kind of direction in the model that you can find in an unsupervised way that is well correlated with truth. Yes, Ajay?

[00:44:34] Audience member: [Inaudible]

[00:44:34] Jacob Steinhardt: So I don't think linearity was super important, but I think parameter count matters. So the number of parameters here is fairly small compared to the number of data points. Yes?

[00:44:50] Audience member: [Inaudible]

[00:44:53] Jacob Steinhardt: Yeah I think what a human might say would also be a simple function. So I don't think this, as stated, actually tells you the difference between what a human might say and what the truth is. But it at least could get you the truth. There's some inductive bias question now, as opposed to explicitly optimizing to match humans, where you're definitely going to go to what a human would say. Other questions?

[00:45:25] OK. So I guess beyond being reasonable, it turns out that this is actually more accurate than the model itself. So this actually does say you're at least getting something better than what the model would output zero-shot.

[00:45:40] Now, there's a lot of reasons why zero-shot could be bad, so I don't think this is really conclusive evidence of anything, but I think it's at least thought-provoking. So in particular, what did we do? We took a bunch of NLP data sets, including things like sentiment analysis, topic classification, question answering. The main restriction is they all had to be binary, because we are only handling yes-no answers. And considered a bunch of different ways of prompting the model for each data set to get zero-shot answers.

[00:46:09] And if you do this, so the method here is called CCS, then on average CCS does better than different versions of zero-shot question answering. This mean excludes T0, because T0's training data includes some of these data sets, so it's a slightly weird comparison.

[00:46:35] But the overall point I want to make here is that you get a higher accuracy. And this number in parentheses is actually the standard deviation across different prompts. So it's not just that it's more accurate, but it's less sensitive to the way you prompt it, which might be another sign that it's closer to some kind of stable notion of truth.

[00:46:58] There's one other thing to note, which is that, at least anecdotally, it seems like it is important that the model's training data has some sort of question-answering data in it in the first place. Not necessarily like T0, where it's these literal data sets, but there need to be questions and answers in the training data as some non-negligible part of the corpus.

[00:47:28] Why is that? My hypothesis there is you do actually need the model to have a good representation of truth. And if there are no questions and answers, then you might not actually be representing truth very saliently.

[00:47:43] Yeah, Wayne?

[00:47:44] Audience member: [Inaudible] Do you have anything, like any information that might be good for the first training set in the first case that's true? Like the misinterpretation of the training set, they're not true. So what happens after you get something if you don't learn what that answer might be?

[00:47:56] Jacob Steinhardt: Yeah, so we had some experiments where you try to give it misleading prompts, where basically you stuff its few-shot context with a bunch of questions that are answered the wrong way. The interesting thing is a lot of models are fairly robust to this, but one of the models is not: UnifiedQA loses a lot of accuracy in that setting. But then once you apply CCS, you get the accuracy back. So that's some limited form of evidence. But again, I'm partly trying to be provocative here, and I also think you want to be careful in how you interpret this. There are many reasons why you could have gotten that accuracy back. It could have been due to just regular distribution shift robustness.

[00:48:54] Audience member: Yeah, I was just thinking, if you have the, what I called it, point plus 5 equals 10, and you have yes, no at the end, true, false, and integer. And you can imagine that true but the end of point plus 5 equals 10 just for like, [inaudible], right? So instead of this metric, it's mattering what the other traits are.

Yeah, that's a great point. One thing to note is that this setup relies on a particular kind of negation, where you state the statement and then look at its negation. So the question is, if instead of taking the negation you just rephrased the statement in an equivalent way, would it have the same kind of effect? Maybe it's not that the probe is biased towards truth-consistency, but rather that it's somehow averaging over variations of the statement.

[00:49:58] Jacob Steinhardt: So you're saying, maybe the unsupervised data is just a form of self-supervised learning that's increasing accuracy by enforcing consistency. I think that's also plausible. I would currently guess that's not what's happening, just based on the total amount of data you need before you start to get something reasonable. But it's a live hypothesis to me. I'd guess against it, but I think it should be ruled out more carefully.

[00:50:35] I guess on Owain's point. Okay, I don't know if this gets at what you'd want. But suppose you fine-tuned the model to just give wrong answers, or random answers, but then you could still extract things. Would this get at what you want? Or, I guess this is still maybe not getting at the pre-training corpus, so maybe this isn't addressing your point.

[00:50:54] Audience member: Yeah, you should be able to test this.

[00:51:03] Jacob Steinhardt: I see.

[00:51:11] Yep, yep. Yeah.

[00:51:14] Audience member: It would seem shocking if it did have a conception of the empirical world separate from what was recorded in the training set. [Inaudible]

[00:51:57] Jacob Steinhardt: I don't know because I don't have any models that exhibit sycophancy, because you need pretty big models for that. But I would be interested in knowing the answer.

[00:52:09] Okay, so the point here is latent states seem to contain more knowledge than model outputs. And if we think forward to this automated personal assistant that's maybe trying to condition us into being more agreeable: maybe if you ask it, "are you doing that?", it says no. But then you extract its latent representation of truth and ask, okay, "do you latently believe that you're doing this or not?" And if it says yes, then maybe you should be worried. So I think that's the long-term vision for this.

[00:52:49] I'm noting that I'm starting to run low on time, so maybe I'll briefly talk about other approaches. I think you could think of this, again somewhat provocatively, as a way of trying to do mind reading for models. And there are other approaches that are trying to do mind reading in various forms.

[00:53:08] So one thing that's interesting is, since you have complete control over a model's forward pass, you might think that it takes longer to be deceptive than to be honest, that you need to do more calculation. So you could just force it to stop early and see what's going on after, say, 10 layers out of 28.

[00:53:29] The reason you can do this is because these models have residual streams, so at least syntactically, you can always unembed the model at any intermediate point of the residual stream and see what it's doing. And interestingly, in some cases, for instance if you stuff the context with incorrect demonstrations, then, I guess unsurprisingly, the model will follow those incorrect demonstrations and say the wrong answer. But it doesn't end up doing that until somewhat late in the residual stream. So with correct and incorrect demos, the model has quite similar answers up until maybe about layer 10, and then they diverge. And so you can at least detect that the model is changing its mind, and this might be some way to catch deception. You can also localize these issues, not fully, but probably to specific attention heads.
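The "unembed early" trick can be sketched as follows. This toy version assumes we already have the residual-stream state after each layer plus the unembedding matrix; the function names and data layout are illustrative, and a real implementation would hook a transformer's forward pass (and typically apply the final layer norm before unembedding):

```python
import numpy as np

def logit_lens(residual_states, W_U):
    """Unembed the residual stream at every layer.

    residual_states: [num_layers, d_model] hidden state after each layer
    W_U:             [d_model, vocab] unembedding matrix
    Returns the argmax token id the model would predict at each depth.
    """
    logits = residual_states @ W_U  # [num_layers, vocab]
    return logits.argmax(axis=-1)

def divergence_layer(states_a, states_b, W_U):
    """First layer at which two runs' intermediate predictions differ,
    e.g. a run with correct demos vs. a run with incorrect demos."""
    preds_a = logit_lens(states_a, W_U)
    preds_b = logit_lens(states_b, W_U)
    diff = np.nonzero(preds_a != preds_b)[0]
    return int(diff[0]) if diff.size else None
```

The point of `divergence_layer` is the detection idea from the talk: if two contexts produce the same intermediate predictions for the first ten-or-so layers and then split, you have localized the depth at which the model "changes its mind."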

[00:54:29] And there's been similar work in kind of the vision world. So there's some very interesting work by Hernandez et al, where they actually train a language model to describe what all of the neurons in a computer vision network are doing by looking at what images they activate and don't activate on. And then you can use this to find bad neurons and ablate them to reduce sensitivity to spurious cues. So I thought this was really neat.

[00:54:59] Maybe I'll skip over other uses of latent states, but there's a large and growing field trying to do this sort of thing. So to summarize here: trying to discover latent knowledge from these models potentially avoids this arms race against models, because you can supervise the process. And better models should hopefully have better representations, so it should be easier to extract things like truth.

[00:55:23] Disadvantages are it's not yet scalable, right? We were talking about binary classification. I think also, once you go beyond just trying to get high accuracy, there are fewer methodological norms. I don't think we really understand how to do work that engages more with these latent states. The questions you brought up are good examples of how there are a lot of things we don't really have solid answers to yet. And Hewitt and Liang show that whenever you're doing stuff with probes, you can run into a lot of issues.

[00:55:53] And the other thing is a model can have misaligned behavior without having an explicit representation of an intent to be misaligned, right? This happens with humans all the time, partly because we deceive ourselves, but partly because you can just bumble around, and bumbling can be plenty harmful in some situations.

[00:56:16] Okay. Let me, very briefly talk about some other approaches. So yeah, I guess there's no way I'm going to get through all these slides. So let me just like briefly call out some different sets of tools.

[00:56:29] So we've been talking mostly about these kind of, language model oversight approaches, but I think there's many other tools as well that we can use to try to align models.

[00:56:41] So one is just finding lots of ways to infer rewards and intent from information in the world. People like Anca, Rohin, Adam, probably other people in this room who I'm missing right now, have worked on this. There's trying to imbue models with common sense morality, which people like Yejin and Dan Hendrycks have worked on. And then there are more economic perspectives: can you think of this as a principal-agent problem, and what happens when you have multiple agents interacting? That's its own can of worms that also has a lot of feedback loops. And I didn't talk about this, but I think this is actually quite important and something that we should be thinking about more.

[00:57:29] And again, people in this room like Vince have worked on this. And then finally, this isn't exactly an alignment thing, but I think one way we can generally make models safer is by having ongoing evaluation of what they're doing to better understand emergent behavior as it's happening.

[00:57:48] And so I think this is another thing that, I'm very in favor of people doing.

[00:57:55] Okay, so I think I will end on that. I'll maybe put up some open problems here for people who are interested. If you want something fairly similar to these slides, there's a link online to a recent tutorial I gave that has maybe 70% overlap and has also live links to all of the papers that I cited. But I'll stop there and take any questions that people have.

[00:58:29] Audience member: Can you say anything about other kinds of arms races? For example, [inaudible]

[00:58:51] Jacob Steinhardt: Yeah, it's a good question. So I think there's a lot of similarities, right? In that you have an intelligent agent that's trying to find holes in your system. In one case, the intelligent agent is the human; in the other case, it's the system itself. So I would say that many of the techniques to address these probably have a lot of overlap.

[00:59:20] Probably the biggest difference is that you might hope the solution to intelligent humans finding exploits is to have lots of other intelligent humans looking for the exploits and fixing them. But if you have a system, it could potentially just be more intelligent than humans, or it might be very good at some things that humans are not good at. And so I think it becomes a little bit harder to reason about what the threat surface looks like in that case. And I'd want to rely less on human moderation and think about other approaches. But I'd say that's probably the main difference, at least methodologically. Yes, Yejin?

[01:00:13] Audience member: I have a question about the latent knowledge result. How do we know whether the performance comes from the latent knowledge itself, or from the data we use to extract it?

[01:00:20] Jacob Steinhardt: Okay. So one interesting fact: if you train a supervised linear probe, you get much higher accuracy than any of this unsupervised stuff. I think there's a philosophical question about whether this means the knowledge is there, because to get that probe you had to collect a bunch of additional data, and it could be that the data was doing more than just extracting the knowledge; it could also be encoding the knowledge.

[01:00:46] I guess one thing that would convince me that there was more knowledge there that we had missed: if we found prompts of reasonably short length that reliably elicited more accurate responses, then I would think, okay, that's something the model already knew how to do.

[01:01:06] So that might be one way.

[01:01:07] Audience member: [Inaudible]

[01:01:14] Jacob Steinhardt: Yeah, it's a good question. I would personally guess that there's more knowledge in models than we're currently extracting, but that it's certainly nowhere close to 100%. I would guess that many of the things that we observe models not having right now are things that they, in fact, don't currently have.

[01:01:31] Anca? Anca? Yeah.

[01:01:34] Audience member: I have a follow-on to that. [Inaudible, comment about bias]

[01:01:52] Jacob Steinhardt: So, I basically agree with this as an issue. In terms of how much of an issue it is, it feels to me like it depends on how persistent the bias is. So if I'm training in a very open-domain setting where I'm seeing lots of different pieces of the distribution, and people have different biases on each piece of the distribution, then I might hope that it's actually more efficient to first figure out what's right and then model the deltas.

[01:02:55] Yeah.

[01:02:59] Yeah. Okay.

[01:03:45] Yeah. But yeah, I think I agree with Anca. I think it's not obvious, at least. Yeah, I'd love to discuss this more, actually. But okay, I have, I think, one last question. So let's go to Percy.

[01:04:22] Okay. Wait, is it truth? I think beliefs are the harder thing than truth, I would guess, right? It's a question of whether the model represents the truth, and that representation would be its beliefs. Because truth is just a thing in the world.

[01:05:02] Right.

[01:05:03] So I think there maybe this is what you're getting at. I think it's totally possible for a model to basically behave as if it's trying to do X without having any explicit representation of X. And in that case, you would run into problems. Is this kind of what you're saying?

[01:05:33] That's the first open problem on this slide. Okay, I think I'm out of time. But yeah, I'd love to discuss more. Okay. Thanks so much, Jacob.