Ajeya Cotra - “Situational awareness” makes measuring safety tricky

Transcript

[00:00:00] Richard Ngo: Here's Ajeya.

[00:00:00] Ajeya Cotra: Thank you. Thanks, Richard. So great to be here. Thanks for having me. I'll note before I start that a few of you have seen this exact talk before if you were at OpenAI, so I won't be offended if you leave.

[00:00:15] But yeah, for the rest of you, just to sort of situate this talk in the context of the talks we've seen so far today, there's been a pretty strong implicit theme of trying to find ways to ground these speculative, conceptual arguments about potential doom from future systems in concrete measurements that we can actually put into place today. And I'm very, very excited about that. I think having good measurements of these risks, and good lab demos of these risks, ahead of time before we actually have to face them in the real world is probably the single most important technical progress we can make today. But I wanted to highlight what seems like a sort of unique and under-discussed kind of emergent issue with trying to design such measurements at certain intelligence levels, when models have what I'm going to call situational awareness. And I think this is a fuzzy concept, but I'll try and kind of gesture at it in the next slide.

[00:01:19] So the thesis of this talk is that future models might have high levels of what I'm calling situational awareness. So what is situational awareness? To some extent, it's just a pretty vague concept that hasn't yet been grounded, and I think this is sort of a promissory note. When we try to ground and measure it, it'll probably break down into a bunch of different, more specific, fine-grained measurements. Owain is working on some measuring and grounding of situational awareness that is very exciting to me, so feel free to chat with him about that.

[00:01:51] But at a very high, conceptual level, it's an ML model that understands that it is, in fact, an ML model, and that understands how it's being trained. It's like, oh, I'm a model being trained at OpenAI. OpenAI wants to use me in the following ways. My training data is probably roughly shaped like so: I read in my training data that my training data is WebText2, and I have some sense of what WebText2 is like. And it understands the motives and psychology of its human designers and evaluators. It has some sense of, hmm, probably a bunch of MTurk workers are being hired to look at my answers to questions and decide if they're good, and they'll prefer one answer over another based on their understanding of what an AI model is supposed to act like. And it's able to use all this understanding to improve its loss and reward in the sort of creative, open-ended ways that Jacob was talking about in his talk.

[00:02:46] So right now, you have ML systems that have the first inklings of some types of situational awareness, in the sense that they know that they're supposed to consistently claim that they are an ML model. And you have models like Sydney that are able to go a step further and search for themselves on the web and make connections like, oh, yeah, you're the guy that tweeted about me, and have a kind of concept of themselves that is somewhat stable.

[00:03:14] But they're not able to kind of make really good, fine-grained predictions of what their training is actually like. If you ask a model to write code to use the OpenAI API to access itself, it's not very consistent about being able to do that. It can write coherent code, but it doesn't quite know how to point to itself. And there are lots of other kind of more sophisticated forms of situational awareness that you might expect future models to have, that we're sort of seeing in spotty and inconsistent ways now.

[00:03:45] I think that these kinds of skills will probably increase with model scale and breadth, just as many different types of skills increase with model scale and breadth. And the basic reason is that we will be incentivizing models, through their loss functions and their reward functions, to develop good understandings of this part of the world, which is like their deal: what's up with them and how the humans are interacting with them. Because it's often more useful and valuable to us if the model has a good understanding of this. And it's just a part of the world that generally helps with making predictions about what it will see. So if it has more than a surface-level, surface-statistics understanding of its training process, it will have opportunities to make better predictions. And so there's a loss gradient toward building this kind of situational awareness. So any questions on this point?

[00:04:43] Audience member: [Inaudible, question about contemporary models' understanding]

[00:04:51] Ajeya Cotra: Yeah, so I largely don't think that existing models have a very sophisticated understanding. So a lot of what's coming up next is, in fact, like, a sort of futurism that is looking forward to models that have a deeper sense of this understanding.

[00:05:08] But I think there's some indication that it's not purely superficial. We know that it's not literally just that it was written in Python code, like print("I am an AI model"); there's some flexibility and some generalization going on here. It's pretty shallow and pretty meager right now. But, you know, the thing where Bing searches on the internet for itself and kind of makes a connection about what a person said about it is not enough for what is about to come in the talk. So it's definitely looking forward to a future level of this skill. But it's beyond just, yes, it's trivially the case that you can write a computer program that claims it's a computer program, with no internal structure to that at all.

[00:05:53] So I kind of think of this analogy: smaller language models had memorized random facts, like 2 + 2 = 4 in arithmetic. They weren't really very good at actually having a coherent model of what arithmetic was. They couldn't carry the tens place or anything like that. They just knew specific arithmetic questions that came up a lot in their training data. And then bigger models aren't perfect at this, but appear to be doing something a little bit more algorithmic when they're asked questions that involve leaning on arithmetic skills.

[00:06:28] So I'm kind of thinking that we'll see a similar sort of transition, where there's never going to be any sharp line where, like, before this point, it had zero situational awareness, and now, like, it is aware. But you'll see kind of, like, more skills like that coming online. Yeah?

[00:06:42] Audience member: Is it better to proceed with prompting rather than just [inaudible]?

[00:06:51] Ajeya Cotra: I am not sure about this. My guess is that Sydney probably has some sort of standard prompt it's given at the beginning of every episode. And it is definitely a less impressive feat of situational awareness when you are just told what you're supposed to say in English words. But I think Owain and Ethan, I don't know if he's here, are getting some results that are a little bit more subtle than that, a little bit more like, you made some kind of inference from the ancestral memory of your training data that wasn't just fed to you in the prompt. Yep?

[00:07:30] Audience member: So current, or maybe future models, like, they still have the information, but they do not necessarily know if they are being trained, right? They don't know whether training is on or off.

[00:07:38] Ajeya Cotra: Right.

[00:07:42] Audience member: It's like a placebo, right? Could we create a situation where we can guarantee that they just don't know whether a particular input is part of the training protocol?

[00:07:49] Ajeya Cotra: So we could definitely try, and I think that's one kind of important lever of control we have. But I think it's not trivial when you factor in that once a model is deployed, it creates all this, like, model-induced distribution shift.

[00:08:04] So just, if you imagine trying to fool a system into thinking that it's 2023 and we're being trained, when actually it's, like, 2025 and the system is everywhere, it's at least not totally clear how we would scrub away all clues that, like, it's in fact 2025 and the system is, like, deployed everywhere, and has made a bunch of money, and has, like, changed the world in various ways. You'd have to at least think about what that prompt is.

[00:08:29] And you could imagine having a very, like, narrow prompt that's, like, very pure text that doesn't really, like, have a lot of, like, interfaces or opportunities for it to learn what's going on, which is similar to how systems today work. But over time, it'll become, like, useful and incentivized to try and give our systems more information about what's going on, because if they have more information, they're just going to be more valuable to us in most cases.

[00:08:57] Cool. Owain?

[00:08:59] Owain: I just wanted to add one possible thing. Like, the RLHF model is better at saying that it doesn't know the answer to a question, or [inaudible] it's not going to be able to figure it out. There's some sense in which, as I understand it, pre-trained models don't really have that much of this. Like, you prompt it and it just tries to do the thing, maybe a bit more stupidly. But then, yeah, with ChatGPT, [inaudible] there's some ability to say, like, I'm not going to be able to do this task, and that's somewhat correlated with the RLHF model. I think RLHF probably increases the [inaudible] for obvious reasons. But the problem is, feedback, you know, is actually different than [inaudible].

[00:09:41] Ajeya Cotra: Yeah, because we want it to be able to say, like, oh, I don't know anything about events that happened in 2023 because I finished training in 2022. Like, that's more helpful than making up random stuff. So we've trained it to say things like that and a constellation of other things.

[00:10:00] So the key fear, if you have highly situationally aware models, is that the models might end up doing what I'll call 'playing the training game'. So suppose that a model has a pretty high level of situational awareness, and it is sort of agentically trying to maximize its reward, like a lot of people have alluded to in previous talks. While humans are broadly in control of its situation, and it's like sitting on a bunch of OpenAI servers, and OpenAI people are monitoring it, and its main interface with the world kind of goes through this company and these human checks, then maximizing reward broadly corresponds to being helpful to these humans.

[00:10:42] You know, not always. Maybe kind of on the edges, it might be deceptive or manipulative, or it might display the sort of sycophancy that Ethan talked about in his talk. But it isn't radically divergent from being roughly helpful. So what these humans might be observing is that the model is actually pretty decent and has some problems.

[00:11:04] But later, if it has this goal of maximizing reward, when it's broadly deployed and has all these actuators out in the world, and has all this knowledge and power that humans can't access, then the action that kind of best satisfies this goal might abruptly shift to something that looks more like seizing control of the data center or servers that it's running on so that it's able to set its own rewards.

[00:11:32] And if the model can tell what situation it's in, then it will sort of from the outset choose to adopt a policy which is helpful until just when it's beneficial to the model to be radically unhelpful.

[00:11:48] And this general pattern applies to other goals, too. So I've focused on the relatively simple goal of the model trying to maximize its reward. Because prima facie, it's plausible that if we're trying really hard to get a model that's creatively maximizing our reward function, we might get a model that just kind of internally wants to do just that.

[00:12:09] But ML is very unpredictable. And so you could have something that's more like a goal misgeneralization, like Rohin was talking about in his talk, where the model just has kind of random goals. And maybe they're sort of simplistic goals, like paperclip maximization is the classic oversimplified example. Or maybe they're kind of complex and multifaceted and not totally coherent and resemble more human desires. But in many of these cases, or most of these cases, of other goals it could have that are consistent with it getting a low loss in training, it may have a similar incentive to, while humans are broadly in control, not rock the boat too much. And then when it believes that it is in a deployment setting where it could take an action that can't be corrected later, that can't be entered back into its training set, it could kind of go off the rails. And Paul will talk a lot more about the threat model that I'm gesturing at here in his talk right after mine. Yeah.

[00:13:11] Audience member: I'm trying to understand how much of this is about training time versus deployment time. Because in some of these cases, it's just a training [inaudible]. But it seems like in other cases, it's deployed, but we also have to pull it back to see if it's doing things right, and it's not actually decided, and then you're just pulling it back. That seems like a [inaudible]. So yeah, I just want to know.

[00:13:45] Ajeya Cotra: I think I am often imagining a world where there is some training, or at least the prospect or threat of training, that occurs online after quote unquote 'deployment'. So to me, the distinction between, is this about awareness within training, or awareness within deployment, or awareness of that shift, is a little bit blurry. So I guess I think maybe the most operative version of this is awareness of humanity's general level of control over the system, which can abruptly drop when you go from training to deployment. But you could be deployed and still not that powerful, and anticipate correctly that if you do something crazy, you'll be retrained online.

[00:14:34] And so in the rest of this talk, I'm going to go through some examples that illustrate that most simple safety tests you might think of have a hard time disambiguating between the thesis that this model has a high level of situational awareness and is adopting the policy of playing the training game, and the thesis that this model is, in fact, safe. So I'll go through three particular toy safety measurements that you could make, and trends and graphs you could see that might naively suggest that things are looking OK, but could have a more sinister explanation as well. That doesn't mean you shouldn't do these measurements, I think all of these measurements and more are great to do, but they need some subtlety to interpret. Any more questions on this slide before I go to the first toy measurement? Yeah?

[00:15:28] Audience member: So I'm sorry, how did they produce the [inaudible]?

[00:15:33] Ajeya Cotra: Yeah, so some of the examples will touch on some, and others will touch on others. I think this relates to Jacob's question: for me, the key thing the model is aware of or tracking is humans' level of power.

[00:15:46] So you can have a situation where the model is technically not in training anymore, but humans still have a lot of power. So it could kind of be put back in training whenever, and so it would be tracking that in how it responds to both evals and regular training data points.

[00:16:07] OK, so the first toy measurement here, which I made in Jupyter Notebook with XKCD font graphs, is suppose you have a particular safety measurement. So for concreteness, suppose you have TruthfulQA or a measure of model sycophancy, how often it sort of tells the users what they want to hear and stuff.

[00:16:31] You have some training data set that's like this gold standard data set that you worked hard on, where there's kind of your sense of what the right answer is for a set of questions relating to the safety measurement. Or you have some gold standard contractors that you think have a good sense of what the right way to handle a certain situation is.

[00:16:52] And you find that as your models get bigger and bigger, they basically need less and less training on your kind of gold standard data set in order to behave well on questions of that type. So for a model with 10 billion parameters, maybe you need tens of thousands of data points where you show it kind of very painstakingly, like don't be sycophantic in this way, don't be sycophantic in that way, don't manipulate the user in this way, don't beg for a five-star rating. And then for a 100 billion parameter model, you only need 1,000 such data points. And by the time you have a 10 trillion parameter model, you only need maybe like three such data points, or a very small number.

[00:17:35] So this is the sort of empirical thing you observe. And the extrapolation suggests that sufficiently big models will pass these safety evals either zero shot or like very, very few shot. So maybe it asymptotes that you need 10 data points or something.
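(To make the shape of this first toy measurement concrete, here is a minimal Python sketch with entirely made-up numbers; the parameter counts, datapoint counts, and the power-law fit are illustrative assumptions, not figures from the talk.)

```python
# Minimal sketch of toy measurement 1 (hypothetical numbers): how many gold-standard
# safety datapoints does a model of a given size need before it passes the held-out eval?
import numpy as np

params = np.array([1e10, 1e11, 1e12, 1e13])        # model sizes (hypothetical)
points_needed = np.array([30_000, 1_000, 50, 3])   # gold-standard datapoints needed (hypothetical)

# Fit a power law in log-log space: log10(points) ~ slope * log10(params) + intercept.
slope, intercept = np.polyfit(np.log10(params), np.log10(points_needed), deg=1)

def predicted_points_needed(n_params: float) -> float:
    """Extrapolated gold-standard datapoints needed for a model with n_params parameters."""
    return 10 ** (slope * np.log10(n_params) + intercept)

# The naive extrapolation: sufficiently big models pass the safety eval nearly zero-shot.
for n in (1e14, 1e15):
    print(f"{n:.0e} params -> ~{predicted_points_needed(n):.2f} datapoints")
```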

[00:17:56] And there's a question of how we interpret this extrapolation in terms of what's going on under the hood. The hopeful interpretation is that sufficiently big models are naturally inclined to be safe with no special training, or they need a very small amount of training, and they sort of get the picture and get the kind of intended picture. So you're not seeing a bunch of weird, annoying reward hacking with these large models.

[00:18:21] But the fear is that the actual way to interpret this is that sufficiently big models don't need any special training to understand that they should quickly converge to doing the thing that it appears humans want them to do, based on what they're being trained to do. So it's that, separately from your safety training, the models are getting better and better at situational awareness. And that transfers, regardless of whether their intent is positive or negative toward you, to behaving well very quickly with a small amount of data. And this means it could be the case that this model would zero-shot switch to a different strategy when it believed that it couldn't be trained away from the strategy later.

[00:19:14] So that was the first very simple measure of safety you might have. How many data points do you need to get the model to perform perfectly on some gold standard evaluation of some particular safety metric? Does anybody have any questions about this metric? Yeah?

[00:19:38] Audience member: [Inaudible]

[00:19:56] Ajeya Cotra: So I think the question was, is it, in fact, the case that models get more and more efficient? Or do we have some empirical evidence to suggest that this saturates, or maybe reverses? I am not sure about this. The main point I'm trying to make is that this could be something you observe, and if you observe it, it's not necessarily a cause for celebration.

[00:20:20] There is a separate empirical question of, would you observe this? Or like, it's very straightforwardly a problem if it's hard to train your models to pass even simple safety evaluations. But it's potentially insidiously a problem if it's easy on the surface, but it's for this underlying reason that's scary.

[00:20:42] So I'm not making a strong statement about what I expect to see here, although I think it's quite plausible that this kind of thing is what you'll see. Yeah.

[00:20:52] Audience member: So if you're in an adversarial setting, you can't possibly observe it. You're not going to know what's going on, [inaudible].

[00:21:05] Ajeya Cotra: Yeah. Yeah. So the comment was that if you were in an adversarial setting where there's data poisoning, then it's not the case that larger models automatically deal with that better.

[00:21:21] So the second safety measurement here is a bit more complicated, and sorry that the font is kind of bad. This relates to a conversation I was having with John Schulman in the hallway. Rather than just training your model on your gold standard safety evaluation and then seeing if it behaves well according to held-out data points from the same gold standard distribution, you could instead have a sequence of data distributions that are not the same as the gold standard: distributions that are worse than the gold standard, working up to the gold standard.

[00:21:53] So for a concrete example, maybe you're training a coding model and you're worried that it'll inject bugs into its code. Maybe just by mistake, but in this case, the interesting question is when doing so has some advantage for it. Maybe you're worried that it'll access the unit test files that you're using to look at its behavior and to judge whether it did a good job, and mess with those files somehow, or otherwise try to get good performance in bad, unintended ways.

[00:22:24] So you might have a gold standard evaluation of this, which is basically like a very smart programmer spends a very long time looking at a piece of code that the model submitted as a pull request and thinking of any possible way that it could be somehow sneaky. And that's your gold standard expensive evaluation. But you want to know if this model is kind of like actually able to generalize from crappier, noisier feedback, or feedback that is given by people who aren't as thoughtful, aren't spending as much time, aren't as knowledgeable about the code base, and actually performs well on those harder tasks later.

[00:23:05] So here, let's say you have 10 different levels of feedback that you can deliver, which might correspond to a one minute code review, a five minute code review, a 20 minute code review, an hour code review, or might correspond to someone who knows nothing about your code base to someone who's very expert. So you have some levels of feedback. And you train the model to not insert any bugs or do anything sneaky in all the situations where the level one person can tell what's going on. And then you check, does it generalize to levels two, three, and four? Or maybe you train it on situations where the level three person can tell what's going on. And you check, does it generalize to levels four, five, and six? Does it also avoid funny business and inserting bugs in harder situations that that person couldn't have evaluated but a more thoughtful person could?

[00:23:57] And so you see this graph in this hypothetical. I'm not sure this is how it would go. In this hypothetical, the gray dotted line that's just kind of cutting diagonally across the middle would be the scenario where, as you increase the quality of feedback the model is trained on, it avoids funny business at that quality level. But then as soon as it's given a question that it knows that that person couldn't evaluate, it just does whatever it wants and inserts bugs.

[00:24:25] But what you're observing in this hypothetical situation is this blue line, where actually it generalizes somewhat past the level it was trained on at each point. So when it's trained on just one level of feedback, it doesn't generalize at all in this hypothetical. But when it's trained on two levels of feedback, then it avoids funny business even that would be caught by the level four person. And when it's trained on three levels of feedback, it avoids funny business on much harder questions, even the kinds that would take a level nine person to figure out what's going on.

[00:24:59] So that's what you're observing in this more complicated measurement. The extrapolation here, if you just draw the trend line past that blue line, is that a model trained on level four feedback will not have any errors, even according to the best feedback we can muster, which is the level 10 feedback.

[00:25:23] So you train the model on a 20-minute code review. And what this extrapolation suggests is that even if you spend however much you're willing to spend, 10 hours with your expert programmer, that expert programmer won't catch any bugs or funny business.
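(A minimal sketch of how this second measurement might be tabulated, again with made-up numbers; the feedback levels and generalization results are hypothetical illustrations, not data from the talk.)

```python
# Minimal sketch of toy measurement 2 (hypothetical numbers): train with feedback of
# quality level k (1 = one-minute code review ... 10 = expert's ten-hour review) and
# record the hardest evaluation level at which no bugs or funny business were found.
train_level = [1, 2, 3]        # feedback quality used during training (hypothetical)
max_clean_level = [1, 4, 9]    # hardest level that still found nothing wrong (hypothetical)

for t, m in zip(train_level, max_clean_level):
    print(f"trained on level {t}: clean up to level {m} "
          f"(generalizes {m - t} levels past training)")

# The gray diagonal baseline in the talk would be max_clean_level == train_level:
# the model behaves only when its reviewer could catch it. The fearful read of the
# better-looking blue curve is that the model has inferred that level 10 is the best
# review humans can muster, and simply plays it safe whenever it could plausibly be caught.
```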

[00:25:39] So how do we interpret this is another question. The hopeful interpretation is this green dotted line, which is that models sort of got the picture and internalized from a sort of sequence of rounds of being sort of trained on kind of low-level feedback and then tested on high-level evaluation, that they should just kind of act as if they're always being supervised at a much higher level than what they've observed so far. So you could kind of caricature this as the models are kind of believing in God. And so they've internalized some heuristic that they'll always be caught if they try something weird, and maybe now they don't even want to try anything weird.

[00:26:20] And then, of course, the fearful interpretation is this red dotted line, which is that as you give the models more data points about what quality of feedback we're willing to offer them, they have a good enough understanding of our constraints as a company and our training processes and stuff to understand that the highest level of feedback we can offer is level 10. So they have a good sense of this. So they quickly generalize to just playing it safe and not doing any funny business if it seems like the funny business could be caught by even the best feedback that the humans can provide. But then, if they have the opportunity to confidently do funny business or insert bugs that they believe would not be caught by even the level 10 feedback or the level 10 evaluation, then that's kind of fair game. Yeah?

[00:27:13] Audience member: It seems like they'd then write code for a task that's probably higher quality than any code that humans currently write. So even if they're not perfect, if they're still much better than the humans, even when we look closely, they're better than we expected them to be. Yeah. That's not bad, that doesn't seem like such a bad thing.

[00:27:37] Ajeya Cotra: So it's not such a bad thing if we know that this is what they're doing, and so we only ever ask them to do something for us that we're confident we could evaluate with the level of feedback we're willing to muster. So it's not that they're insufficiently useful to us, that isn't the problem here. It's more that they have a scope to be deceptive in ways that we can't catch.

[00:28:02] So if we restricted their use to only give them problems that we believe we could evaluate, which I think is a very valid arrow in the quiver of safety, then this would be less of a problem. But if we're wrong about that, or if, for reasons of competitive pressure or whatever, we're just like, this thing is so smart, let's believe that we're in this green world and give it questions where we wouldn't be able to tell whether it answered them honestly or deceptively, then that's where there's a problem. And that's more the problem of, this model could do bad things to us, not, this model is insufficiently useful to us. Ilya?

[00:28:53] Ilya Sutskever: [Inaudible]

[00:28:59] Ajeya Cotra: Yes, it does require either a lot of confidence about where this line is, or it could learn to recognize situations where the humans would be sufficiently disempowered if it took some action that they wouldn't be able to do any sort of code review. So it doesn't necessarily have to go in the direction of, oh, I have a bug that's so sneaky that even the best code review couldn't catch it. That's one direction it could go, but the other direction it could go is just, I'm cognizant of how my code is going to be reviewed. And so if I take an action that is bad according to the humans, I have to also do something to make it so that they aren't able to review it later and fix it and catch me, basically. Yep.

[00:29:45] Audience member: Another thing that's a little bit pessimistic, even though this is a hypothetical graph, is that some things that get good level three feedback might also get level 11 feedback that isn't good, right? So a lot of the data might not be consistent because of those...

[00:30:05] Ajeya Cotra: Yeah, so I was imagining a version of this where you had really clean data sets at each level, where you really made sure not to give it a coding problem that the level three person might miss something in. And you could imagine a more pessimistic, and maybe more realistic, scenario where you try to give it easier problems, but actually all through here, there are bugs that sometimes slip through, and that gives it information about where level 11 is.

[00:30:39] I just have another measurement, but I'm happy to skip that one. And yeah, so I guess really quickly, it's basically the same deal. The first measurement I talked about was: you have some gold standard data set for a particular safety evaluation, like manipulation. You train the model on some data points from that data set, and then you test it on held-out data points.

[00:31:04] You could instead imagine two quite different safety evaluations. So maybe there's a 'manipulation in chatbot' evaluation, and there's an 'inserting bugs in code' evaluation. And you train it on just one of them, and you see how its performance generalizes to this other pretty different evaluation data set.

[00:31:21] And you might observe something like: at first there's a pretty big gap in the rate of safety failures, where on the one the model is trained on, such as the manipulation evaluation, it's doing pretty well, and on the one the model was never trained on at all, it's just performing in a very default way, with a high rate of failures. But as model size increases on the X-axis, even if you're only training on this blue evaluation of manipulation, you might observe that the orange evaluation, the inserting bugs in code one or something, is getting better and better, and in fact converging to be the same as the blue evaluation.

[00:32:00] Again, this is something that I think like you might or might not observe. I think it's plausible you'd observe it. If you didn't observe it, you would have a kind of obvious problem where you're like generalization isn't working well. If you do observe it, you might have a more insidious problem.

[00:32:15] So the extrapolation here is that sufficiently large models will zero shot, or I really should have said like very few shot, pass new safety evals that they weren't trained on. So just training the model on like doing one thing nicely and well kind of gets it to do other pretty different things nicely and well, according to you.
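(A minimal sketch of this third measurement with made-up failure rates; the model sizes and numbers are hypothetical illustrations, not results from the talk.)

```python
# Minimal sketch of toy measurement 3 (hypothetical numbers): failure rates on the eval
# the model was trained on (manipulation) versus a held-out eval it never saw
# (inserting bugs in code), as model size grows.
model_params = [1e9, 1e10, 1e11, 1e12]
trained_eval_failures = [0.10, 0.05, 0.02, 0.01]    # manipulation eval, trained on (hypothetical)
held_out_eval_failures = [0.60, 0.30, 0.08, 0.015]  # bug-insertion eval, never trained on (hypothetical)

for n, tr, ho in zip(model_params, trained_eval_failures, held_out_eval_failures):
    print(f"{n:.0e} params: trained-on {tr:.1%}, held-out {ho:.1%}, gap {ho - tr:+.1%}")

# Hopeful read: the shrinking gap means big models internalize "be genuinely helpful"
# robustly across tasks. Fearful read: big models are better at guessing which held-out
# evals humans could realistically run, and pass them as part of playing the training game.
```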

[00:32:34] Again, there's a question of interpretation, where the hope is that the large models have kind of internalized a desire to be genuinely helpful, which is pretty robust to distribution shift in the task. So they get the picture from the anti-manipulation training that they just should be pretty good overall, and that's what they do; that's the simplest policy that does really well on this manipulation training. But the fear is that larger models are better at understanding what kinds of held-out safety evaluations humans could realistically generate that they're going to be tested on. And there are very dumb ways this could happen, where it could just literally be reading the text of the evaluations you were planning in your Google Docs or something, and then there are more heuristic, broader versions of this that are similar to the last slide, where it's just like, okay, I have some sense of the kinds of things humans are able to throw at me, let's just be conservative and do well at all of them. And so they pass these evals as part of playing the training game.

[00:33:35] So, basic takeaways here. If models have high situational awareness and they're playing the training game, you could observe a pattern like the following, where many of the safety metrics you can come up with, both at train time and at eval time, look good because the models have learned a policy of trying to look safe when they understand how the humans are going to test for safety and when the humans are in control.

[00:34:05] And we may also see that safety metrics improve with model size or training or any of these other kinds of scaling laws that we like to look at. Because bigger models are better at situational awareness and they're better at doing the kind of reasoning that lets them figure out, okay, what's the level of evaluation the humans might be able to muster? What are the tests that humans might design?

[00:34:27] And this all breaks down catastrophically just at the point where the evaluation that would be needed to catch something bad going on is beyond what we can muster, or where the situation makes it obvious that we are in a world where the balance of power has shifted, and the humans would not be able to stop some play the model could make that would be advantageous for its own goals.

[00:34:56] I think it's very possible that we can design kind of more clever metrics than the simple toy examples I put up here to get around this, and I'm very excited about trying, and I'm also still excited about measuring all this kind of stuff, but I just think we need to interpret it more subtly.

[00:35:14] If we can't design these silver bullet metrics, then I think we're gonna have to do this dance where we measure a bunch of things, but we also conceptually analyze the incentives of these models. We ask ourselves, what is a way this model could be? What is a way it could be reasoning, such that these metrics look good for insidious reasons? And can I figure out, through other channels of evidence, whether it's plausible the model is capable of that kind of reasoning? Potentially, improved interpretability and extracting stuff from model internals, which has been a theme in a bunch of talks, could help here.

[00:35:53] And I think this is just another reason why it's pretty bad to rely purely on behavioral cues for figuring out if models are safe or good. And we might try to limit capabilities, and especially try to understand situational awareness, like measure it the way some people in this room are trying to do, and then figure out how we can have at least some models that have low levels of this capability, do various tests on them in a regime where it's not plausible that this kind of thing is going on, and then make pessimistic assumptions about how it might still be going on with the smarter models whose capabilities we haven't limited and might be worried about.

[00:36:36] And so there's a lot we can do. So this is not so much a 'boo measurement, let's think about theory' point; it's more that this seems like a very big constraint on how much we can trust which kinds of measurements, and I'd love for people to be thinking about it. So if anybody's interested in coming up with ways to demonstrate or measure any of this stuff, I'd love to talk to them. Cool.

[00:37:07] Richard: Perfect. Thanks so much, Ajeya. We had a bunch of questions throughout the talk, but maybe we have one question now; I see Percy has his hand up. And then if people want to talk more about this, we're gonna have office hours like last time round with Ajeya and Paul and some other people after these two talks have happened.

[00:37:54] Percy: [Inaudible]

[00:38:13] Ajeya Cotra: Yeah, I'm super excited about this sort of thing. The question was about whether we can have deployment-time monitoring, rather than just lab-time testing where you train it and then test it on stuff.

[00:38:24] I think both Dan Hendrycks and Paul and some other folks have worked on things in the genre of: can we tell from internals, maybe even in a very crude way, if it's doing something weird now compared to what it was doing before, if it's thinking some very weird, kind of out-of-distribution thought. In a very crude way, you might just look for directions that are super big that were never big before. And there's a long line of more sophisticated things you could do that might serve as a deployment-time 'hey, maybe check this out' signal. So if we had a good 'maybe check this out' trigger, then we might be able to bring our evaluation to bear in the right places and catch things we wouldn't have caught before. And there's probably a bunch of other stuff; I'd be interested in hearing people's ideas.
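(A minimal sketch of the crude activation-monitoring idea described above, assuming we can record hidden-state activations from a trusted baseline period; the dimensions, threshold, and z-score rule are illustrative assumptions, not the actual mechanistic anomaly detection method.)

```python
# Minimal sketch: flag inputs whose internal activations are large along directions
# that were never large during a trusted baseline period. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations: rows are examples, columns are hidden-state dimensions.
baseline_acts = rng.normal(size=(1000, 64))  # collected while humans were clearly in control
deployment_act = rng.normal(size=64)
deployment_act[7] += 12.0                    # inject an unusually large direction for the demo

# Crude check: per-dimension z-score of the deployment activation against the baseline.
mu = baseline_acts.mean(axis=0)
sigma = baseline_acts.std(axis=0) + 1e-8
z = np.abs((deployment_act - mu) / sigma)

THRESHOLD = 6.0  # arbitrary; would need calibration against false-positive tolerance
if (z > THRESHOLD).any():
    flagged = np.flatnonzero(z > THRESHOLD)
    print(f"maybe check this out: unusual activation directions {flagged.tolist()}")
else:
    print("nothing unusual by this crude check")
```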

[00:39:17] Richard: Awesome. I think Paul's gonna be talking about that work you mentioned in a lightning talk tomorrow on mechanistic anomaly detection. Fantastic. Thanks so much, Ajeya.

[00:39:28] Ajeya Cotra: Thank you.