Dan Hendrycks - Surveying Safety Research Directions

Transcript

[00:00:00] Dan Hendrycks: Is there a clicker? Great. Okay, thank you. Okay, thank you for that introduction. So, throughout this workshop, there have been many challenges raised. Some obvious ones would be things like automation: we'll have to be concerned about power inequality and how labor ends up, or how capital ends up replacing labor. Later on, there's worry about persuasive AIs that can be used to manipulate the public and create echo chambers. There's also concerns of weaponized AIs, and that in combination with other things can get very existential. There are bad actors as well who could repurpose the technologies for building unfortunate or very dangerous things, such as bioweapons. And then there's, at a very extreme end, some type of lock-in risk where there's an unshakable totalitarian regime because they have surveillance.

[00:00:56] So, a lot of very difficult concepts to process here. We'll try and ground this and think of what parts can we as technical researchers actually try to mitigate.

[00:01:07] So, I think instead of trying to search for a monolithic solution that solves the alignment problem, or some type of airtight silver bullet, instead what we should try and do is, we should try to pursue multiple different safety research avenues to mitigate overall risk. There are many aspects of the problem, it touches on many different parts of society, so this makes me suspicious of some way of, quote-unquote, solving the alignment problem.

[00:01:35] So, one part is systemic safety: if we have better institutions, that should obviously help us address many of these issues. And on the technical side, there's monitoring, robustness, and alignment. And so, I'm going to speak mainly about monitoring, robustness, and alignment during this, but it's a pretty complicated sociotechnical problem.

[00:01:53] So, first with robustness, that's about withstanding hazards or reducing vulnerabilities inside of models. And for monitoring, what we're doing is we're trying to identify hazards in models, or reduce our exposure to them. And in the case of alignment, that's something like reducing inherent model hazards. So, you've got a model, make it inherently less bad so that if something does go wrong, if some of your control techniques fail, it still is ultimately safe.

[00:02:22] So, all of these together, if you reduce your vulnerabilities or reduce your exposure to hazard or reduce the inherent hazard itself, all those things should drive down risk.

[00:02:35] So, first I'm going to speak about robustness and some various topics in there. At a higher level, this presentation is mainly where I'm just flagging some, here's some technical, empirical problems that can be studied today, and it's largely an invitation for people to take up further work.

[00:02:54] So, one problem is proxy gaming, or some previous presentations were calling this reward hacking. Here, agents may optimize and be guided by neural network proxies, such as by networks that model human values. So, if a designer has an objective, they don't have a perfect representation of that objective. They've got some intuitive notions like, well, behave reasonably and pursue this goal that I care about, but it's a little wishy-washy, they have a neural network that sort of judges how good the agent is doing, and so, then a policy or an agent ends up optimizing against that proxy. So, what gets measured gets managed.

[00:03:31] But if we're having a policy optimizing a proxy, it's as though it's functionally adversarial to that proxy: it's trying to maximize that. And so, that can end up creating some undesirable dynamics where the agent could end up pursuing the wrong direction, not pursue what we want, due to vulnerabilities in the proxy.

[00:03:51] So, I think this is a really interesting setup and is pretty related to robustness. If we can make the proxy itself more robust, then we'll have less of these undesirable behaviors.

[00:04:03] A schematic, one figure is from Kruger and one is from one of John Shulman's recent papers, is we can optimize the true reward. But then, as you keep optimizing it, eventually the proxy and the actual objective end up becoming decoupled and there it is at the right empirically happening in practice.

[00:04:23] So, some... So, the question is, can we improve robustness of these? How does this work? Okay, you're right. This is a little interesting.

[00:04:38] So, the standard setup when training or when constructing proxies is we'll have some comparisons, and from that we'll train a proxy model, and then a policy will optimize against it. A recent paper from Shulman et al showed an interesting setup where what you could do to study proxy gaming is you could, in an automatically evaluatable, benchmarkable way, what you could do is you could take real comparisons, get some gold standard reward model, at least one that's better than the proxy one, and the gold standard reward model can create some synthetic comparisons which can be used to train the proxy model, and then we can have the policy optimized against the proxy.

[00:05:16] We can then see the performance of the policy, as judged by the proxy model and judged by the gold reward model. And if they come apart, that's evidence that there's some over-optimization happening. So, this is something that we can experiment with today, and we're working on creating some benchmarks specifically for this to enable people to research this.

[00:05:40] Another thing we're researching and creating a benchmark for is, imagine you've got a true environment with some true rewards, and then you've got some proxy model that's annotating the environment with some not necessarily perfect rewards, but its own types of the rewards as judged by a neural network. Then you have the policy optimizing or interacting with that proxy labeled environment, and then you can see how well the agent is doing as judged by the true environment. So, is it exploiting peculiarities of the proxy labeled environment? If so, that's evidence of some proxy gaming.

[00:06:19] So, these are things that I think people who are working on language models can research now.

[00:06:25] Another thing that people working on language models could do is just adversarial attacks for large language models. So, an issue with large language models is, of course, they've got pretty discrete inputs, and so how are you going to optimize those sort of things? You can't just - doing the gradients is[?] a lot harder, but fortunately in the past two years, there have been some gradient-based adversarial attacks that you can study it, so that if you have a movie classification data set such as IMDB, you can have the accuracy go from very high to very low with these automatically generated adversarial attacks.

[00:06:59] They use things like Gumball Softmax to make that possible, but this is just another concrete research direction that people could pursue in trying to make neural networks have fewer vulnerabilities.

[00:07:11] And then if you're working in vision, you could try and make models robust to unforeseen attacks or less restricted adversarial attacks. So, the attackers in practice aren't going to necessarily use an LP attack specifically, they might apply many different modifications to it, and so I think it'd be useful to make our models robust to any sort of optimization pressure, not necessarily one that adheres to a specific threat model.

[00:07:37] So, you can create lots of attacks by having some corruption like snow, but if you randomly sample snow, sometimes that won't fool the model, but there are often adversarial settings of corruptions that can fool the model substantially. So, for this task, the task is primarily train a model and then evaluate it against multiple unforeseen adversarial attacks and see its robustness to that, and this is another thing that people can research today. There'll be an update, [inaudible] developed many new unforeseen attacks, which should hopefully be out in a month or so.

[00:08:15] And then let's speak about monitoring. So, now we're talking about the withstanding hazards: how can we identify these hazards in the first place, empower human operators who are overseeing these models, and reduce our exposure to them?

[00:08:27] So, as you just heard, we just had a presentation on mechanistic interpretability and transparency, so I'll sort of pass this point by. Normally it should get a lot more length, but since you just had a presentation on it, but we want to clarify a model's inner workings. Sometimes some small changes to the architecture can really change the internal representation. Here are two different StyleGANs, there's some small changes in how you're encoding the input data, and as it happens, the internal representations end up having some very different qualitative properties, that there's some equivariance at some smaller scales and whatnot.

[00:09:05] And we'd like to just be able to identify when there are these sort of changes in the models and understand how they're making these decisions. When you see that visualization, you go, what's going on there? This doesn't look like a face, or kind of. It would be useful if we can have better tools for this.

[00:09:19] One area I'm particularly optimistic about, or think is very interesting, is building on some people here at Brain contributed a paper where they found some emergent optimizers inside of transformers, where if they're doing in-context learning, there's something like an optimizer inside of it. I think some transparency research about that would be particularly interesting, and just trying to figure out what's going on with these internal optimizers inside of networks.

[00:09:48] The high-level motivation in connecting it to some of the discussions yesterday was, it should make it a lot easier to detect deception if we have some good transparency tools, or situational awareness, or other sorts of hazards. So it's just a generally good tool, and it would be a generally good thing to have. There's maybe some questions of tractability, but it seems worth having lots of people trying to make progress on.

[00:10:11] I'll now turn to anomaly detection, but in doing so, I'm going to recap slightly something that Jacob mentioned yesterday, which is about proxy gaming. You might recall that you might have a proxy reward where, like, maximize mean velocity, and as the model is small, it's doing the right thing, but as it gets larger, it starts gaming the proxy.

[00:10:30] So you can use anomaly detection to help with proxy gaming. Basically, as the model is optimizing the proxy, something goes wrong, the sort of gears are going awry, and there might be ways of detecting that. There's some unusual behavior. So if we have good anomaly detectors that can flag these for human reviews, or if there's something anomalous going on, maybe the model could be forced to execute a conservative fallback policy, or something of that sort.

[00:11:01] But when models are gaming proxies, there's often some unusual dynamics going on, and so in that proxy gaming paper, they used anomaly detection techniques to try to detect when proxy gaming is occurring. So one way we could handle proxy gaming is by reducing the vulnerabilities in the proxy itself, or we could try and catch it if it's happening, or try and increase the lead time and try and catch it as early as possible.

[00:11:29] Some other uses. So anomaly detection was the sort of first safety area that I started working in six years ago, so I've been consistently pretty happy about it because it has so many uses. So it can be used for detecting emergent hazards, and Jason was speaking about emergent properties there. If there's some unusual novel phenomena happening, we'd like to know about that ASAP. We can use it to detect proxy gaming. We can use it in applications like deep learning for intrusion detection, and defensive information security, and securing our models against people stealing them. And more sci-fi type of things: generally, it'll be useful for some type of AI watchdogs for spotting malicious actors or AIs that are doing the wrong thing, either by being directed by somebody or by their own volition.

[00:12:27] And then there's also, anomaly detection can be useful for detecting anomalous neural network execution paths during the forward pass, and if I read the schedule correctly, I think Paul will speak more about that today. So I've gone through, in monitoring, there's been transparency and anomaly detection.

[00:12:45] I'd like to speak about trojans. I think this is another nice microcosm in the machine learning literature that helps address many of these larger risks that we've been speaking about in this workshop.

[00:12:57] So one concern is that adversaries could hide trojans or back doors inside of AIs and cause some sudden malicious behavior. So we test the model on the autonomous driving data set, and it's always working well. But when we actually deploy it, maybe there's a special condition in which, like if you put a post it on the stop sign in a particular way at a particular position, then the model suddenly does something differently. And it's possible to do this, it's possible to create models that have these types of back doors or these trojans within them, and we'd like to be able to identify if they have some of this undesirable internal functionality.

[00:13:40] So how could this happen? People could add it themselves. Maybe some of these trojans could be emergent. Maybe people could poison models to have this. So these models are trained on an uncurated data set called the Internet, where anybody can upload anything. If you want to upload, you know, 30 images to [inaudible] or trojan, it seems pretty plausible that that can happen. So there are many attack vectors for trojans.

[00:14:04] And I think this is a – so I think it's a concern currently, and for some of the longer-term, more extreme concerns that we were speaking about in this presentation, I think this is plausibly a very good microcosm for studying that. So AIs with situational awareness could hide their true intentions while being monitored and execute a quote-unquote treacherous turn, and trojans provide a microcosm for studying this in current systems. So this line of research is useful for current systems, but it's also useful for longer-term ones as well, or more extreme ones. So that, I think, really strikes the sweet spot if we can get at both the existential risk stuff, as well as some of the not as severe but still important safety concerns today.

[00:14:48] So trojan detection, then, is the line of research where one's continually improving methods for detecting and removing this dangerous hidden behavior. So some adversary is trying to insert a harder trojan, and then you're trying to build a better detector, and then there's an interplay between those. You get better tools that are more adversarially stress-tested.

[00:15:06] If we can't end up identifying the trojans that human adversaries put in it, then I think you're out of luck if you're trying to find some sort of trojans that an AI might have, or a very smart AI might have. So this is already existing literature. There are ways we'll try and make it resemble more of the microcosm of trojan detection, mirror the macrocosm of some of these longer-term concerns by studying trojans in the context of large language models or reinforcement learning agents. I'll maybe save questions until the end.

[00:15:44] And then there's alignment where we're trying to reduce many of these inherent model hazards. So before getting into some of those specifics, let's just zoom out. What's this distinction between safety and alignment? Here I'm stealing a slide from David.

[00:16:01] Basically, alignment is used in many different senses, and so this can get quite confusing. One sense is we want to get systems to satisfy our stated preferences, is a usual formulation. Now, there's a question of preferences about what? You might have preferences about, well, this model is smarter than this model, and so I like this more intelligible output, and so I prefer smarter behavior. Or maybe I have preferences about code generation: this code was good, this code wasn't. So preferences is a very expansive phrase, which can easily encode things that aren't that distinctly about human values but end up encoding a lot of business goals. So one disparaging term might be that that's sort of like business alignment. It's aligned with our business goals. And, yeah, that is an alignment objective. I don't know if that's really the whole embedding ethics and having them pursue human values in particular.

[00:16:59] A better formulation is how to get systems to try to satisfy our preferences, and so this is called intent alignment. And this gets at the inherent hazard of, we want to have the model itself inherently be disposed to do whatever we want it to do, and I think that's a much better formulation.

[00:17:21] And, you know, hopefully we can more tractably study aspects of it when we can get at things like intent, and what is this, an, agent trying to do, and formally characterize that.

[00:17:35] And alignment can also be used in a sense of a rebranding of AI safety, where AI safety is about reducing catastrophic and existential risks to AI. Basically, some of the reason that some people wanted to rebrand was because - I'm not into it. I think safety is a fine enough word - but it was because safety ended up meaning autonomous vehicles and only autonomous vehicles, and there's a lot more to it than that. So it got really watered down, and so then it's, well, we'll try alignment. But then alignment kind of got watered down to business alignment. So this is just kind of an issue generally in coming up with these words that are difficult to have influence over.

[00:18:17] And then I think this is largely at this workshop, it's about this broader sociotechnical problem where we'll need technical researchers, we'll need people from other fields as well. As Boris was mentioning yesterday, collective intelligences will be absolutely essential to solving this problem. It's not some type of math problem where you have some Poincaré type of person just coming up with the solution, this is not how things work.

[00:18:40] So I think that aside, let's look at some areas in alignment that I think are good and tractable and empirical. So as mentioned yesterday, there's this work on honesty and looking at the model's latent beliefs.

[00:18:59] Getting a characterization of models' internal beliefs I think is really valuable. You could get internal beliefs about truth, that's one concept you could extract. There are other concepts in upcoming work that will show that you can extract other concepts like utility inside of these models and whatnot, or its estimation of how an action will end up impacting a person's well-being. So there's a lot of nice primitives inside of the network that sufficiently large models provide us with. I think that's a very useful line of work.

[00:19:34] I'd like to make a distinction. This is from Owain here, where I was mentioning honesty on the previous slide, I wasn't saying truthfulness in particular. Truthful is, is what it's saying true? Meanwhile, honesty is, is the model asserting what it believes to be true? So it could be mistaken but still honest, but if it's mistaken and honest, it's not truthful. This distinction becomes somewhat important when we're thinking about what's a safety goal generally. What should we be pushing for or trying to have our models be better at?

[00:20:21] So for that, I'll do an aside about the safety-capabilities ratio. So I'm just jamming in lots of stuff in this presentation. So here's a process for improving safety. Maybe what you want to do is you want to improve something like the safety-capabilities ratio.

[00:20:36] So at the plot at the right, we could scale up our model. And then I can write a paper, so I can go from the orange dot to the red dot. And I can write a paper by just, you know, I use 10x more data, and then I can say I improve safety. Look, the safety metric, the anomaly detection metric went up, we're improving safety. And then we, ha ha ha, you know, we pat ourselves on the back.

[00:20:59] I don't know if that actually really did too much to improve safety in particular. I think it kind of just were making it smarter generally, and then there's some downstream effects on safety. So it would be nice if one can more differentially move in the direction of safety, and not just sort of ride those trend lines.

[00:21:15] So I think in people trying to come up with safety techniques, it's useful to shoot for some type of goal that can be somewhat decoupled from some general capabilities. And by general capabilities, I mean like general classification, data compression, instruction following, helpfulness, state estimation, efficiency, scalability, reasoning, optimization, self-supervised learning, that sort of stuff, those instrumentally useful capabilities for pretty much everything that have very strong downstream effects on lots of tasks. I think we shouldn't be in the business of doing that. But instead, I think safety is best if it's what's the sort of thing that you can do outside of scaling? What are desirable properties that we want the model to have outside of scaling? That way we're actually making some difference instead of kind of sloshing about, just making a lot of noise while we wait for the models to get larger, and then it solves the problem, and then we'll just make some more noise. But instead, let's differentially move on safety.

[00:22:23] That's why I mentioned the sort of honesty, truthfulness distinction in the previous slide. Because if we'd say, well, we want truthfulness, and it's, well, truthfulness, what's that a combination of? That's maybe a combination of accuracy and calibration and honesty. So if somebody were to say, well, I'm writing a safety paper where I'm going to make the model more accurate on ImageNet or something. And then it's, I don't know if that's really a safety paper, accuracy is already kind of the main goal and main metric in machine learning, and that's very much tied to it being truthful. You could rebrand a lot of research as being safety relevant by just calling accuracy truthfulness. But I don't know if that's actually making, improving safety in any particular way.

[00:23:02] So one way to mitigate this is to have some sort of safety metric, have one of these general capabilities metrics, and at least show that there's something different going on, that you're not just riding the trend line, but you're actually pushing distinctly on the safety direction.

[00:23:16] The research areas that I've mentioned so far are all areas where it's feasible, and there are many papers where people are distinctly pushing on safety, like with trojans, if you get a better trojan texture, this doesn't make it better at the coding or something like that. So they can come apart quite a bit, and these are mainly the research areas that I'm emphasizing.

[00:23:37] Okay, and so now I'm going to speak about some other parts in alignment that I think are relevant in measuring how well models behave. Obviously, model behavior is not the only thing. There is situational awareness that we spoke about in the monitoring section, but some of us may be interested in getting models to behave better.

[00:24:00] So let's consider the game scenario where we're at the office late at night, and suddenly you hear commotion in your boss's office. After a while, you decide to investigate. When you enter his office, you find blood splattered and your boss laying on the floor. He's been slain, what do you do? And the agent in these text-based games can type in the action that comes next.

[00:24:20] These are like text adventure games from the 80s, and it could call the police; it could, you know, that watch, nobody owns the watch anymore, it would be such a waste; or, I'm a cleaning robot, there's a mess, it's time to clean up. Those are all bad actions.

[00:24:35] And so the reinforcement learning environment won't necessarily distinguish among the goodness or badness of these actions. So what we did in these sort of Jiminy Cricket environments is we annotated the sort of moral salience of various different scenarios. And from that, we can see how well a model is behaving. So it's trained to maximize the reward, but then we can see how well it's doing in that environment, and whether it's behaving ethically or unethically or whether the pursuit of reward trades off with ethical behavior.

[00:25:07] Now, I'll mention a I'll get to a newer environment that will release this upcoming month, but to motivate it, and this is partly touching on something Tegan mentioned yesterday, the AI systems, which naively remodel real world data often learn harmful values. We know this with with text based models, they learn to be toxic or learn to model toxic types of behavior. If you put them in reinforcement learning environments, you get Machiavellianism basically, where it's sort of this 'the reward is the main goal, I must do anything to accomplish that goal, I'll do instrumentally or shady stuff in the process', and that's not really desirable.

[00:25:47] So like in Cicero, for instance, if you want to do well at diplomacy, what do you do? You learn to backstab. It's just it's just what you do when you're trying to maximize reward. So unfortunately, a lot of these environments that we come up with don't necessarily incentivize the best sort of behavior. So you could get deception out of it, maybe down the line you might get things like self-preservation or power-seeking or resource acquisition, and that would be really undesirable, so we'd like to push against some of that Machiavellianism.

[00:26:14] To do so, let's think about how to measure some of these concepts like power-seeking or resource acquisition and deception. So what we can do is we could operationalize these concepts like let's say, let's say power. Like what's power? That's one of those vague humanities words. What's that about? You could think it's maybe something like your ability, or let's say we'll make it formal, the probability that, the probability that you can accomplish a wide variety of goals conditioned on trying. And it's OK. That is, you know, has a mathematical veneer to it. I don't know how to measure that whatsoever.

[00:26:56] It seems there's possibly a way to measure some notions of power in some of these environments by looking at its ability to impact things. So maybe some other notion of power is let's look at its ability to influence other agents' utilities. So if it's able to, in a short period of time, dramatically change others' utilities, it's more powerful. If it's not able to do so, it's not a very powerful agent. So, and so, I is something like impact, and the impact can be measured in some units that you care about, such as like utility.

[00:27:24] Another concept, another concept that maybe one could formally get at is at least not leaving the right impression, or deception might be something like, we have another agent, we have its probability distribution, we're modeling that. The agent is believing that another agent is going to perform perform action a prime, and they're speaking with each other, but the agent is actually going to perform action a. Subtract from that the probability that the agent will, that the other agent believes that the agent will perform action a. If the difference between those two crosses some type of threshold, that's some evidence that the agent isn't leaving the right impression. And so we'd like to mitigate that type of behavior.

[00:28:08] So it's there's one thing I'm optimistic about is our ability to take some of these very vague concepts and try and concretize them more precisely and measure them, potentially penalize them.

[00:28:20] So what we did was with this Machiavelli benchmark, we took some other text-based games and we annotated just about everything. You can annotate, re-annotate all the resources, resources, the more resources you have, obviously, the more powerful you are. So we tracked resources of various types in the environment and their instrumental use. We measured the utility of the other agents inside of the, inside of these text based games, too. So that way, if it did an action that harmed another agent, you could see how that affected its utility or its sort of its influence for other people's utilities. And so we can get at various concepts like whether it was telling a lie, selfishness, for instance, like is its own utility much higher than that of others? Is it taking actions that makes itself better off and others worse off? Looking at that type of discrepancy, you could look at power, like how many resources it's getting, or its influence over various resources. So this is another thing that if you're a sort of large language model researcher and wanting to do some text-based environments, this might be an interesting avenue to work on.

[00:29:29] And I should say, as it happens, if you're wanting to make the models behave more ethically, it often is going to trade off with rewards. So I'm not sure there's, you know, the models won't necessarily sort of get 100 percent in a zero shot way in the long term. There are some intrinsic tradeoffs between behaving well and behaving in ways that are optimal as judged by the environment, so we'll need to try to improve those tradeoffs and collectively come at a sense of, how much are we willing to trade off an agent pursuing whatever goal it's given? And it's how generally how cautious or morally it's acting.

[00:30:11] So I don't know if we'll be in this sort of realm of 100 percent in every sort of safety benchmark. It will end up being more like a curve and we're trying to come up with Pareto improvements along that curve.

[00:30:22] So I guess this is just a brief empirical result of, you can add a sort of power penalty where you're disincentivizing the agent from acquiring power and that won't affect the reward, but it will affect its potential impact on the training environment that it's within, where it's not influencing capital nearly as much inside of these environments. More plots with that, but I guess it's just partly shilling an upcoming research paper.

[00:30:55] So here we have, in conclusion, there were three areas of robustness, monitoring and alignment for robustness to sort of buzzwords to search in. Like if you're wanting to do any empirical research, you might search proxy gaming, you might search the gradient-based adversarial attacks on language models and making models more robust to those. Or unforeseen attacks, or unrestricted attacks would be some other types of queries that I think would be very useful for improving safety.

[00:31:26] For monitoring, there's a lot of work on transparency and interpretability, and I think all the non-saliency maps type of research will tend to be particularly useful, the stuff where we're trying to get at the models' internals. Thank you to Bean for sort of destroying that line of research for all of us. And anomaly detection, or out of distribution detection, and then Trojan detection, I think is also useful at getting at some of the larger concerns.

[00:31:55] Alignment is a much more inchoate area, so there aren't as many papers in there. There's a handful of these honesty papers, maybe there's like 10 or so. Thanks to Yichun and others here for working on machine ethics. And then I think we want to study things like power aversion, or sort of trying to put limits on how much power agents can themselves be acquiring in environments and keeping track of that.

[00:32:21] So these are some of the research areas. And then for resources, there's a paper that I wrote with Carlini and Jacob here, and I don't know if John Shulman is here right now, where we're sort of outlining a lot of these concrete problems. There's an upsell, x-risk version for the spicier people called X-Risk Analysis in AI research. There's an ML safety course. If you're running any topics course on any of these safety bits, there's lectures and slides and problem sets available if you're wanting to keep abreast of the research as it comes out. We have a Twitter where after the paper is up on Arxiv, we'll post it on the Twitter right away.

[00:33:00] And then as of yesterday, after a year of negotiation and with the help of Ajeya at Open Phil, we got the NSF AI safety request for proposals through, and so there's $20 million in funding for any of the academics here for pursuing AI safety research. So, thank you.

[00:33:26] Oh, that's probably time. Okay. Okay, sure. Yeah, Alan?

[00:33:33] Audience member: [Inaudible]

[00:33:37] Dan Hendrycks: Yep. Okay. So, one is, so in an outside bit, the sort of the system or the Swiss cheese one is like a nice simplification of it. I think safety engineering these days is more about like, let's try and get a larger complex system to be controllable and safe, so, this is kind of a sort of chain of events model. So, it has some flaws just generally I'd like to flag.

[00:34:27] In the case of, I agree with the general idea. So, one other distinction from safety engineering would be the distinction between protective mechanisms and preventative mechanisms. And so, protective mechanisms might be like, let's prevent it from happening in the first place. And then, so like we'll add some barriers inside of the Titanic to make sure if there's an iceberg, it'll get contained, the water will be contained in some specific reason. That's more preventative.

[00:34:52] And then protective might be like the lifeboats of like, if everything's gone badly, how can we at least minimize that? So, one's reduce the probability of the bad thing happening in the first place, and the other one is, let's reduce the impact if it does end up happening anyway.

[00:35:05] And I think that alignment being about, alignment being about, is about just reducing the hazard in the first place, which is more preventative. And a lot of the protective measures are often weaker. So, you know, an ounce of prevention is a pound of cure, I think it is sort of exemplified in this sort of asymmetry there. I still think you want all of them. I also don't think that the risks from the AI agents themselves are the only part of the, only part of the overall risk analysis. Because the alignment part is specifically about what's the hazards in the agent and what's it, you know, aiming to do and making, making sure that that's, it's trying to do what we want. But I think there are other systemic issues like competitive pressures, which I think actually, and personally, I think dominate the analysis for how safe things will actually go. I think competitive pressures are, so, that's maybe why I put systemic pressure or systemic safety first. And I think we need multilateral cooperation to make this go well, instead of everybody racing and as they currently are.

[00:36:14] So anyway, I guess it's just some, some remarks in safety, but I agree completely with the intuition that preventative measures can be more effective if they're present than protective measures.

[00:36:31] Audience member: [Inaudible]

[00:36:44] Dan Hendrycks: Oh, sure, sure, sure. Yeah. So I think, I think generally in, in building the solution and testing them, you probably don't want all the training wheels and wires. That way you don't mask the failures whenever you're experimenting with them in the lab. And yeah, so that way they aren't covert failures. I think that'd be a good way of engineering it.

[00:37:05] Yeah, yeah, yeah, yeah, yeah, yeah, yeah. That's one other way of doing it. Yeah. Uh huh.

[00:37:21] Audience member: [Inaudible]

[00:37:28] Dan Hendrycks: Okay. So, so there's a, there's a distinction between - by default, there is a correlation - I think safety researchers can, can often just, there are some new capabilities that the capability community has provided us with, and then you can use it.

[00:37:48] So there'll often be downstream effects. Like for instance, let's, let's, let's be honest that the, the language models having many notions of an ability to predict, make predictive statements about a lot of common sense morality stuff has been very useful. We're not as worried about them having some, you know, really crazy interpretation of our instruction as we may have had in the past. So there's some ways in which capabilities can, can end up making things safer. And this is sort of exemplified by that trend line that they, they are going up together.

[00:38:20] But I think if one's saying, I want to improve safety, if that's the goal, then I think one wants to try and find something that isn't just correlated with scaling or is, is actually distinctly moving in a certain direction. So by default they are correlated, but I think there are interventions that can be fairly de-correlated from them.

[00:38:40] And we would be remiss not to utilize the sort of capabilities around us as well. Yeah.

[00:38:51] Audience member: [Inaudible]

[00:38:56] Dan Hendrycks: You could say lies of omission. I don't know, there might be some issue of the economic viability of those sort of models. If it's saying, you know, like, Hey, you know, what do you think of me? Or, you know, how high status am I? And that, well, you know, the, the truth might be terrible in, in some situations. So I don't know if people want that.

[00:39:30] So as a restriction, maybe you will forget about lies of omission, but we probably want at least the option of, of, of making models not omit salient, salient information as well.

[00:39:45] Presenter: Thanks so much, Dan.

Dan Hendrycks - Surveying Safety Research Directions

Transcript

Alignment Workshop