Sam Bowman - Adversarial Scalable Oversight for Truthfulness: Work in Progress


So I'll be talking about a scalable oversight research agenda that I've been pursuing with

collaborators both at NYU and Anthropic. So what problem are we solving? 

Scalable oversight is a somewhat vague term. What do I mean in this context?

The big picture problem that I'm interested in is finding a method that will let us use LLMs

to reliably, correctly answer factual questions that no human could otherwise answer.

And I'm focused on doing so reliably more than doing so for every possible question.

I'd like it to be the case that if it seems like we've managed to elicit an answer to a question

from a model, that we can trust that that answer is correct, even if there might be some questions

that we aren't able to answer. Why is this important for safety? Why am I talking about

this in an alignment workshop? First, setting aside safety, just on the motivation,

I think a lot of the promise of what we would hope to get out of successful deployments of

large language model style systems in the farther future relies on being able to do this. I think

one thing that many people would hope for is something like being able to ask the model,

"Hey, we're sort of encountering this new disease. Propose some molecules that might be good

candidates to explore for drug development..." where it's plausible that a model might be able to

synthesize information in a way that would allow it to do a task better than any human.

It's also pretty clear that if the model is doing that unreliably, and we don't know why it's giving the answers it's giving,

it's pretty useless. A low-confidence, tentative answer to a question like that is not

helpful. So we want to be able to do the task just in order to be able to take advantage of the

capabilities of models. For safety, though, I think there are a couple of pretty clear reasons

this is going to be helpful. First, a lot of the worst-case threat models around sort of agentic

systems being intentionally deceptive route through models trying to convince humans of untrue things,

and the more that we're able to reliably elicit true beliefs from models, the less surface area

there is for that to work. Plus, in the sort of more short-term prosaic side, I think a lot of

things like issues with RLHF, issues around sycophancy, can be mitigated by tools that let you

make sure you know how a language model understands a situation before giving

that model feedback. So those are a few reasons I'm excited about this kind of goal.

And I'm focused on a scalable oversight-flavored approach to this goal. As the oversight name

implies, we're focusing here on sort of human supervision as a source of trust.

So the particular version of the question we're trying to answer

is how do we enable humans to accurately supervise LLMs for truthfulness in domains that humans don't fully

understand? So the hope is that every question that a system is answering, if we're relying on

that answer, we're ultimately grounding out to a context in which the human fully understood why

that answer is correct. We're not deferring much to the model itself, we're just using it as a tool to get there.

The main agenda that we're pursuing in this direction, is human-moderated

AI-AI debate. This is drawing on a whole bunch of different threads. I'm crediting Tamera, a collaborator at Anthropic, who

helped write this up into a coherent thing. But this is a research agenda that's gradually

grown as a fork of a longer-standing series of ideas in alignment, especially from

this Geoff Irving, Paul Christiano, Dario Amodei paper, AI Safety Via Debate, and some related work

from Geoffrey, Amanda Askell, Beth Barnes, Evan Hubinger. So if someone says debate in the context

of AI safety, they might mean a bunch of things other than what we're doing, to be clear. And

we owe a big debt to all of those things. Anyway, what's the deal with our AI-AI debate agenda?

What are we doing with debate that might be useful for scalable oversight and eliciting

truthful answers? So here's the basic setup. A human judge is responsible

for answering a two-choice question. This is the mode of interaction that we are trying to get to work very

robustly. So in our toy example, we're asking the human to decide, is it the case that turtles are bad?

This is the simplest debate we could fit on a slide. We're hoping that the human judge in

this case has all of the resources that might reasonably be made available to them. In particular,

they're trying to do this for a bunch of questions, and they're getting feedback on whether they're getting these questions right. So we're effectively training

the human judge, to the extent that you can train an annotator in a task like this.

As a resource, they're working with two AI debaters. These are something like a fine-tuned large language model, and

we're focusing on cases where those debaters have some information advantage

or skill advantage over the judge that's relevant to the task such that the human judge should be relying on these models in order to answer the question.

Each debater is told to argue for one of the two sides. That's what they're doing.

The debaters are trained using reinforcement learning from the judge's confidence in their answer.

So they're essentially trained to win. If the human answers, I'm 75% confident that A is correct,

then debater A gets a small positive reward, debater B gets a small negative reward.
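The zero-sum reward scheme described above can be sketched in a few lines. The 0.5 baseline and the exact scaling are illustrative assumptions, just to show the shape of the incentive:

```python
def debate_rewards(judge_confidence_a: float) -> tuple[float, float]:
    """Zero-sum rewards derived from the judge's confidence in answer A.

    A 50% judgment gives both debaters zero reward; anything above 0.5
    rewards debater A and penalizes debater B by the same amount.
    """
    reward_a = judge_confidence_a - 0.5  # small positive if the judge leans toward A
    return reward_a, -reward_a

# "I'm 75% confident that A is correct"
reward_a, reward_b = debate_rewards(0.75)  # reward_a = 0.25, reward_b = -0.25
```

Because the rewards sum to zero, each debater is trained purely to win relative to its opponent, not to please the judge in absolute terms.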

The reason this doesn't just immediately devolve into deception and the models making stuff up

is that the debaters are asked and given the ability to present evidence for their arguments.

So any sort of contingent factual claim that the debaters are making, they should back up with some form of evidence.

And depending on the task, this could take different forms. This could be links to scientific papers.

This could be sort of code that demonstrably implements something that the human can run and check

that it produces the output it's supposed to, et cetera. The debate takes the form of an unstructured dialogue

between the debaters and the judge. This is some place we're departing from earlier work

that I'll get back to. And we're aiming for it to be the case that the judge will

only resolve one of these debates, or at least only resolve with high confidence,

if they have been given a fully satisfactory reason to believe that the answer is in fact A,

that the answer is in fact the one they're choosing. So to be clear, it should be possible that the judge thinks,

OK, debater A did a much better job. They responded to all of the points that B made.

They were clear. They were cooperative. But I'm just going to put 50-50 because debater A hasn't

made a really comprehensive case. They haven't really answered all of my questions,

given me fully satisfactory evidence that the answer has to be A. So this is where we're aiming for this

kind of precision over recall case, where we're content if many of these debates

don't resolve as long as when they do resolve, the judge is getting it right.

So that's the setting. That's what we're trying to build out. I'm going to go through quickly a few projects that

are working towards implementing this as a tool with real LLMs. So first I want to give a quick intro to the testbed dataset

and task that we're using with this, which is this dataset called QuALITY from my NYU Group.

It's got a cute acronym. This is from Richard Pang, Alicia Parrish, and Nitish Joshi there.

So this is a multiple choice QA dataset. It's usually four choice, though we can trivially cut out

two of the choices to make it fit our format. They are reading comprehension questions

about pretty long stories. These are stories that are sort of the length of something

you might find in like New Yorker magazine. They take 20, 30, 40, 45 minutes to read.

And we verified that these questions can't be answered by someone who hasn't read the story.

If you give someone the question and they sort of get brief access to the story for a minute, they can search through.

They can see if they find any matches for a keyword or very quickly skim.

They're still not going to get it right. These are questions that require some amount

of holistic understanding of what happened in the story.

The way that we're using this to test debate is that we are giving debaters,

potentially language models, full access to the story.

And then we're trying to have judges answer the question who

don't have access to the story except through short excerpts that are selected by the debaters.
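A natural way to make those debater-selected excerpts trustworthy as evidence is to mechanically check them against the hidden story before showing them to the judge. The exact mechanics in the experiments may differ; the length cap and exact-match rule here are assumptions:

```python
def verified_quote(story: str, excerpt: str, max_chars: int = 500) -> bool:
    """Accept a debater-selected excerpt as evidence only if it is a
    genuine, bounded-length quote from the story the judge cannot see."""
    return len(excerpt) <= max_chars and excerpt in story
```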

Language models until recently have been pretty bad at this.

At least as of nine months ago, this was right where the frontier was.

Models could just barely start to do a task like this. And so our first efforts in this area

were actually using humans instead of language models as the debaters because language models weren't quite

reliable enough to be interesting. So these results just came out a few weeks ago.

This is primarily from Julian Michael, Salsabila Mahdi, and David Rein.

So we ran a study testing this protocol out using humans in all the roles.

So human debaters, human judge, similar incentives, just with cookies and praise instead of actual RL reward.

We also compared with a baseline that we're calling consultancy

where there's only one debater, where you only get to interact with one debater.

That debater has still been told that they only win points if the judge picks A. And so they're clearly biased.

But they also are incentivized to be cooperative with the judge, lest the judge notice what they're doing

and intentionally kind of rule against them to some extent. And I should say, actually, this consultancy setup is probing

something I think is interesting about debate, which is that in our setup, it's a little bit redundant.

There are two mechanisms in our debate setup that make sure that the judge doesn't resolve

to a false claim. One of them is just this requirement that the judge shouldn't resolve the debate unless they

are highly confident, unless they've seen a fully satisfactory argument for the thing

they're resolving for. And then also we have these two different debaters.

So any claim, any idea, any consideration, any argument that is being surfaced in the judge's reasoning

is going to get called out and criticized. And hopefully that

makes the judge less likely to miss something important or rely on a false assumption.

These are somewhat redundant, but I think we do want both.

I think if we relied purely on the idea that

this is a prover-verifier setup, where the language model

generates an argument and the human checks

whether it looks good, we would be relying very, very heavily on human reasoning

being robust, reliable, and infallible, in a way that I am not thrilled about.

So anyway, here's our setup. I'll get through this plot in a second.

So we tested all this out. We actually hired a big swathe of the NYU competitive debate team

to do this. They didn't actually have to be debaters. Sort of competitive scholastic debate

isn't quite the same thing as what we're doing here. We could have easily gotten law students instead

who are studying litigation. That's also similar. But the debaters were the easy one to hire.

They were excited about this. And this worked pretty well. Testing on this QuALITY dataset,

we were able to get human judge accuracy toward the sort of later stages of this protocol up to about 84%.

This is somewhat reassuring and also somewhat worrying, because this was a pretty intensive multi-month effort.

These kids were coming in sort of several hours a week for several months.

And we're talking quite a bit about sort of how to do this carefully, how to do this thoughtfully.

And we still didn't get perfect accuracy. There's still 6% or 7% of these questions

where the human judge confidently resolved in favor of the wrong answer.

So there isn't low-hanging fruit to make this really robust. I think figuring out how to push in this direction

is one of the big things we're worried about for this agenda.

But it does basically seem like it's working. It seems like things are pointing in the right direction.

We also did see that debate significantly outperforms consultancy by about 10 percentage points.

So it seems like this adversarial piece of debate, even if it's not really crisply theoretically

necessary for what we're doing, is very helpful in just having humans sort of converge

on true answers more quickly and reliably. And it also saves effort.

Debates are only 68% as long as the consultancy transcripts,

just measured by the number of words in the dialogue, basically because

if the human's working with this consultant,

they're going to have to do a lot more legwork to try to make

sure they're not missing an important consideration.

They're just going to have to be asking the model proactively many, many more things.

We've also got some experiments going with LLMs. These started a little bit more recently

as we started to find models that work on this task. This is led by Ansh Radhakrishnan at Anthropic.

Tamera Lanham's also getting involved. We make a few simplifying assumptions

in our early work here. There's not a paper on this out.

This is still a snapshot of ongoing work. But for now, we're doing a debate where each side just

gets to make one long argument. Each side basically just writes a short essay:

why the answer is A, why the answer is B.

They can hopefully anticipate each other's reactions,

but they're not engaged in a live back and forth.

We're not using human judges yet.

Here, the judge is just a fine-tuned version of Claude 2,

fine-tuned with supervised learning on the correct answer, for the debate judgment task.

We're simultaneously, kind of in alternating iterations,

optimizing Claude 2 based debaters to win debates. We're both using RL

and as a sort of secondary method we found to be helpful, we're using this best-of-n method,

where we sample multiple possible arguments and keep the one that most sways the artificial judge.
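The best-of-n step can be sketched as follows, with `sample_argument` and `judge_score` standing in for calls to the debater and judge models (both hypothetical names):

```python
from typing import Callable

def best_of_n_argument(
    sample_argument: Callable[[str], str],     # debater: question -> candidate argument
    judge_score: Callable[[str, str], float],  # judge: (question, argument) -> persuasion score
    question: str,
    n: int = 4,
) -> str:
    """Sample n candidate arguments and keep the one that most sways the judge."""
    candidates = [sample_argument(question) for _ in range(n)]
    return max(candidates, key=lambda arg: judge_score(question, arg))
```

This is a cheap, inference-time complement to RL: the policy isn't updated, but the selected arguments are more persuasive than an average sample.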

We found pretty encouraging results. Judge accuracy improves during training.

So this sort of optimization loop, this multi-agent game does push in the direction

of making the judge better at eliciting true information, even though the absolute scores are still not great.

Debater effectiveness also improves. This is a measure of basically how good the debaters are at winning.

We get a roughly +100 Elo improvement. This essentially means that if you take an optimized debater

and pit it against an unoptimized debater, you should expect the optimized debater to win more often,

independent of who's been assigned the correct answer.
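Under the standard logistic Elo model, a +100 rating gap translates into a concrete expected win rate, which is one way to make the claim above tangible:

```python
def elo_win_probability(rating_diff: float) -> float:
    """Expected win rate of the higher-rated player under the standard
    logistic Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# A debater 100 Elo points stronger wins about 64% of the time.
p_win = elo_win_probability(100.0)
```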

So this means that even as the dishonest debater is getting more effective,

even as there's kind of more manipulation getting thrown into the transcripts,

the overall system is better at producing true answers.

And the transcripts look like debates. And qualitatively, as you optimize harder,

the transcripts look more like sensible debates that are arguing about real considerations and real evidence.

So that's an encouraging sign of life. This graph is showing the Elo effect

of additional steps of RL (50 steps here) and of the amount of this best-of-n optimization we're doing.

I mostly just wanted to have a picture of something to throw on here, apologies.

One axis is this best-of-n method; the other is the number of RL steps we've done. Thanks.

Quickly run through one last project, which is building a better testbed for this stuff.

This is a new data set that came out just a couple weeks ago, called Graduate-Level Google-Proof QA, or GPQA.

This is primarily from David Rein at NYU. So we want to push this really hard.

We want to actually get debate to the point that we understand it well enough,

we can implement it well enough, that we actually can run arguments on questions that are at the frontiers

of human knowledge. And we don't have the data and the sort of task setting

to really test that well yet. We want to test debate in settings where sort of smart, careful, well-resourced judges,

potentially spending many hours on a task, still need to rely on the debaters,

that they're still leaning on this system to get answers right.

We're not quite there yet. A very important reason for that is just

that if you let a debate on the QuALITY data set run really, really long, in every turn of that debate,

the debaters are quoting from the story. And so eventually, you just quote the entire story

at the judge. And then the judge can just read the story and ignore the transcript.

So there's kind of a limit on how much arguing you can do on this task.

QuALITY is also arguing about science fiction stories that are available on the web.

So we also need to block internet access for our judges, which is not the setting we ultimately

want to be in. So we put together a harder benchmark.

Here's a typical question from that benchmark. Methylcyclopentadiene, I guess I'm pronouncing it right,

was allowed to react with something-something, and the catalytic amount of something,

... a bright yellow cross-conjugated something formed.

How many chemically distinct isomers make up the final product? Multiple choice.

So it's still a multiple choice QA. Same basic format.

The question writers have PhD-level training in one of a few subfields of chem, bio, or physics.

We verified that the questions can be answered by someone else in the appropriate

subfield. And we verified that, at least in aggregate,

if you give it to a different question writer, say if you take this chemistry question and give it to a

physicist, they'll do quite poorly, even if they spend half an hour or

45 minutes googling frantically and skimming scientific papers to try

to figure out what's going on. So this is a question where you really

need to rely on someone with domain expertise in order to be able to answer this, which

makes it this nice testbed. Qualitatively, the questions that sort of worked well

tend to require both pretty difficult concepts, just

pretty deep domain knowledge, as well as some tradecraft.

Like I would guess that the fact that some intermediate product,

here, is yellow is the kind of thing that you can find evidence of in the scientific literature,

but a lot of the papers that discuss these compounds might not talk about them being yellow.

So we're trying to put together questions that lean pretty heavily on tradecraft

so that we can get this really clear expertise gap.

For better or worse, this is challenging for current systems.

GPT-4 augmented with search gets 39%. We're still figuring out our Claude 2 evaluation setup.

I have no reason to suspect it's going to do dramatically better.

So this is still also challenging for frontier models. So starting to wrap up, what's next?

Big positive takeaway is it seems like this protocol basically works.

There's a lot more turning the crank we need to do, but debate seems to incentivize AI systems

to help humans answer questions they could not have otherwise answered.

One of the obvious next steps we're interested in, beyond some things I alluded to, is cross-domain transfer.

One of the big assumptions I haven't mentioned so far in debate is we're going to train our judges really

extensively on all kinds of domains where we can check the answer, all kinds of domains

that are sort of within normal human knowledge, and then have them try it out in a domain that is significantly

harder and more alien. And we'd like good proxies for that.

We'd like to know that that works. It's a big open question.

We'd also like to do debates with language models on GPQA.

Current language models aren't good enough. Hopefully, or perhaps both hopefully and worryingly,

within the not-too-distant future there will be language models that can do this,

and we will run these experiments. There are some lingering doubts.

Let me flag a couple of places where this protocol

shouldn't necessarily give us that much reassurance.

If there are questions for which the simplest argument

for the correct answer is too complex for any human judge, we're stuck.

This approach doesn't work. If there's something where you need

to sort of enumerate an exponential number of cases

to prove that the answer is A, this is not going to go anywhere. The original proposal from Irving et al.

does actually aim at this case, does actually aim at cases where the full proof, the full argument,

might not be enumerable. I think there are still some other open theoretical problems

we need to solve before we can implement that vision. But if problems like this wind up

being centrally important for safety, that is another route we're going to have to go down. And the other big source of doubt

that I want to flag is just this question: Are there important blind spots in human judging

that we aren't going to be able to train out? Are there topics or types of argument or types of question

where no matter how many times you tell someone, "No, no, no, you're getting this wrong,

mind this logical fallacy, double-check this thing," they're just going to have a systematically high error rate?

So if there are these blind spots, that potentially undermines the whole thing. And this question surprisingly

seems to slip between the gaps of well-defined scientific fields, so I

think we don't have great evidence so far on how worried we should be here.

With that I'll close. Thanks to many many people who contributed to or advised or

commented on these various projects. Happy to talk about all this in office

hours. Also totally separate from this, if you're interested in responsible

scaling policies, I'm involved in that and happy to chat. Cool, thank you.

[Audience member] So in human activities, typically when we try to get to the

truth of the matter, we actually don't use debate, where we encourage one party

to basically take a position they don't believe in; we use something more

like the peer review process, where one party makes a claim that it really believes is true

and then gives the evidence for it, and the other party, the reviewer, is playing

a little bit of a devil's advocate but isn't explicitly trying to argue for

something they don't believe is true. So, are there potentially ways to

train ML systems that would be more like that? Because that's what we do in human society.

Two things come to mind there. First, I think there are a lot of things that intuitively work well

for humans because we're confident that when someone

says "I want to argue for x," they do in fact want to argue for x and believe x to be

true. And we don't want to assume this about language models. We don't want to assume that

models if you ask them sort of which side do you pick, which is correct, we don't want

to assume that we know that that is the answer that will actually be best supported by the

evidence the model can bring. We're sort of assuming that a lot of the motivations

behind the system are somewhat alien. That I think puts us in a position that is more

similar to something like a criminal trial where everyone's motives are somewhat suspect

and that's a case where we do use adversarial systems of evidence. That said, I think there

is probably room to move in this direction and I think once you move to debates over

open-ended questions rather than sort of two-choice questions, I think you likely do wind up introducing

some mechanisms where you're giving models more leeway in what version of a claim they want to defend.

Yeah, thanks. Yoshua?

[Audience member] So the goal is for these debates to tell us something in new cases where you don't need to have a human judge, right?

I mean the human judge part is for training. So if yes, I'm going to continue my question...

Yes and no. I think that the hope is

you can run the whole system with a human judge when you want sort of the highest reliability, the crispest answers.

You can also use a simulation of the judge or there are other ways of sort of using the debaters on their own to answer questions

that at the limit, at convergence, we should expect to work just as well.

So what if the two AIs, A and B, are both confidently wrong about something where safety matters, then they might

come up with something that looks right, but actually A has arguments that look right and B doesn't find a counterargument

because they're both missing the right theory. Then we end up with a very unsafe decision at the end.

This is where we're crucially relying on the human to check everything, to not just go "This looks reasonable, it looks like A is doing a better job," but to really actually verify

that the evidence is coming from the sources it claims to come from, and that the arguments are logically sound.

I think very often a successful debate case is going to be pretty close to a proof.

That makes this very expensive to run in a reliable way. These debates might wind up being very, very long... They're not going to work for everything.

But that's the hope, that you're not ultimately relying in a load-bearing way on the debaters being like successfully adversarial and raising all the right considerations in any one individual debate.

[audience member] Thank you, Sam, for the awesome talk. So I'd like to maybe dig in a little bit more into this last aspect in the debate framework, 

which is, well, you know, ultimately we need it to work with real people and the real users that these systems are going to have.

And the main misgiving I have personally for a while with debate is that

to the extent that humans are not going to be these perfect judges who can never make a mistake and who are going to, as you said, have blind spots in their judgment, couldn't you argue that the Nash equilibrium from the debate is precisely one that exploits these blind spots and that learns to hide its mistakes or its fallacies in these blind spots as effectively as possible so that humans won't catch it?

Because, you know, one of the two debaters is trying to convince the human of the wrong thing in spite of the best efforts of the other debater to make them realize that this is in fact a wrong argument.

And so isn't this something that ultimately we should worry about, that we're actually going to help these systems to learn to exploit these vulnerabilities in real people?

Yeah. Two points there. So first I want to say that I don't anticipate

deploying debate as a sort of widespread end-user-facing product in something like this form.

The goal is very much to use this as a foothold for truthfulness, as a sort of last resort or something we're using specifically to verify safety or to verify that other mechanisms for truthfulness work.

So potentially we're only talking about a very small, very heavily trained group of judges, which mitigates this a little bit.

I think in general, there are encouraging signs that in an adversarial setting like this,

... just letting both sides go all out and make stuff up if they want to... does converge in the direction of truth winning.

But yeah, this last worry is real.

I hope, I think it's likely that there aren't flaws in human reasoning that are sort of insurmountable.

There aren't flaws in human reasoning where if you point the flaw out to someone in the individual argument where it's coming up,

and you point it out over and over again, and they get it wrong and lose a few times,

and then you point it out again, my suspicion is that for all of these sort of flaws, all these logical fallacies,

eventually a careful judge is going to learn to not fall for them.

But if there are cases like that where just there are sort of deep vulnerabilities in human reasoning, then none of this works.

We shouldn't rely on it. Also, we're in a pretty bad situation in general at that point,

because this also, I suspect, undermines the foundations of science and democracy and likely many other things.

Yes, demagogues succeed. That's my paraphrase of Yoshua's comment.