Dawn Song - Challenges for AI safety in adversarial settings

Transcript

Thanks everyone. My name is Dawn Song.

I'm also from UC Berkeley. I'm a director at um Berkeley Center for

Responsible Decent - Decentralized Intelligence and also affiliated with the CHAI and 

BAIR. So it's been, it's really exciting to

talk about machine learning all the great advancement and so on. But coming from security background,

one message I really want to get across to this audience is that as we talk about all 

the great things machine learning can do, it's really really important to consider the development and deployment of

machine learning in the presence of attackers. So history,

especially in cybersecurity has really shown that the attackers always follow the footsteps 

of new technology development or sometimes even lead it and also this time the 

stake with the AI is even higher as AI can choose among more and more systems,

attacker will have a higher and higher incentives um to attack these systems.

And also as AI becomes more and more capable,

the consequence of misuse by attackers will also become more and more severe.

And hence, as we talk about AI safety, it's really important to consider AI safety in the adversarial setting.

So, adversarial examples has been shown to be prevalent in deep learning systems.

Um but my group has done a lot of work in this space and many uh uh and many other 

researchers also in this audience have done a lot of work in this space.

And pretty much we have shown that adversarial examples are prevalent to essentially all 

different tasks and model classes in deep learning. And also there are different types of threat models,

including black box and white box attacks that can be very effective and even attacks

can be effective in the physical real world as well.

And it's great to see that the number of papers essentially has increased 

exponentially in the in the area of adversarial examples.

And and also this work has also helped raise a lot of the awareness in the public as well.

So some of our work, the artifact from our physical adversarial examples are now actually part of the

permanent collection at the Science Museum of London.

And also adversarial attacks can happen at different stages of machine learning pipelines 

including at inference time as well as pre-training and fine-tuning stages.

So as we talk about AI safety and AI alignment,

it's important to look at what adversarial attacks can do in the setting,

in particular, on safety-aligned LLMs.

Unfortunately, recent work has shown that really the safety-aligned LLMs are pretty

brittle under adversarial attacks. This is our recent work DecodingTrust,

which is it's the first comprehensive trustworthiness evaluation framework for large language models.

And essentially we developed new evaluation datasets as well as new protocols for evaluating

trustworthiness across eight different perspectives for 

trustworthiness for LLMs. Including toxicity, stereotypes, robustness, privacy,

and fairness and many others. And in the development of our evaluation framework,

we specifically set out to both include the 

evaluation under benign environments as well as adversarial environments.

And our work showed that for these large language models - so for example

GPT-4 can perform better under benign environments, for example,

than GPT-3.5. But actually in adversarial environments,

GPT-4 could actually even be more vulnerable than GPT-3.5, potentially because 

it's actually better at following instructions.

So it's actually in something easier to jailbreak. And given the interest 

of time I won't go through these - these are examples for these different perspectives.

And also you can learn more information at decodingtrust.io and also come 

to our presentation on Tuesday morning.

And also there are other researchers' work demonstrating other types of 

adversarial attacks, so our work also shows that, for example,

the attacks on open source models can be transferred to closed source models.

And Zico is going to talk more about their attack on universal adversarial attacks on 

tomorrow as well. And other works have shown that adversarial attacks multi-model

models and also adversarial fine tuning where you can actually use 

a very few adversarially-designed training examples 

through fine-tune to completely break safety-aligned LLMs.

The key message that I want to get across here is that in adversarial attacks,

the field has made tremendous, tremendous progress.

We have collected and come out with so many different types of attacks.

But however, for adversarial defenses, the progress has been extremely slow.

So literally, we really don't have effective general adversarial defenses today.

And hence, for AI safety, it's really important for the AI to have safety mechanisms to be designed to be resilient

against adversarial attacks. And this actually poses a significant

challenge for AI safety and I think my time is up so just one word to summarize...

...Partially to address this issue.

In some recent work, also with the collaborator here, and Dan Hendrycks led this work with

some other collaborators here,

we have a new work on representation engineering to essentially look at 

the internal representations - activations - from the LLM and showing that we can identify certain

directions um that can help identify whether the LLM is more honest or is more truthful for

reading and using these internal directions, we can also control the model behavior.