Jascha Sohl-Dickstein - Adversarial examples transfer from machines to humans

Transcript

I'm Jascha. I'm gonna tell you that adversarial examples transfer from

image models to the human brain. And due to time, let's just move on.

One of my largest personal fears about AI in the medium-term future is that it will allow targeted

manipulation of people. And I think as the capabilities of the AI increase,

so will the power of the manipulation that can be achieved.

Sometimes I find that I spend, like, two hours straight,

scrolling on Twitter with one finger and I'm like, what did I just do?

Twitter is a dead stupid ML algorithm rearranging pre-existing content.

I think if you were able to generate, live, video, audio, and text targeted at my particular brain based on my history of online interactions,

I wouldn't stand a chance. I would be as addicted or as outraged as you wanted me to be, or buy whatever

brand of soda you tell me to buy... So I'm really worried about this. I think adversarial examples

maybe provide a motivating example that this kind of extraordinarily targeted control of a neural system

is possible. Here, for instance, you have a small perturbation you can add to an image which convinces a machine vision

classifier of absolutely anything you want it to believe. In this case that a bear is a truck.
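
As a rough illustration of the kind of perturbation described here, the following is a minimal sketch, not the method used in the experiments. It assumes a toy two-class linear classifier with hypothetical random weights `W` and `b`, pixel values on a 0 to 255 scale, and a single signed-gradient step bounded by `eps` per pixel. All names are illustrative.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def targeted_perturbation(x, W, b, target, eps):
    """Take one signed-gradient step toward `target`, at most eps per pixel."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    # Gradient of log p[target] with respect to x for a linear softmax model.
    grad = W.T @ (onehot - p)
    x_adv = x + eps * np.sign(grad)
    # Keep the result a valid image on the assumed 0..255 pixel scale.
    return np.clip(x_adv, 0.0, 255.0)

# Toy setup (assumption): 64 "pixels", two classes, random small weights.
rng = np.random.default_rng(0)
CAT, TRUCK = 0, 1
W = rng.normal(scale=0.001, size=(2, 64))
b = np.zeros(2)
x = rng.uniform(50.0, 200.0, size=64)

# Nudge the same image toward "cat", changing each pixel by at most 2.
x_cat = targeted_perturbation(x, W, b, CAT, eps=2.0)
p_before = softmax(W @ x + b)[CAT]
p_after = softmax(W @ x_cat + b)[CAT]
print(f"p(cat) before: {p_before:.3f}, after: {p_after:.3f}")
print(f"largest pixel change: {np.max(np.abs(x_cat - x)):.1f}")
```

Calling the same function with `TRUCK` as the target would give the "truck" version of the image. Real attacks on deep image classifiers typically iterate such steps and use the network's own gradients; the linear model here only keeps the example self-contained.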

But humans are not artificial neural networks. So does this actually apply to us?

Let's run an experiment. So what we're going to do is...

we actually ran a whole suite of different experimental conditions, but I'm just going to describe one in the talk.

We had subjects look at a screen and we showed them two images and we said,

OK, which of these two images looks more like a cat to you?

And of course, neither of them is a cat, but you still have to choose one of the two images.

We can try this ourselves. Raise your hand if you think the image on the left looks more like a cat.

OK. And now raise your hand if you think the image on the right looks more like a cat.

So that's actually a surprisingly effective demonstration because I would say there were about 50% more

hands raised for the image on the right.

And in fact, the image on the right has been adversarially perturbed in order to make computer

vision algorithms believe that it is a cat, while

the image on the left has been adversarially perturbed to make computer vision algorithms think that

it is a truck. I actually am doing better on time. So I'm going to just leave these up for like five seconds.

You can like try to find differences. This is an epsilon equals two perturbation.

Let me just show you our results. The results are that, in fact, subtle adversarial manipulations that work on an ensemble

of computer vision algorithms, after additional geometric augmentation, transfer to humans. Here, in this plot,

the X-axis is the perturbation magnitude of the adversarial example. The effect gets stronger,

the larger the perturbation you allow. The image you saw was perturbation magnitude two.

The dashed line is chance performance and the Y-axis is how much we are able to bias human perception.

And you can see that even at epsilon equals two, it's like a 2 to 3% bias in human perception.
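
To make the bias number concrete, here is a minimal sketch of one way such a value could be computed. It assumes the Y-axis reports the fraction of two-choice trials on which subjects picked the image perturbed toward the target class, minus the chance level of 0.5. That reading, the function name `choice_bias`, and the trial counts are all hypothetical and are not taken from the experiment.

```python
import math

def choice_bias(n_target_chosen, n_trials, chance=0.5):
    """Return (rate - chance, standard error) for a two-choice task."""
    rate = n_target_chosen / n_trials
    # Binomial standard error of the observed choice rate.
    std_err = math.sqrt(rate * (1.0 - rate) / n_trials)
    return rate - chance, std_err

# Hypothetical counts, chosen only to illustrate a bias in the 2 to 3% range.
bias, std_err = choice_bias(n_target_chosen=525, n_trials=1000)
print(f"bias above chance: {bias:.3f} (standard error {std_err:.3f})")
```

With counts like these the standard error is comparable in size to the bias, so an effect this small needs many trials before it stands clearly apart from chance.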

This is super cool scientifically maybe because it suggests that there are

even closer and more surprising correspondences between subtle

behaviors of artificial neural networks and the human brain. It's also maybe quite worrying

because it suggests that some of the more sci-fi strong targeted manipulations that we are able to do in order to make artificial

neural networks behave in bad ways also transfer to some degree to the

human brain and the human brain may be susceptible to similar things.

So we should worry more about manipulative superstimuli targeted at us.