Victor Veitch - What (and Why) is a 'Linear' representation?
This talk is about the so-called "linear representation hypothesis",
a piece of widely believed folk wisdom which says that there's this
miraculous thing where high-level concepts that are semantically meaningful to humans
are somehow linearly represented in the internal spaces of large language models.
And this is in particular about my desire to understand whether this is true.
Although actually, I quickly abandoned trying to decide whether or not it was true, and started trying to understand what it
even means, right? So there's a problem in understanding this, which is "What is a linear representation?"
There are at least three distinct things that people have looked at in the literature.
The first one is this old Word2Vec notion, which is like, king minus queen is parallel to man minus woman and so forth.
And so there's a direction corresponding to the concept of sex.
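To make the Word2Vec notion concrete, here is a minimal sketch. The vectors are made-up 3-d toy numbers, not taken from any trained model, and the axis meanings are assumptions chosen for illustration. It only checks that the two difference vectors point the same way.

```python
import math

# Hypothetical toy embeddings (invented for illustration, not from a real model).
# Axis 0 ~ "royalty", axis 1 ~ "sex", axis 2 ~ everything else.
vectors = {
    "king":  [1.0,  1.0, 0.2],
    "queen": [1.0, -1.0, 0.2],
    "man":   [0.0,  1.0, 0.5],
    "woman": [0.0, -0.5, 0.5],
}

def sub(u, v):
    """Elementwise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity under the ordinary Euclidean inner product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

royal_diff = sub(vectors["king"], vectors["queen"])   # varies only along axis 1
common_diff = sub(vectors["man"], vectors["woman"])   # also varies only along axis 1

# A cosine of 1 means the two differences are parallel: one shared "sex" direction.
parallel = cosine(royal_diff, common_diff)
print(parallel)
```

Note that the two differences have different lengths here; the subspace notion only asks that they share a direction.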
The second one is that you can use logistic regression, like a logit-linear model, to try to predict high-level semantic concepts.
Like, "Is the output gonna be English or French?" You can just predict that using logistic regression.
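As a rough illustration of the probing notion, the sketch below fits a logistic regression by plain gradient descent on synthetic 2-d "representations". Everything here is an assumption made for the example: the data, the concept direction, the noise level, and the training settings are invented and do not come from any language model.

```python
import math
import random

random.seed(0)

# Hypothetical setup: the concept (0 = English, 1 = French) shifts the
# representation along a fixed direction, plus Gaussian noise.
concept_direction = [1.0, -1.0]

def sample(label):
    shift = 1.5 if label == 1 else -1.5
    return [shift * d + random.gauss(0.0, 1.0) for d in concept_direction]

data = [(sample(y), y) for y in [0, 1] * 200]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of label 1 under the logit-linear model w.x + b."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

# Full-batch gradient descent on the logistic loss.
w, b, lr = [0.0, 0.0], 0.0, 0.1
n = len(data)
for _ in range(200):
    grad_w, grad_b = [0.0, 0.0], 0.0
    for x, y in data:
        err = predict(w, b, x) - y
        grad_w[0] += err * x[0]
        grad_w[1] += err * x[1]
        grad_b += err
    w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
    b -= lr * grad_b / n

accuracy = sum((predict(w, b, x) > 0.5) == (y == 1) for x, y in data) / n
print(w, accuracy)
```

In this toy case the learned weight vector ends up pointing roughly along the direction that was used to generate the data, which is the sense in which a linear probe "finds" the concept.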
The third one is the existence of steering vectors, where given the representation produced by some model,
there is some vector that you can add to that thing, which will change some semantically-meaningful attribute in isolation.
OK. And the problem here is these three things a priori do not obviously have anything to do with each other.
And it's not obvious which is the right notion, or why they should connect at all, inasmuch as they do.
OK. So that's the question we're going to try and answer. In order to get anywhere with this, the first thing we need to do is say what a high level concept is.
That's hard in general, but we can specialize a little bit, to a language that's gonna allow us to say something.
The idea here is that we're gonna use tools developed in causal inference and causal representation learning
to give a notion of concept which can be precisely articulated in a way that is now going to
allow us to prove theorems and make sharp predictions.
So, for us, a concept is any factor of variation that can be changed in isolation to affect the output.
These don't need to have an ontological status for the model, right?
They're just things that we can articulate, right?
So we're gonna consider the running example of sex and language. So, is this male or female, is this English or French?
These are things that we can change in a way that coherently affects the output.
We can view these either as latent variables in between the prompt and the output, or as
counterfactual pairs of outputs, where we say things like king and queen vary exactly on this concept.
And so that illustrates what the concept is.
Once we have a language, we can give a formal notion of what a linear representation is.
We're going to define things in terms of the subspace notion.
But the first insight here is that in the context of large language models,
there are actually two distinct representation spaces which we ought to consider... or at least two.
One, which I'm calling the embedding space, is the output of the large language model itself.
This is the encoding of prompts that go in on the input.
The other, which I'm calling the unembedding space, is the representations of the
words, that is, the output space at the word level.
And once we have the insight that there are these two different representation spaces,
and we have this like causal notion of what a concept is, there's sort of an obvious way of defining two different notions of subspace representation.
In the interest of time, I won't say what those are.
But if you think about what you ought to write down at this point, this is the thing that you would write down.
OK. What we get once we have something written down
- once we have a language to reason about things - is a pretty immediate result which says,
as a consequence of the softmax link function, we get that the unembedding subspace representation
will act as a logit-linear probe for the concept that is being represented,
and the embedding subspace representation will act as a steering vector for the
concept that we are interested in. So this unifies these measurement and intervention notions with the subspace notion.
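A small numerical sketch of why the softmax gives both readings. It assumes the usual form P(word | prompt) proportional to exp(lam . gamma[word]), where `lam` is the prompt's embedding and `gamma` holds the unembedding vectors. The four-word vocabulary, all the numbers, and the names `lam`, `gamma`, `steer` and `alpha` are hypothetical; the formal definitions skipped in the talk are not reproduced here.

```python
import math

# Hypothetical unembedding vectors (made-up numbers).
# Axis 0 varies with sex (male -> female), axis 1 with language (English -> French).
gamma = {
    "king":  [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
    "roi":   [0.0, 1.0, 1.0],
    "reine": [1.0, 1.0, 1.0],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_probs(lam):
    """Next-word distribution: softmax over the logits lam . gamma[word]."""
    logits = {word: dot(lam, g) for word, g in gamma.items()}
    top = max(logits.values())
    exps = {word: math.exp(z - top) for word, z in logits.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

def log_odds(probs, a, b):
    return math.log(probs[a] / probs[b])

lam = [-0.5, 0.3, 2.0]  # hypothetical embedding of some prompt

# Measurement: the unembedding difference of a counterfactual pair reads off
# the log-odds of that pair as a linear function of lam.
sex_direction = [q - k for q, k in zip(gamma["queen"], gamma["king"])]
before = softmax_probs(lam)
probe_reading = dot(lam, sex_direction)

# Intervention: add a vector to the embedding and see which log-odds move.
steer, alpha = [1.0, 0.0, 0.0], 2.0
after = softmax_probs([l + alpha * s for l, s in zip(lam, steer)])

sex_shift = log_odds(after, "queen", "king") - log_odds(before, "queen", "king")
lang_shift = log_odds(after, "roi", "king") - log_odds(before, "roi", "king")
print(probe_reading, sex_shift, lang_shift)
```

The normalizing constant cancels in every log-odds, so the probe reading matches the queen/king log-odds, the steering shift on sex equals `alpha * dot(steer, sex_direction)`, and the language log-odds stay put because `steer` has zero pairing with the language difference.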
So that still leaves us with two problems.
... Well, three because of the timer. Two problems. So first is: we still have these two distinct representation spaces.
How do they relate to each other? That's a peculiarity.
The other one is: the only reason we care about linear representations is that we can do things like measure cosine similarity in order to interpret them.
Everything we care about with these things requires some inner product.
But it's not at all clear what inner product structure we ought to use on these spaces. And so the
idea here ... and sorry I'm gonna go 30 seconds over, but I think it's forgivable in context...
So the idea that we're gonna use here is to say an inner product should have the special property that if we have two high level concepts
that can be varied fully freely of each other... So, concepts like sex and output language are distinct...
then those things should be represented orthogonally in the representation space, right?
So the Euclidean inner product generally does not have this property,
but it's a desideratum that we're going to impose... then the result is basically that if you have such
an inner product, then you get an automatic unification of the embedding and unembedding
representations through the Riesz representation theorem.
So with a suitable inner product, everything collapses.
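A toy 2-d sketch of that collapse, under loudly stated assumptions: the concept directions `g_sex` and `g_lang` and the matrix `M` are hand-picked for illustration, and `M` is just one symmetric positive-definite matrix that happens to make the two directions orthogonal. This is not how such an inner product would be estimated for a real model. In finite dimensions, the Riesz map for the inner product u^T M v sends a vector g to the linear functional with coefficient vector M g.

```python
# Hypothetical unembedding-space concept directions (made-up numbers).
g_sex = [1.0, 0.0]
g_lang = [1.0, 1.0]   # deliberately not Euclidean-orthogonal to g_sex

# Hand-picked symmetric positive-definite matrix defining <u, v>_M = u^T M v.
M = [[1.0, -1.0],
     [-1.0, 2.0]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def inner_M(u, v):
    return dot(u, matvec(M, v))

euclidean = dot(g_sex, g_lang)      # nonzero: the concepts look entangled
reweighted = inner_M(g_sex, g_lang)  # zero: orthogonal under M

# Riesz map: g_sex corresponds to the functional v -> <g_sex, v>_M,
# whose coefficient vector is M g_sex. Read it as an embedding-side vector.
lam_sex = matvec(M, g_sex)

# Paired with unembedding directions, it moves sex and leaves language alone.
effect_on_sex = dot(lam_sex, g_sex)
effect_on_lang = dot(lam_sex, g_lang)
print(euclidean, reweighted, lam_sex, effect_on_sex, effect_on_lang)
```

So in this toy case the same object serves as the probe direction (on the unembedding side) and, after applying `M`, as the steering direction (on the embedding side).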
And so the overall message... I'll just skip the slide mostly...
But the overall message is that the softmax link function plus a suitably chosen inner product causes
every notion of linear to collapse into one, right. And this is, at least, a well defined notion.
There's also a question of why this structure emerges in the first place.
We have a separate paper on this, which we used to give a totally nonparametric generalization of the idea.