Hi, everybody. Your RLHF fine-tuning is secretly applying a regret preference model.
Let me explain. So, let me describe a slide that I guess didn't make it into the final cut here. From the
perspective of an RLHF algorithm, there is a human who has a hidden reward function, and they
are giving preferences according to a preference model. Sampling from this preference model is what creates the preference dataset.
Then MLE is used with another preference model, usually the same one, to learn a reward
function, or sometimes something else like a policy.
The predominant preference model is what we call partial return.
Under it, the probability of a person preferring trajectory segment 1
over trajectory segment 2 is the logistic function
applied to the sum of reward over trajectory segment 1 minus the sum of reward over trajectory segment 2.
We're ignoring discounting and the Boltzmann temperature for simplicity here.
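As a minimal sketch, the partial return preference model just described can be written as follows (function and variable names are mine, not from the talk; discounting and the Boltzmann temperature are omitted, as stated):

```python
import math

def partial_return_preference(rewards_1, rewards_2):
    """Probability that an annotator prefers trajectory segment 1 over
    segment 2 under the partial-return (Bradley-Terry-style) model:
    the logistic function of the difference in summed rewards."""
    delta = sum(rewards_1) - sum(rewards_2)
    return 1.0 / (1.0 + math.exp(-delta))
```

Note that two segments with equal summed reward get probability 0.5, i.e., the model is indifferent between them.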
So let's look at a simple example: a grid world where the purpose is to get to the goal as fast as possible,
and so an aligned reward function is a time penalty of negative one every time step.
Here, the sum of reward, the partial return, is the same for both segments,
so the preference model is indifferent between these two trajectory segments,
even though humans strongly tend to prefer the one on the right.
If you prefer differently, please talk to me afterwards.
So what's different here? The one on the right is optimal,
and it also has a different end-state value.
So with those insights and some others in mind, we propose the regret preference model, and the difference is that
we swap out the reward function for the optimal advantage function.
We sum the optimal advantage of every transition in each of these trajectory segments,
and this summation, when negated, is what we call regret. But at a high level,
really what you need to know is that the regret of a trajectory segment is measuring how much
it deviates from optimal behavior. So we're measuring deviation from optimality.
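A hedged sketch of the regret preference model as just defined (names and the example numbers below are mine, not from the talk):

```python
import math

def regret_preference(adv_1, adv_2):
    """Regret preference model: the same logistic form as partial return,
    but each per-step reward is replaced by the optimal advantage A*(s, a).
    The negated sum of optimal advantages is the segment's regret, so the
    segment that deviates less from optimal behavior is preferred."""
    delta = sum(adv_1) - sum(adv_2)
    return 1.0 / (1.0 + math.exp(-delta))
```

In a grid world like the one above, an optimal segment has A* = 0 at every step, while a suboptimal one accumulates negative advantage (hypothetical numbers, e.g. `[0, -2]`), so this model prefers the optimal segment even when both partial returns are equal.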
Looking at our example again, the one on the right is optimal,
so the regret preference model prefers it, as humans tend to.
When we compare these two preference models, we find that the regret one is theoretically superior in terms of identifiability, and
on human preferences that we gathered, it's more descriptive and learns more aligned reward functions.
So why does the partial return preference model, which really is the predominant one, do so
well? For fine-tuning, our hypothesis is that annotators do give regret-based preferences,
and engineers doing fine-tuning are unknowingly applying the regret preference model.
And so this match of preference models creates some alignment. So let me dive into this hypothesis here.
The multi-turn language problem involves a human giving a prompt, then the language model
responding, then prompt, response, prompt, response,
and so on. The way RLHF is done in this setting is that the partial return preference model is assumed, at least nominally, and
the trajectory segment length is one. So one action, or one response, is what is being evaluated. Interestingly,
the learned reward function is applied as if in a bandit task. However, this is not a bandit problem;
this is a sequential problem. And the decision rule that you see at the very bottom right
is what's being applied for RLHF fine-tuning. To derive it from a sequential task,
we have to assume that the discount factor gamma equals zero to make the next-state
value drop off. And this is an arbitrary assumption for an under-specified problem.
If instead we assume that preferences come from regret, then instead of learning the reward function,
what we've learned is an approximation of the optimal advantage function.
And we end up with the exact same decision rule without having to assume anything about the discount factor.
That decision rule arg-maxes the learned function, which
in this case is interpreted to be the optimal advantage function.
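A minimal sketch of that shared decision rule (the function names are illustrative, not from the talk):

```python
def choose_response(learned_fn, prompt, candidate_responses):
    """The fine-tuning decision rule: pick the response that maximizes
    the learned scalar function of (prompt, response). Under the
    partial-return reading with gamma = 0, learned_fn is read as a
    reward; under the regret reading, it is read as the optimal
    advantage A*. The argmax, and so the algorithm, is the same."""
    return max(candidate_responses, key=lambda a: learned_fn(prompt, a))
```

Only the interpretation of `learned_fn` changes between the two preference models; the rule itself is identical.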
So we get the same fine tuning algorithm with a better supported preference model.
And without the arbitrary assumption that this discount factor is zero. So you might say so what the algorithm is the same? Why?
Why does this matter? Well, if the segment length is larger than one, if you're evaluating more than one response in a trajectory segment,
and you assume this discount factor of zero, then the partial return preference model nonsensically ignores all actions after the first,
whereas the regret preference model results in a more reasonable algorithm.
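To see that nonsensical behavior concretely, here is a hedged sketch of a discounted partial return (my own illustration, with gamma = 0 treated as 0^0 = 1 at the first step):

```python
def discounted_partial_return(rewards, gamma):
    """Discounted partial return of a segment: sum_t gamma^t * r_t.
    With gamma = 0, only the first step's reward survives, so for a
    segment longer than one, every later action is ignored by the
    partial return preference model."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

With `gamma = 0`, segments `[2, 5, 5]` and `[2, -5, -5]` get the same score, even though they differ everywhere after the first step.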
And then also, as a scientific act of faith, I believe that a clearer and simpler understanding of our algorithms will bear fruit later.
Last, I'll just end with a teaser. This is the latest work that uses the regret preference model.
It generalizes DPO and learns a policy directly in the maximum-entropy RL setting rather than learning a reward function first.
And so, like DPO, it avoids having to sample from the environment after learning from preferences.
And we scale up to robotic manipulation tasks in the paper, which is on arXiv.
I'll end with that, and just one quick plug for Stephane Hatgis-Kessel,
who was co-first author on this work. He will be applying to a number of your research groups,
and he would be a great find. Thank you.