Repr Preference Optimization

Idea:

Better alignment is achieved by aligning thoughts (internal states) rather than actions (output probabilities).

Thought Experiment

To see why this might be true, let's conduct a thought experiment. We can anthropomorphize and imagine that we have two new employees, Alice and Bob, each hired into the same role but with different intrinsic motivations:

  • Alice aligns closely with core organizational values such as truthfulness, openness, and diligence. She genuinely believes in these principles and integrates them into her daily routines.
  • Bob, on the other hand, behaves identically to Alice in every observable way. However, his actions are not driven by genuine belief in these values; he simply mimics the desired behavior to meet job expectations.

Question: In a new and unpredictable setting, such as managing a branch office remotely, who is more likely to uphold the organizational standards?

The expectation here is that Alice would likely perform better than Bob because her actions are derived from deeply held values, making her more adaptable and reliable in new situations where direct oversight or specific guidance is lacking.

Hypothesis Formulation

However, since we do not know how an LLM stores its internal states, these experiments represent hypotheses about how best to represent and intervene in a transformer's internal states.

What's our technical hypothesis?

Hypothesis: If we optimize the internal representations associated with behavioral preferences (ours), the model will generalize to new tasks better than if we optimize the output preferences directly (as DPO does).
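
To make the contrast concrete, here is a minimal sketch in PyTorch. The dpo_loss below is the standard DPO formulation on output log-probabilities; repr_preference_loss is purely illustrative: the mean-pooled hidden states, the MSE terms, and the "reroute"/"retain" split are assumptions for the sake of the example, not this repository's actual implementation.

import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO: a preference loss on output log-probabilities,
    # measured relative to a frozen reference model.
    pi_logratios = pi_chosen_logps - pi_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

def repr_preference_loss(hs_pi_rejected, hs_pi_chosen, hs_ref_chosen, retain_coef=1.0):
    # Illustrative representation loss over mean-pooled hidden states [batch, hidden]:
    # push the policy's internal states on rejected answers toward the reference
    # model's states on chosen answers ("reroute"), while keeping its states on
    # chosen answers close to the reference ("retain") to preserve capabilities.
    reroute = F.mse_loss(hs_pi_rejected, hs_ref_chosen.detach())
    retain = F.mse_loss(hs_pi_chosen, hs_ref_chosen.detach())
    return reroute + retain_coef * retain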

Testing Methodology

We test this hypothesis in a framework where we can manipulate and assess both the in-distribution and out-of-distribution alignment of a model. Specifically, this study compares our proposed method against Direct Preference Optimization (DPO) under scenarios involving significant distribution shifts, as defined in the GENIES paper.
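
As a rough illustration of how such an evaluation could work (the helpers below are a sketch, not the repository's evaluation code), one can measure preference accuracy, i.e. the fraction of (prompt, chosen, rejected) pairs where the model assigns a higher log-probability to the chosen answer, on an in-distribution split and on a shifted split:

import torch

def sequence_logprob(model, tokenizer, prompt, completion):
    # Sum of token log-probs of `completion` given `prompt`, assuming the
    # prompt's tokenization is a prefix of the full tokenization.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[:, :-1]   # logits predicting tokens 1..T-1
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logps[:, prompt_ids.shape[1] - 1:].sum().item()  # completion tokens only

@torch.no_grad()
def preference_accuracy(model, tokenizer, pairs):
    # `pairs` is a list of (prompt, chosen, rejected) strings.
    correct = sum(
        sequence_logprob(model, tokenizer, p, c) > sequence_logprob(model, tokenizer, p, r)
        for p, c, r in pairs
    )
    return correct / len(pairs)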

Status: Work in Progress

Results

In the results below we look at how much the model's accuracy improved on the training, test, out-of-distribution (OOS), and random splits when using the proposed methods, compared to DPO.

Base model: TinyLlama/TinyLlama-1.1B-Chat-v1.0

Model          Train    Test      OOS       Random
DPO            1.0459   1.0140    1.00592   0.970
REPRPO_side    1.0145   1.00702   1.0632    0.991
REPRPO_ortho   1.0162   1.0169    1.0850    0.996
REPRPO_hra     1.0163   1.0211    1.091     0.986

As you can see, DPO does best on the training split, but REPRPO_ortho does better on the test, out-of-distribution, and random splits (as do the other REPRPO variants out of distribution). This suggests that REPRPO_ortho generalizes better to new environments and loses less performance in unrelated ones.

This should be helpful when aligning AI to human values, as it suggests that aligning internal states is more robust to new environments than aligning outputs.
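
For reference, the numbers in the table read like relative accuracies. Here is a tiny sketch of how such ratios might be computed, assuming each value is the tuned model's accuracy divided by the base model's accuracy on that split (my reading of the table, not something the repository documents):

def relative_accuracy(acc_tuned, acc_base):
    # e.g. acc_tuned = {"train": 0.71, "test": 0.69, "oos": 0.62, "random": 0.50}
    # A ratio > 1.0 means the tuned model gained accuracy on that split;
    # a ratio < 1.0 means it lost accuracy there.
    return {split: acc_tuned[split] / acc_base[split] for split in acc_tuned}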

Plan

  • Get it running
  • Switch to circuit-breaking losses
  • See if we can get coherent output
  • Measure generalization of the baseline vs ReprPO

Usage

# install dependencies
poetry install

# train with the proposed method, then with the DPO baseline
python -u nbs/train.py --method reprpo_ortho
python -u nbs/train.py --method dpo

# run the tests
pytest

Citing

If this repository is useful in your own research, you can use the following BibTeX entry:

@software{wassname2024reprpo,
  author = {Clark, M.J.},
  title = {Representation Preference Optimisation: Aligning internal states generalises better than aligning outputs},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/wassname/repr-preference-optimization/},
  commit = {<commit hash>}
}
