Repr Preference Optimization

Idea:

Better alignment is achieved by aligning thoughts (internal states) rather than actions (output probabilities).

Thought Experiment

To see why this might be true, let's conduct a thought experiment. We can anthropomorphize and imagine that we have two new employees, Alice and Bob, each hired into the same role but with different intrinsic motivations:

  • Alice aligns closely with core organizational values such as truthfulness, openness, and diligence. She genuinely believes in these principles and integrates them into her daily routines.
  • Bob, on the other hand, behaves identically to Alice in every observable way. However, his actions are not driven by genuine belief in these values; he simply mimics the desired behavior to meet job expectations.

Question: In a new and unpredictable setting, such as managing a branch office remotely, who is more likely to uphold the organizational standards?

The expectation here is that Alice would likely perform better than Bob because her actions are derived from deeply held values, making her more adaptable and reliable in new situations where direct oversight or specific guidance is lacking.

Hypothesis Formulation

However, since we do not know how an LLM stores its internal states, these experiments represent hypotheses about how best to represent and intervene in a transformer's internal states.

What's our technical hypothesis?

Hypothesis: If we optimize the internal representations associated with behavioral preferences (ours), the model will generalize to new tasks better than if we optimize the output preferences directly (as DPO does).
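
To make the contrast concrete, here is a minimal sketch in PyTorch. The dpo_loss below is the standard DPO formulation on output log-probabilities; repr_preference_loss is purely illustrative: the mean-pooled hidden states, the MSE terms, and the "reroute"/"retain" split are assumptions for the sake of the example, not this repository's actual implementation.

import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO: a preference loss on output log-probabilities,
    # measured relative to a frozen reference model.
    pi_logratios = pi_chosen_logps - pi_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

def repr_preference_loss(hs_pi_rejected, hs_pi_chosen, hs_ref_chosen, retain_coef=1.0):
    # Illustrative representation loss over mean-pooled hidden states [batch, hidden]:
    # push the policy's internal states on rejected answers toward the reference
    # model's states on chosen answers ("reroute"), while keeping its states on
    # chosen answers close to the reference ("retain") to preserve capabilities.
    reroute = F.mse_loss(hs_pi_rejected, hs_ref_chosen.detach())
    retain = F.mse_loss(hs_pi_chosen, hs_ref_chosen.detach())
    return reroute + retain_coef * retain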

Testing Methodology

We test this hypothesis in a framework where we can manipulate and assess both the in-distribution and out-of-distribution alignment of a model. Specifically, this study compares our proposed method against Direct Preference Optimization (DPO) under scenarios involving significant distribution shifts, as defined in the GENIES paper.
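
As a rough illustration of how such an evaluation could work (the helpers below are a sketch, not the repository's evaluation code), one can measure preference accuracy, i.e. the fraction of (prompt, chosen, rejected) pairs where the model assigns a higher log-probability to the chosen answer, on an in-distribution split and on a shifted split:

import torch

def sequence_logprob(model, tokenizer, prompt, completion):
    # Sum of token log-probs of `completion` given `prompt`, assuming the
    # prompt's tokenization is a prefix of the full tokenization.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[:, :-1]   # logits predicting tokens 1..T-1
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logps[:, prompt_ids.shape[1] - 1:].sum().item()  # completion tokens only

@torch.no_grad()
def preference_accuracy(model, tokenizer, pairs):
    # `pairs` is a list of (prompt, chosen, rejected) strings.
    correct = sum(
        sequence_logprob(model, tokenizer, p, c) > sequence_logprob(model, tokenizer, p, r)
        for p, c, r in pairs
    )
    return correct / len(pairs)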

Status: Work in Progress

Results

In the results below we look at how much the model's accuracy improved on the training, test, out-of-distribution (OOS), and random splits when using the proposed methods, compared to DPO.

Base model: TinyLlama/TinyLlama-1.1B-Chat-v1.0

Model          Train    Test      OOS       Random
DPO            1.0459   1.0140    1.00592   0.970
REPRPO_side    1.0145   1.00702   1.0632    0.991
REPRPO_ortho   1.0162   1.0169    1.0850    0.996
REPRPO_hra     1.0163   1.0211    1.091     0.986

As you can see, DPO does best on the training split, but REPRPO_ortho does better on the test, out-of-distribution, and random splits (as do the other REPRPO variants out of distribution). This suggests that REPRPO_ortho generalizes better to new environments and loses less performance in unrelated ones.

This should be helpful when aligning AI to human values, as it suggests that aligning internal states is more robust to new environments than aligning outputs.
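
For reference, the numbers in the table read like relative accuracies. Here is a tiny sketch of how such ratios might be computed, assuming each value is the tuned model's accuracy divided by the base model's accuracy on that split (my reading of the table, not something the repository documents):

def relative_accuracy(acc_tuned, acc_base):
    # e.g. acc_tuned = {"train": 0.71, "test": 0.69, "oos": 0.62, "random": 0.50}
    # A ratio > 1.0 means the tuned model gained accuracy on that split;
    # a ratio < 1.0 means it lost accuracy there.
    return {split: acc_tuned[split] / acc_base[split] for split in acc_tuned}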

Plan

  • Get it running
  • Switch to circuit-breaking losses
  • See if we can get coherent output
  • Measure generalization of the baseline vs ReprPO

Usage

# install dependencies
poetry install

# train with the proposed method, then with the DPO baseline
python -u nbs/train.py --method reprpo_ortho
python -u nbs/train.py --method dpo

# run the tests
pytest

Citing

If this repository is useful in your own research, you can use the following BibTeX entry:

@software{wassname2024reprpo,
  author = {Clark, M.J.},
  title = {Representation Preference Optimisation: Aligning internal states generalises better than aligning outputs},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/wassname/repr-preference-optimization/},
  commit = {<commit hash>}
}
