
PC Benchmarks #832

Draft · wants to merge 5 commits into master

Conversation

ernestum (Collaborator) commented on Dec 18, 2023

This PR contains the changes necessary to run benchmarks for the preference comparisons (PC) algorithm.
It is also a place for planning and coordination notes on running the benchmarks.

  • Do a test run on astar to see if everything runs without errors.
  • Figure out how to properly run the tuning script on SLURM. Decided not to bother with SLURM for now; it is too much trouble for too little gain.
    Maybe with slurm-launch.py and slurm-template.sh.

Right now I think this is the best approach:
Start with slurm-template.sh and fill it in manually; call the result tune_on_slurm.sh. Don't use slurm-launch.py. Make the env and the algo parameters, just like in run_benchmark_on_slurm.sh. Add a tune_all_on_slurm.sh analogous to run_all_benchmarks_on_slurm.sh.
Follow this tutorial and this one (note: the way the head node address is determined there does not seem to work!). A sketch of how the tuning entry point could connect to such a cluster is given after the task list below.

  • Figure out what would be a good HP search space.
  • Run the tuning scripts
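
For reference, here is a minimal sketch of what the Python side of tune_on_slurm.sh could look like, assuming the tuning uses Ray Tune and that the sbatch script has already started a Ray cluster (head plus workers) with `ray start`. The file and function names below are illustrative placeholders, not the actual files in this PR; the point is only that connecting with `ray.init(address="auto")` sidesteps re-deriving the head node address in Python.

```python
# Hypothetical tuning entry point (names are placeholders, not files from this PR).
# Assumes tune_on_slurm.sh has already run `ray start --head ...` on the head node
# and `ray start --address=<head>:<port>` on the workers before invoking this script.
import argparse

import ray
from ray import tune


def trainable(config):
    # Placeholder objective: the real trainable would run preference comparisons
    # training on config["env"] and return its evaluation statistics.
    return {"mean_reward": 0.0}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", required=True, help="environment to tune on")
    parser.add_argument("--algo", default="pc", help="algorithm to tune")
    parser.add_argument("--num_samples", type=int, default=100)
    args = parser.parse_args()

    # Connect to the cluster the sbatch script started instead of trying to
    # determine the head node address here (the step the tutorials get wrong).
    ray.init(address="auto")

    tune.run(
        trainable,
        config={"env": args.env, "algo": args.algo},
        num_samples=args.num_samples,
    )


if __name__ == "__main__":
    main()
```

The sbatch script would then forward its env/algo arguments to this entry point, mirroring how run_benchmark_on_slurm.sh is parameterized.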

ernestum marked this pull request as draft on December 18, 2023, 16:52

ernestum (Collaborator, Author) commented on Jan 5, 2024

After reading through the paper, I am using the following hyperparameter search space:

| parameter | search space |
| --- | --- |
| active_selection | True/False |
| active_selection_oversampling | 2 to 10 |
| comparison_queue_size | None or 1 to total_comparisons |
| exploration_frac | 0.0 to 0.5 |
| fragment_length | 1 to trajectory length |
| gatherer_kwargs | temperature: 0 to 2<br>discount_factor: 0.95 to 1<br>sample: True/False |
| initial_comparison_frac | 0.01 to 1 |
| num_iterations | 1 to 50 |
| preference_model_kwargs | noise_prob: 0 to 0.1<br>discount_factor: 0.95 to 1 |
| query_schedule | 'constant', 'hyperbolic', 'inverse_quadratic' |
| total_comparisons | 1k (750 were enough in the paper) |
| total_timesteps | 1e7, except for pendulum, where it is 1e6 |
| trajectory_generator_kwargs | exploration_frac: 0 to 0.1<br>switch_prob: 0.1 to 1<br>random_prob: 0.1 to 0.9 |
| transition_oversampling | 0.9 to 2 |
| policy | pick a known good config from the zoo |
| reward | when active_selection is true, use the reward_ensemble named config; otherwise use the default. Note: the default is just 32x32 while the paper uses 64x64 networks |
| reward_trainer_kwargs | epochs: 1 to 10 |
| rl | pick a known good config from the zoo |

I am considering fixing active_selection=True and always using the reward ensemble, because that turned out best in the paper.
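
For concreteness, here is one way the search space above could be written down as a Ray Tune config. This is only an illustrative sketch, not the tuning code in this PR: the policy, reward, and rl rows are omitted because they are picked from named configs rather than sampled, and the fragment_length upper bound is a placeholder since the real bound depends on the environment's trajectory length.

```python
# Illustrative only: the ranges mirror the table above; bounds marked as
# placeholders depend on the environment or on other parameters.
from ray import tune

TOTAL_COMPARISONS = 1000  # "1k (750 were enough in the paper)"

search_space = {
    "active_selection": tune.choice([True, False]),
    "active_selection_oversampling": tune.randint(2, 11),  # 2 to 10 inclusive
    # None, or an int in [1, total_comparisons], expressed as a flat choice.
    "comparison_queue_size": tune.choice([None] + list(range(1, TOTAL_COMPARISONS + 1))),
    "exploration_frac": tune.uniform(0.0, 0.5),
    "fragment_length": tune.randint(1, 101),  # placeholder upper bound: trajectory length
    "gatherer_kwargs": {
        "temperature": tune.uniform(0.0, 2.0),
        "discount_factor": tune.uniform(0.95, 1.0),
        "sample": tune.choice([True, False]),
    },
    "initial_comparison_frac": tune.uniform(0.01, 1.0),
    "num_iterations": tune.randint(1, 51),
    "preference_model_kwargs": {
        "noise_prob": tune.uniform(0.0, 0.1),
        "discount_factor": tune.uniform(0.95, 1.0),
    },
    "query_schedule": tune.choice(["constant", "hyperbolic", "inverse_quadratic"]),
    "total_comparisons": TOTAL_COMPARISONS,
    "total_timesteps": int(1e7),  # 1e6 for pendulum
    "trajectory_generator_kwargs": {
        "exploration_frac": tune.uniform(0.0, 0.1),
        "switch_prob": tune.uniform(0.1, 1.0),
        "random_prob": tune.uniform(0.1, 0.9),
    },
    "transition_oversampling": tune.uniform(0.9, 2.0),
    "reward_trainer_kwargs": {"epochs": tune.randint(1, 11)},  # 1 to 10 inclusive
}
```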
