
PC Benchmarks #832

Draft · wants to merge 5 commits into master

Conversation

ernestum (Collaborator) commented on Dec 18, 2023

This PR contains the changes necessary to run benchmarks for the preference comparisons (PC) algorithm.
It is also a place for planning and coordination notes on running the benchmarks.

  • Do a test run on astar to see if everything runs without errors.
  • Figure out how to properly run the tuning script on SLURM. Decided not to bother with SLURM for now; it is too much trouble for too little gain.
    Maybe with slurm-launch.py and slurm-template.sh.

Right now I think this is the best approach:
Start with slurm-template.sh and fill it in manually; call the result tune_on_slurm.sh. Don't use slurm-launch.py. Make the env and the algo parameters, just like in run_benchmark_on_slurm.sh. Add a tune_all_on_slurm.sh analogous to run_all_benchmarks_on_slurm.sh.
Follow this tutorial and this one (note: the way the head node address is determined there does not seem to work!). A sketch of how the tuning entry point could connect to such a cluster is given after the task list below.

  • Figure out what would be a good HP search space.
  • Run the tuning scripts
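
For reference, here is a minimal sketch of what the Python side of tune_on_slurm.sh could look like, assuming the tuning uses Ray Tune and that the sbatch script has already started a Ray cluster (head plus workers) with `ray start`. The file and function names below are illustrative placeholders, not the actual files in this PR; the point is only that connecting with `ray.init(address="auto")` sidesteps re-deriving the head node address in Python.

```python
# Hypothetical tuning entry point (names are placeholders, not files from this PR).
# Assumes tune_on_slurm.sh has already run `ray start --head ...` on the head node
# and `ray start --address=<head>:<port>` on the workers before invoking this script.
import argparse

import ray
from ray import tune


def trainable(config):
    # Placeholder objective: the real trainable would run preference comparisons
    # training on config["env"] and return its evaluation statistics.
    return {"mean_reward": 0.0}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", required=True, help="environment to tune on")
    parser.add_argument("--algo", default="pc", help="algorithm to tune")
    parser.add_argument("--num_samples", type=int, default=100)
    args = parser.parse_args()

    # Connect to the cluster the sbatch script started instead of trying to
    # determine the head node address here (the step the tutorials get wrong).
    ray.init(address="auto")

    tune.run(
        trainable,
        config={"env": args.env, "algo": args.algo},
        num_samples=args.num_samples,
    )


if __name__ == "__main__":
    main()
```

The sbatch script would then forward its env/algo arguments to this entry point, mirroring how run_benchmark_on_slurm.sh is parameterized.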

ernestum marked this pull request as draft on December 18, 2023, 16:52

ernestum (Collaborator, Author) commented on Jan 5, 2024

After reading through the paper, I am using the following hyperparameter search space:

| parameter | search space |
| --- | --- |
| active_selection | True/False |
| active_selection_oversampling | 2 to 10 |
| comparison_queue_size | None or 1 to total_comparisons |
| exploration_frac | 0.0 to 0.5 |
| fragment_length | 1 to trajectory length |
| gatherer_kwargs | temperature: 0 to 2<br>discount_factor: 0.95 to 1<br>sample: True/False |
| initial_comparison_frac | 0.01 to 1 |
| num_iterations | 1 to 50 |
| preference_model_kwargs | noise_prob: 0 to 0.1<br>discount_factor: 0.95 to 1 |
| query_schedule | 'constant', 'hyperbolic', 'inverse_quadratic' |
| total_comparisons | 1k (750 were enough in the paper) |
| total_timesteps | 1e7, except for pendulum, where it is 1e6 |
| trajectory_generator_kwargs | exploration_frac: 0 to 0.1<br>switch_prob: 0.1 to 1<br>random_prob: 0.1 to 0.9 |
| transition_oversampling | 0.9 to 2 |
| policy | pick a known good config from the zoo |
| reward | when active_selection is true, use the reward_ensemble named config; otherwise use the default. Note: the default is just 32x32 while the paper uses 64x64 networks |
| reward_trainer_kwargs | epochs: 1 to 10 |
| rl | pick a known good config from the zoo |

I am considering fixing active_selection=True and always using the reward ensemble, because that turned out best in the paper.
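
For concreteness, here is one way the search space above could be written down as a Ray Tune config. This is only an illustrative sketch, not the tuning code in this PR: the policy, reward, and rl rows are omitted because they are picked from named configs rather than sampled, and the fragment_length upper bound is a placeholder since the real bound depends on the environment's trajectory length.

```python
# Illustrative only: the ranges mirror the table above; bounds marked as
# placeholders depend on the environment or on other parameters.
from ray import tune

TOTAL_COMPARISONS = 1000  # "1k (750 were enough in the paper)"

search_space = {
    "active_selection": tune.choice([True, False]),
    "active_selection_oversampling": tune.randint(2, 11),  # 2 to 10 inclusive
    # None, or an int in [1, total_comparisons], expressed as a flat choice.
    "comparison_queue_size": tune.choice([None] + list(range(1, TOTAL_COMPARISONS + 1))),
    "exploration_frac": tune.uniform(0.0, 0.5),
    "fragment_length": tune.randint(1, 101),  # placeholder upper bound: trajectory length
    "gatherer_kwargs": {
        "temperature": tune.uniform(0.0, 2.0),
        "discount_factor": tune.uniform(0.95, 1.0),
        "sample": tune.choice([True, False]),
    },
    "initial_comparison_frac": tune.uniform(0.01, 1.0),
    "num_iterations": tune.randint(1, 51),
    "preference_model_kwargs": {
        "noise_prob": tune.uniform(0.0, 0.1),
        "discount_factor": tune.uniform(0.95, 1.0),
    },
    "query_schedule": tune.choice(["constant", "hyperbolic", "inverse_quadratic"]),
    "total_comparisons": TOTAL_COMPARISONS,
    "total_timesteps": int(1e7),  # 1e6 for pendulum
    "trajectory_generator_kwargs": {
        "exploration_frac": tune.uniform(0.0, 0.1),
        "switch_prob": tune.uniform(0.1, 1.0),
        "random_prob": tune.uniform(0.1, 0.9),
    },
    "transition_oversampling": tune.uniform(0.9, 2.0),
    "reward_trainer_kwargs": {"epochs": tune.randint(1, 11)},  # 1 to 10 inclusive
}
```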
