
sanity check: PPO log_ratio should be zero when training is disabled #508

Closed
TobiasNorlund opened this issue Jun 21, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@TobiasNorlund
Contributor

TobiasNorlund commented Jun 21, 2023

🐛 Describe the bug

As a sanity check, the log ratio (logprobs - old_logprobs) * mask in PPO (https://github.com/CarperAI/trlx/blob/main/trlx/models/modeling_ppo.py#L200) should be (close to) zero when training is disabled (i.e. the learning rate is set to zero). I have discovered that this is not the case when method.chunk_size does not equal method.num_rollouts.

Reproduction

I've created a trlx fork in which a print(torch.abs(log_ratio).max()) is added to print the max log_ratio deviation from zero at each training step.
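For reference, the diagnostic boils down to something along these lines (a sketch placed next to the line linked above, not the exact diff in the fork):

log_ratio = (logprobs - old_logprobs) * mask   # the line linked above
print(torch.abs(log_ratio).max())              # should stay near zero when lr == 0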

When running the ppo_sentiments.py example script with the learning rate set to zero, the log_ratios are close to zero as expected:

python examples/ppo_sentiments.py '{"optimizer": {"kwargs": {"lr": 0}}}'
...
[losses/total_loss: 0.09 | losses/policy_loss: -0.02 | losses/value_loss: 0.11]:   2%| 24/1600 [00:03<03:39,  7.19it/s]
tensor(9.2506e-05, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.09 | losses/policy_loss: -0.02 | losses/value_loss: 0.11]:   2%| 26/1600 [00:03<03:18,  7.91it/s]
tensor(9.2506e-05, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.09 | losses/policy_loss: -0.02 | losses/value_loss: 0.11]:   2%| 26/1600 [00:03<03:18,  7.91it/s]
tensor(9.2506e-05, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.09 | losses/policy_loss: -0.02 | losses/value_loss: 0.11]:   2%| 28/1600 [00:03<03:04,  8.52it/s]
tensor(0.0002, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.15 | losses/policy_loss: 0.02 | losses/value_loss: 0.14]:   2%| 28/1600 [00:03<03:04,  8.52it/s]
tensor(0.0002, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.15 | losses/policy_loss: 0.02 | losses/value_loss: 0.14]:   2%| 30/1600 [00:03<02:54,  8.99it/s]
tensor(0.0002, device='cuda:0', grad_fn=<MaxBackward1>)
...    

As expected, we see small values close to zero.

However, if we decrease method.chunk_size to something smaller than the default (128, which equals num_rollouts), e.g. 32, the log_ratios become much larger.

python examples/ppo_sentiments.py '{"optimizer": {"kwargs": {"lr": 0}}, "method": {"chunk_size": 32}}'
...
tensor(11.6890, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.68 | losses/policy_loss: 0.56 | losses/value_loss: 0.12]:   1%| 22/1600 [00:03<06:07,  4.30it/s]
tensor(11.6890, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.68 | losses/policy_loss: 0.56 | losses/value_loss: 0.12]:   2%| 24/1600 [00:04<04:59,  5.27it/s]
tensor(13.6676, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.18 | losses/policy_loss: 0.04 | losses/value_loss: 0.14]:   2%| 24/1600 [00:04<04:59,  5.27it/s]
tensor(13.6676, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.18 | losses/policy_loss: 0.04 | losses/value_loss: 0.14]:   2%| 26/1600 [00:04<04:12,  6.24it/s]
tensor(13.6676, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.18 | losses/policy_loss: 0.04 | losses/value_loss: 0.14]:   2%| 26/1600 [00:04<04:12,  6.24it/s]
tensor(13.6676, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.18 | losses/policy_loss: 0.04 | losses/value_loss: 0.14]:   2%| 28/1600 [00:04<03:39,  7.17it/s]
tensor(11.3957, device='cuda:0', grad_fn=<MaxBackward1>)

Expected result

Regardless of chunk_size, the log_ratio should be close to zero.

Which trlX version are you using?

commit hash: 0dce99d

Additional system and package information

Python 3.10.11, transformers==4.29.2

@TobiasNorlund TobiasNorlund added the bug Something isn't working label Jun 21, 2023
@TobiasNorlund
Contributor Author

I think there are two causes of this behavior:

  1. For models with absolute positional embeddings, such as the gpt2 model used in the example, the tokens get different positional embeddings due to different amounts of left padding in make_experience(...) vs loss(...) in accelerate_ppo_trainer.py. This results in different logprobs and hence a non-zero log_ratio (see the sketch after this list).
  2. There might also be a bug when computing the mask applied to log_ratio: I think it should be shifted by one so that it also covers the last token (a second sketch below illustrates the alignment).
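The first point can be reproduced outside of trlx. Below is a minimal, hypothetical sketch (the prompt, the padding length, and the absence of a position_ids override are assumptions for illustration; exact defaults may depend on the transformers version):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("this movie was great", return_tensors="pt").input_ids
pad = torch.full((1, 4), tok.pad_token_id)                        # 4 tokens of left padding
padded_ids = torch.cat([pad, ids], dim=1)
attn = torch.cat([torch.zeros_like(pad), torch.ones_like(ids)], dim=1)

with torch.no_grad():
    logits_plain = model(ids).logits[0, -1]
    # Without an explicit position_ids override, the real tokens now sit at
    # positions 4.., so they pick up different absolute positional embeddings.
    logits_padded = model(padded_ids, attention_mask=attn).logits[0, -1]

print((logits_plain - logits_padded).abs().max())                 # noticeably non-zero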

Only with both of these changes (see #509) do I get the expected (close to) zero log_ratio when running with the modified chunk_size as above.
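For the second point, here is a hedged toy illustration of the alignment meant above (the tensors are stand-ins, not the actual change in #509):

import torch

tokens = torch.tensor([[11, 12, 13, 14, 50256]])   # last position is padding
attention_mask = (tokens != 50256).long()

# For a causal LM, logprobs[:, t] is the log-probability of tokens[:, t + 1],
# so a mask built over the token positions has to be shifted by one before it
# lines up with the logprobs that enter log_ratio.
mask = attention_mask[:, 1:]

logprobs = torch.randn(1, 4)        # toy stand-ins for per-token log-probs
old_logprobs = torch.randn(1, 4)
log_ratio = (logprobs - old_logprobs) * mask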
