
sanity check: PPO log_ratio should be zero when training is disabled #508

Closed
TobiasNorlund opened this issue Jun 21, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@TobiasNorlund
Contributor

TobiasNorlund commented Jun 21, 2023

🐛 Describe the bug

As a sanity check, the log ratio (logprobs - old_logprobs) * mask in PPO (https://github.com/CarperAI/trlx/blob/main/trlx/models/modeling_ppo.py#L200) should be (close to) zero when training is disabled (i.e. the learning rate is set to zero). I have discovered that this is not the case when method.chunk_size does not equal method.num_rollouts.

Reproduction

I've created a trlx fork in which a print(torch.abs(log_ratio).max()) is added to print the max log_ratio deviation from zero at each training step.
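For reference, the diagnostic boils down to something along these lines (a sketch placed next to the line linked above, not the exact diff in the fork):

log_ratio = (logprobs - old_logprobs) * mask   # the line linked above
print(torch.abs(log_ratio).max())              # should stay near zero when lr == 0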

When running the ppo_sentiments.py example script with the learning rate set to zero, the log_ratios are close to zero as expected:

python examples/ppo_sentiments.py '{"optimizer": {"kwargs": {"lr": 0}}}'
...
[losses/total_loss: 0.09 | losses/policy_loss: -0.02 | losses/value_loss: 0.11]:   2%| 24/1600 [00:03<03:39,  7.19it/s]
tensor(9.2506e-05, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.09 | losses/policy_loss: -0.02 | losses/value_loss: 0.11]:   2%| 26/1600 [00:03<03:18,  7.91it/s]
tensor(9.2506e-05, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.09 | losses/policy_loss: -0.02 | losses/value_loss: 0.11]:   2%| 26/1600 [00:03<03:18,  7.91it/s]
tensor(9.2506e-05, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.09 | losses/policy_loss: -0.02 | losses/value_loss: 0.11]:   2%| 28/1600 [00:03<03:04,  8.52it/s]
tensor(0.0002, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.15 | losses/policy_loss: 0.02 | losses/value_loss: 0.14]:   2%| 28/1600 [00:03<03:04,  8.52it/s]
tensor(0.0002, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.15 | losses/policy_loss: 0.02 | losses/value_loss: 0.14]:   2%| 30/1600 [00:03<02:54,  8.99it/s]
tensor(0.0002, device='cuda:0', grad_fn=<MaxBackward1>)
...    

As expected, we see small values close to zero.

However, if we decrease method.chunk_size to something smaller than the default (128, which equals num_rollouts), e.g. 32, the log_ratios become much larger.

python examples/ppo_sentiments.py '{"optimizer": {"kwargs": {"lr": 0}}, "method": {"chunk_size": 32}}'
...
tensor(11.6890, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.68 | losses/policy_loss: 0.56 | losses/value_loss: 0.12]:   1%| 22/1600 [00:03<06:07,  4.30it/s]
tensor(11.6890, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.68 | losses/policy_loss: 0.56 | losses/value_loss: 0.12]:   2%| 24/1600 [00:04<04:59,  5.27it/s]
tensor(13.6676, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.18 | losses/policy_loss: 0.04 | losses/value_loss: 0.14]:   2%| 24/1600 [00:04<04:59,  5.27it/s]
tensor(13.6676, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.18 | losses/policy_loss: 0.04 | losses/value_loss: 0.14]:   2%| 26/1600 [00:04<04:12,  6.24it/s]
tensor(13.6676, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.18 | losses/policy_loss: 0.04 | losses/value_loss: 0.14]:   2%| 26/1600 [00:04<04:12,  6.24it/s]
tensor(13.6676, device='cuda:0', grad_fn=<MaxBackward1>)
[losses/total_loss: 0.18 | losses/policy_loss: 0.04 | losses/value_loss: 0.14]:   2%| 28/1600 [00:04<03:39,  7.17it/s]
tensor(11.3957, device='cuda:0', grad_fn=<MaxBackward1>)

Expected result

Regardless of chunk_size, the log_ratio should be close to zero.

Which trlX version are you using?

commit hash: 0dce99d

Additional system and package information

Python 3.10.11, transformers==4.29.2

@TobiasNorlund TobiasNorlund added the bug Something isn't working label Jun 21, 2023
@TobiasNorlund
Contributor Author

I think there are two causes of this behavior:

  1. For models with absolute positional embeddings, such as the gpt2 model used in the example, the tokens get different positional embeddings due to different amounts of left padding in make_experience(...) vs loss(...) in accelerate_ppo_trainer.py. This results in different logprobs and hence a non-zero log_ratio (see the sketch after this list).
  2. There might also be a bug when computing the mask applied to log_ratio: I think it should be shifted by one so that it also covers the last token (a second sketch below illustrates the alignment).
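The first point can be reproduced outside of trlx. Below is a minimal, hypothetical sketch (the prompt, the padding length, and the absence of a position_ids override are assumptions for illustration; exact defaults may depend on the transformers version):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("this movie was great", return_tensors="pt").input_ids
pad = torch.full((1, 4), tok.pad_token_id)                        # 4 tokens of left padding
padded_ids = torch.cat([pad, ids], dim=1)
attn = torch.cat([torch.zeros_like(pad), torch.ones_like(ids)], dim=1)

with torch.no_grad():
    logits_plain = model(ids).logits[0, -1]
    # Without an explicit position_ids override, the real tokens now sit at
    # positions 4.., so they pick up different absolute positional embeddings.
    logits_padded = model(padded_ids, attention_mask=attn).logits[0, -1]

print((logits_plain - logits_padded).abs().max())                 # noticeably non-zero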

Only with both of these changes (see #509) do I get the expected (close to) zero log_ratio when running with the modified chunk_size as above.
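For the second point, here is a hedged toy illustration of the alignment meant above (the tensors are stand-ins, not the actual change in #509):

import torch

tokens = torch.tensor([[11, 12, 13, 14, 50256]])   # last position is padding
attention_mask = (tokens != 50256).long()

# For a causal LM, logprobs[:, t] is the log-probability of tokens[:, t + 1],
# so a mask built over the token positions has to be shifted by one before it
# lines up with the logprobs that enter log_ratio.
mask = attention_mask[:, 1:]

logprobs = torch.randn(1, 4)        # toy stand-ins for per-token log-probs
old_logprobs = torch.randn(1, 4)
log_ratio = (logprobs - old_logprobs) * mask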
