sanity check: PPO log_ratio should be zero when training is disabled #508
Labels: bug
🐛 Describe the bug
As a sanity check, the log ratio `(logprobs - old_logprobs) * mask` in PPO (https://github.com/CarperAI/trlx/blob/main/trlx/models/modeling_ppo.py#L200) should be (close to) zero if training is disabled (i.e. the learning rate is set to zero). I have discovered that this is not the case when `method.chunk_size` does not equal `method.num_rollouts`.
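For illustration, the identity behind this sanity check can be sketched in plain Python with hypothetical logprob values: if the policy is never updated, `logprobs` equals `old_logprobs`, so the masked log ratio is exactly zero.

```python
# Hypothetical per-token logprobs; with training disabled, the recomputed
# logprobs should match the stored old_logprobs exactly.
logprobs = [-1.23, -0.45, -2.10, -0.07]
old_logprobs = [-1.23, -0.45, -2.10, -0.07]
mask = [1.0, 1.0, 1.0, 0.0]  # zero out padded positions

# The quantity checked in this issue: (logprobs - old_logprobs) * mask
log_ratio = [(lp - olp) * m for lp, olp, m in zip(logprobs, old_logprobs, mask)]
max_abs = max(abs(x) for x in log_ratio)
print(max_abs)  # 0.0
```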
Reproduction
I've created a trlx fork in which a `print(torch.abs(log_ratio).max())` is added to print the maximum deviation of the log_ratio from zero at each training step.

When running the `ppo_sentiment.py` example script with the learning rate set to zero, the log_ratios behave as expected: we see small values close to zero.
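For reference, a hypothetical sketch of the run setup, using `types.SimpleNamespace` as a stand-in for trlx's config (the field names mirror the `method.chunk_size` / `method.num_rollouts` settings discussed in this issue, but the real `TRLConfig` layout may differ):

```python
from types import SimpleNamespace

# Hypothetical stand-in for the relevant trlx config fields; names are
# assumptions based on the issue text, not the actual trlx API.
config = SimpleNamespace(
    optimizer=SimpleNamespace(kwargs={"lr": 0.0}),  # zero lr = training disabled
    method=SimpleNamespace(num_rollouts=128, chunk_size=128),  # defaults
)

# With lr == 0 the policy never moves, so the recomputed logprobs should
# equal old_logprobs and the masked log ratio should stay near zero.
training_disabled = config.optimizer.kwargs["lr"] == 0.0
print(training_disabled)  # True
```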
However, if we decrease `method.chunk_size` to something smaller than the default (= 128 == num_rollouts), e.g. 32, the log_ratios become much larger.

Expected result
Regardless of `chunk_size`, the log_ratio should be close to zero.

Which trlX version are you using?
commit hash: 0dce99d
Additional system and package information
Python 3.10.11, transformers==4.29.2