{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":665967916,"defaultBranch":"main","name":"trlx","ownerLogin":"RobertKirk","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2023-07-13T12:02:05.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/9707177?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1689756147.0","currentOid":""},"activityList":{"items":[{"before":"99de1663453731d270a7760da13255ea0885b9b9","after":"87aa33198f7511bef49f4d49816da5c64baf1f23","ref":"refs/heads/main","pushedAt":"2023-07-25T12:34:24.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"RobertKirk","name":"Robert Kirk","path":"/RobertKirk","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9707177?s=80&v=4"},"commit":{"message":"Make create_train_dataloader an abstractmethod","shortMessageHtmlLink":"Make create_train_dataloader an abstractmethod"}},{"before":"ea7c2b0e92133c5bce2997d16b4a497d3dc0e0b8","after":"99de1663453731d270a7760da13255ea0885b9b9","ref":"refs/heads/main","pushedAt":"2023-07-24T10:02:50.000Z","pushType":"push","commitsCount":6,"pusher":{"login":"maxreciprocate","name":"Max","path":"/maxreciprocate","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/56548574?s=80&v=4"},"commit":{"message":"Merge branch 'main' into fix-ppo-order","shortMessageHtmlLink":"Merge branch 'main' into fix-ppo-order"}},{"before":"10369a198d4841dfa2a3e382516cdaa5aaf7f2d4","after":"ea7c2b0e92133c5bce2997d16b4a497d3dc0e0b8","ref":"refs/heads/main","pushedAt":"2023-07-22T14:47:09.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"maxreciprocate","name":"Max","path":"/maxreciprocate","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/56548574?s=80&v=4"},"commit":{"message":"revert(accelerate_ilql_trainer): remove `shuffle` option","shortMessageHtmlLink":"revert(accelerate_ilql_trainer): remove shuffle option"}},{"before":"8a943d90afd6b791420750e073c760d8c898f3a9","after":"10369a198d4841dfa2a3e382516cdaa5aaf7f2d4","ref":"refs/heads/main","pushedAt":"2023-07-22T14:19:11.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"maxreciprocate","name":"Max","path":"/maxreciprocate","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/56548574?s=80&v=4"},"commit":{"message":"fix(accelerate_trainer): fill a missing `shuffle` argument","shortMessageHtmlLink":"fix(accelerate_trainer): fill a missing shuffle argument"}},{"before":"fe3368143336ca19160f43bc2ac3ffce8661ee2a","after":"8a943d90afd6b791420750e073c760d8c898f3a9","ref":"refs/heads/main","pushedAt":"2023-07-19T08:43:18.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"RobertKirk","name":"Robert Kirk","path":"/RobertKirk","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/9707177?s=80&v=4"},"commit":{"message":"Reset train dataloder at each iteration\n\nThis way we get better shuffling. 
2023-07-19 · RobertKirk created branch fix-ppo-order at 8a943d9 (the same "Reset train dataloader at each iteration" commit as the push above)

2023-07-13 · RobertKirk pushed 1 commit to main (1446523 → fe33681)
    Fix ordering of ppo epoch iteration

    Suppose you have 256 rollouts and batch them into batches B1, B2, B3,
    B4 (each of size 64). The order of gradient updates (assuming 3
    ppo_epochs) is:

    `trlx: B1 B1 B1 B2 B2 B2 B3 B3 B3 B4 B4 B4`

    However, what we should actually be doing (as alpaca-farm, other RLHF
    implementations, and standard PPO implementations do) is:

    `improved: B1 B2 B3 B4 B1 B2 B3 B4 B1 B2 B3 B4`

    It would be even better to produce new random batches at each
    ppo_epoch, though that would require more refactoring, i.e.:

    `optimal: B1 B2 B3 B4 B1' B2' B3' B4' B1* B2* B3* B4*`

    This change reorders the learning loop to use the `improved` ordering
    above. It also renames n_updates_per_batch to n_inner_epochs, as that
    is a more accurate description (especially now), adjusts forward_time
    and backward_time so they no longer type-error, and renames mbs and mb
    to minibatch and microbatch (as that is what they are).
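To make the orderings in the message concrete, here is a self-contained sketch that prints the `trlx` and `improved` update sequences; `batches`, `n_inner_epochs`, and `update` are illustrative stand-ins, not trlx's real API.

```python
# Illustrative only: the names below are stand-ins, not trlx identifiers.
batches = ["B1", "B2", "B3", "B4"]  # four minibatches of 64 rollouts each
n_inner_epochs = 3                  # formerly n_updates_per_batch

def update(batch: str) -> None:
    print(batch, end=" ")  # stand-in for one gradient update

# Old trlx ordering: exhaust all inner epochs on one batch before moving on.
for batch in batches:
    for _ in range(n_inner_epochs):
        update(batch)
print("<- trlx: B1 B1 B1 B2 B2 B2 ...")

# Improved ordering: one full pass over every batch per inner epoch.
for _ in range(n_inner_epochs):
    for batch in batches:
        update(batch)
print("<- improved: B1 B2 B3 B4 B1 B2 B3 B4 ...")
```

The `optimal` ordering would additionally re-draw random batches (B1', B2', ...) at the start of each inner epoch, which is why the commit notes it would require more refactoring.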