Deprecate `convert_to_singleton` #691
@andrewPoulton Is this true though? I was unable to convert the 8 shards successfully to a singleton.
@ayeeyecorp Can you share the stack trace? I suspect it might be related to the fact that the checkpoints available on the OPT page are flattened, which are not compatible with `convert_to_singleton`.
@tangbinh let's add a flat param check to `reshard_*`, and raise an error unless the user specifically wants to unflatten. I'll create an issue to track in a bit. Happy to own as well.
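A minimal sketch of what that guard could look like, assuming metaseq's flattened FSDP checkpoints expose keys containing `flat_param` under a `model` entry (both of those are assumptions to verify against a real shard):

```python
def assert_unflattened(state_dict: dict, allow_flat: bool = False) -> None:
    """Raise unless the checkpoint is unflattened or the caller opted in."""
    model_sd = state_dict.get("model", state_dict)
    flat_keys = [k for k in model_sd if "flat_param" in k]
    if flat_keys and not allow_flat:
        raise ValueError(
            f"Checkpoint contains flattened FSDP params (e.g. {flat_keys[0]}); "
            "enable unflattening explicitly if this is intended."
        )
```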
@andrewPoulton I was adding an option to split the KVQ weights in `convert_to_singleton`. Once we fix #625, I think we can safely remove it.
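For illustration, that QKV split could look like the following sketch; the fused layout and the q/k/v ordering here are assumptions, not metaseq's confirmed format:

```python
import torch

def split_qkv(qkv_weight: torch.Tensor, embed_dim: int):
    """Split a fused (3 * embed_dim, embed_dim) projection into q, k, v."""
    q, k, v = torch.split(qkv_weight, embed_dim, dim=0)
    return q, k, v

# toy usage
w = torch.randn(3 * 8, 8)
q, k, v = split_qkv(w, 8)
print(q.shape, k.shape, v.shape)  # each torch.Size([8, 8])
```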
@andrewPoulton - I did not save the stack trace from that particular test - can redo. However, here is the tail-end snippet of the stack trace after running `convert_to_singleton`:
The 992 shards were first converted to 8 consolidated shards with `reshard_fsdp`.
Should I have set `unflatten-weights` differently?
@ayeeyecorp Just so I'm clear - you first ran `reshard_fsdp` on the shards (with `unflatten-weights=true`), then tried running `convert_to_singleton` on the consolidated shards? If that's so, then can you try running `reshard_mp` on the consolidated shards instead?
> you first ran reshard_fsdp on the shards (with unflatten-weights=true), then tried running convert_to_singleton on the consolidated shards?

Correct, this resulted in the error above.

> If that's so, then can you try running reshard_mp on the consolidated shards instead?

Will do that again shortly and post the stack trace results.
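In the meantime, a quick sanity check before re-running `reshard_mp` is to load one consolidated shard and confirm no flattened params remain; the filename and key layout below are assumptions:

```python
import torch

# Hypothetical path to one consolidated model-parallel part.
sd = torch.load("consolidated/reshard-model_part-0.pt", map_location="cpu")
model_sd = sd.get("model", sd)
flat = [k for k in model_sd if "flat_param" in k]
print("flattened keys remaining:", flat[:5] if flat else "none")
```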
@ayeeyecorp May I ask why you want to convert the 8 MP parts of OPT 175B into a singleton? I don't think you would be able to load the singleton into any GPU considering its size, which is about 350GB.
I started over earlier today from the 992 shards (resetting my environment per the instructions here, using Python 3.8) and verified that the 8 consolidated FSDP shards had the correct md5sum. Upon confirmation, I converted the checkpoints to a single one (eliminating use of MP) with the `reshard_mp` script.
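For reference, with checkpoints this large the md5 is best computed in streaming fashion so the file never has to fit in memory; a minimal sketch (the filename is hypothetical):

```python
import hashlib

def md5_of(path: str, chunk: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks to keep memory usage flat."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

print(md5_of("restored.pt"))  # hypothetical checkpoint filename
```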
Not sure what the original problem was. The md5sum of the single checkpoint (325.2 GB) was:

The subsequent step to convert to Hugging Face failed after 1+ hour with the following stack trace:
I followed @patrickvonplaten's conversion instructions found here and generated a config.json with the following:
Thoughts on what could be going wrong with the HF conversion? I will re-run the operation overnight and log the full failure stack trace.

@tangbinh - thank you for the clarification. I am converting the 8 MP parts of OPT 175B into a singleton to run quantization experiments against it.
@ayeeyecorp For OPT 175B, we should have `word_embed_proj_dim` equal to `hidden_size` (12288); it looks like you may have copied values from a smaller HF checkpoint.
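For reference, a sketch of the key OPT-175B dimensions (values taken from the public OPT paper and configs; verify against metaseq before relying on them):

```python
# Assumed OPT-175B config values; note word_embed_proj_dim equals
# hidden_size here, unlike some smaller OPT checkpoints (e.g. 350M).
opt_175b_config = {
    "hidden_size": 12288,
    "word_embed_proj_dim": 12288,
    "ffn_dim": 49152,
    "num_hidden_layers": 96,
    "num_attention_heads": 96,
    "max_position_embeddings": 2048,
    "vocab_size": 50272,
    "do_layer_norm_before": True,
}
```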
@tangbinh that was quick! Brilliant, will give that a go now. I blindly used values from HF... thank you.
After updating my instance to 1TB+ of RAM, I successfully generated a .bin file using the HF conversion script. Thanks for the support.
As noted in #689, `convert_to_singleton` doesn't produce state dicts with compatible keys (for some unknown reason).
Since `reshard_mp` can do the same job, without the GPU node requirement of `convert_to_singleton`, we should deprecate `convert_to_singleton`.
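For intuition, merging MP parts mostly amounts to concatenating each partitioned weight along its split dimension; this toy sketch is not `reshard_mp`'s actual implementation, and which keys are column- versus row-parallel is an assumption:

```python
import torch

def merge_parts(parts, dim):
    """Column-parallel weights concat on dim 0, row-parallel on dim 1."""
    return torch.cat(parts, dim=dim)

# toy example: two halves of a column-parallel weight
part0, part1 = torch.randn(4, 8), torch.randn(4, 8)
print(merge_parts([part0, part1], dim=0).shape)  # torch.Size([8, 8])
```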
TODO: Work out dependencies on `convert_to_singleton`, and identify any special cases it can handle that `reshard_mp` can't (such as separating out QKV weights, as noted by @tangbinh).