Runtime error when running examples (ilql_sentiments_t5.py) #587

Open
youxiho1 opened this issue Jan 8, 2024 · 2 comments
Labels
bug Something isn't working

Comments

youxiho1 commented Jan 8, 2024

🐛 Describe the bug

Hi, I'm running the examples provided in the official GitHub repo.
I simply ran the command "python ilql_sentiments_t5.py".

However, I ran into a runtime error while the first checkpoint was being saved:

[RANK 0] Saving intermediate optimizer & model checkpoint into ckpts/checkpoint_1000
Traceback (most recent call last):
  File "/home/user/workspace/trlx/examples/ilql_sentiments_t5.py", line 140, in <module>
    main()
  File "/home/user/workspace/trlx/examples/ilql_sentiments_t5.py", line 130, in main
    trlx.train(
  File "/home/user/workspace/trlx/trlx/trlx.py", line 142, in train
    trainer.learn()
  File "/home/user/workspace/trlx/trlx/trainer/accelerate_base_trainer.py", line 598, in learn
    self.save(directory)
  File "/home/user/workspace/trlx/trlx/trainer/accelerate_base_trainer.py", line 312, in save
    self.accelerator.save_state(dst_dir, **kwargs)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/accelerator.py", line 2708, in save_state
    save_location = save_accelerator_state(
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/checkpointing.py", line 99, in save_accelerator_state
    save(state, output_model_file, save_on_each_node=save_on_each_node, safe_serialization=safe_serialization)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/utils/other.py", line 181, in save
    save_func(obj, f)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/safetensors/torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/safetensors/torch.py", line 467, in _flatten
    raise RuntimeError(
RuntimeError:
    Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'base_model.lm_head.weight', 'base_model.shared.weight', 'base_model.decoder.embed_tokens.weight', 'base_model.encoder.embed_tokens.weight'}].
    A potential way to correctly save your model is to use save_model.
    More information at https://huggingface.co/docs/safetensors/torch_shared_tensors

It seems that saving the model checkpoint is what fails.

I have no idea how to fix this. (Maybe I should modify the corresponding part of the trlx source code?)
Thanks for your help!
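
Editorial note: the error message in the traceback lists four parameter names that refer to the same memory, which is why safetensors refuses to write them as separate tensors. The sketch below shows one way such aliases can be spotted in a state dict. It is a simplification, not the check safetensors itself performs: it only compares `data_ptr()` values, and `FakeTensor` is a hypothetical stand-in so the example runs without PyTorch. The `final_layer_norm` entry is an illustrative non-shared parameter; the other names are copied from the error message.

```python
from collections import defaultdict


def find_shared_tensors(state_dict):
    """Return groups of names whose tensors start at the same memory address.

    Works with any object exposing ``data_ptr()``, as ``torch.Tensor`` does.
    """
    by_address = defaultdict(list)
    for name, tensor in state_dict.items():
        by_address[tensor.data_ptr()].append(name)
    # Only addresses referenced by more than one name are shared.
    return [sorted(names) for names in by_address.values() if len(names) > 1]


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor (assumption, for illustration only)."""

    def __init__(self, address):
        self._address = address

    def data_ptr(self):
        return self._address


# Tied embeddings: four names, one buffer (names taken from the error message).
embedding = FakeTensor(0x1000)
state_dict = {
    "base_model.shared.weight": embedding,
    "base_model.encoder.embed_tokens.weight": embedding,
    "base_model.decoder.embed_tokens.weight": embedding,
    "base_model.lm_head.weight": embedding,
    # Illustrative parameter with its own buffer.
    "base_model.encoder.final_layer_norm.weight": FakeTensor(0x2000),
}

shared_groups = find_shared_tensors(state_dict)
print(shared_groups)
```

With a real model, `model.state_dict()` could be passed in place of the hand-built dictionary. Any group it reports is a candidate for the `save_model` route that the error message recommends.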

Which trlX version are you using?

trlx==0.7.0

Additional system and package information

python 3.9.18, transformers 4.36.2, ubuntu 18.04
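
Editorial note: the traceback passes through accelerate and safetensors, whose versions are not listed above. A small standard-library script like the following could collect them for a report. The package list is an assumption about what is relevant here, not something the issue template asks for.

```python
import platform
from importlib import metadata

# Packages that appear in the traceback or the report (assumed to be relevant).
PACKAGES = ["trlx", "transformers", "accelerate", "safetensors", "torch"]


def collect_versions(packages=PACKAGES):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in collect_versions().items():
    print(f"{name}=={version}")
```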

youxiho1 added the bug label on Jan 8, 2024
@xunguangwang

@youxiho1 Did you solve this problem?

DesikRengarajan commented Feb 21, 2024

I have the same issue when running ppo_sentiments.py.
I have an imperfect workaround: I just don't save the optimizer and model during training.
config.train.save_best = False
config.train.save_optimizer = False
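
Editorial note: a possible alternative that keeps checkpoints is to make the `save_state` call fall back to the non-safetensors format. The traceback shows a `safe_serialization` argument being passed down inside accelerate, so the sketch below assumes that the installed `Accelerator.save_state` accepts that keyword. This is an untested assumption against trlx 0.7.0, and `prefer_torch_serialization` is a hypothetical helper name, not part of either library.

```python
import functools


def prefer_torch_serialization(accelerator):
    """Wrap ``accelerator.save_state`` so it defaults to safe_serialization=False.

    Assumes save_state accepts a ``safe_serialization`` keyword (see lead-in).
    An explicit value passed by the caller is left untouched.
    """
    original_save_state = accelerator.save_state

    @functools.wraps(original_save_state)
    def save_state(output_dir=None, **kwargs):
        kwargs.setdefault("safe_serialization", False)
        return original_save_state(output_dir, **kwargs)

    # Shadow the bound method on this instance only.
    accelerator.save_state = save_state
    return accelerator
```

It would have to be applied to the trainer's `accelerator` attribute (the one visible in the traceback as `self.accelerator`) before the first checkpoint is written. Where to hook that in depends on how the trainer is constructed and is left open here.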
