Runtime error when running examples (ilql_sentiments_t5.py) #587

youxiho1 · 2024-01-08T06:44:09Z

🐛 Describe the bug

Hi, I'm running the examples provided in the official GitHub repo.
I simply ran the command "python ilql_sentiments_t5.py".

However, I encountered a runtime error:

[RANK 0] Saving intermediate optimizer & model checkpoint into ckpts/checkpoint_1000
Traceback (most recent call last):
  File "/home/user/workspace/trlx/examples/ilql_sentiments_t5.py", line 140, in <module>
    main()
  File "/home/user/workspace/trlx/examples/ilql_sentiments_t5.py", line 130, in main
    trlx.train(
  File "/home/user/workspace/trlx/trlx/trlx.py", line 142, in train
    trainer.learn()
  File "/home/user/workspace/trlx/trlx/trainer/accelerate_base_trainer.py", line 598, in learn
    self.save(directory)
  File "/home/user/workspace/trlx/trlx/trainer/accelerate_base_trainer.py", line 312, in save
    self.accelerator.save_state(dst_dir, **kwargs)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/accelerator.py", line 2708, in save_state
    save_location = save_accelerator_state(
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/checkpointing.py", line 99, in save_accelerator_state
    save(state, output_model_file, save_on_each_node=save_on_each_node, safe_serialization=safe_serialization)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/accelerate/utils/other.py", line 181, in save
    save_func(obj, f)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/safetensors/torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/home/user/anaconda3/envs/trlx/lib/python3.9/site-packages/safetensors/torch.py", line 467, in _flatten
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'base_model.lm_head.weight', 'base_model.shared.weight', 'base_model.decoder.embed_tokens.weight', 'base_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
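
To make the message concrete: the four names in the error refer to one and the same weight, stored under several keys of the state dict. Below is a minimal, dependency-free sketch of how such aliasing can be detected. It is only an analogy, not safetensors' actual check (which inspects tensor storage); `FakeTensor`, `find_shared_groups`, and the `base_model.some_other.weight` entry are hypothetical names introduced for illustration.

```python
from collections import defaultdict


class FakeTensor:
    """Stand-in for a tensor; only object identity matters in this sketch."""


def find_shared_groups(state_dict):
    """Group parameter names that point at the same underlying object."""
    by_identity = defaultdict(set)
    for name, tensor in state_dict.items():
        by_identity[id(tensor)].add(name)
    # Only groups with more than one name count as "shared".
    return [names for names in by_identity.values() if len(names) > 1]


# One embedding object referenced under four names, as in the error above.
embedding = FakeTensor()
state_dict = {
    "base_model.shared.weight": embedding,
    "base_model.encoder.embed_tokens.weight": embedding,
    "base_model.decoder.embed_tokens.weight": embedding,
    "base_model.lm_head.weight": embedding,
    "base_model.some_other.weight": FakeTensor(),  # hypothetical unshared entry
}

shared_groups = find_shared_groups(state_dict)
print(shared_groups)
```

A serializer that writes each key independently would store that one weight four times, which is the duplication the error message warns about.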

It seems that the error occurs while saving the model checkpoint.

I have no idea how to fix this. (Maybe I should modify the corresponding part of the trlx source code?)
Thanks for your help!

Which trlX version are you using?

trlx==0.7.0

Additional system and package information

python 3.9.18, transformers 4.36.2, ubuntu 18.04


xunguangwang · 2024-01-14T14:29:49Z

@youxiho1 Did you solve this problem?

DesikRengarajan · 2024-02-21T20:15:34Z

I have the same issue when running ppo_sentiments.py.
I have an imperfect workaround: I just don't save the optimizer and model during training.
config.train.save_best = False
config.train.save_optimizer = False
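
For context, here is a rough sketch of where those two lines might sit, assuming they are applied to the config object before the example script's existing trlx.train(...) call. A `SimpleNamespace` stands in for the real trlX config so the snippet runs without trlX installed; `apply_save_workaround` is a hypothetical helper name, not part of trlX.

```python
from types import SimpleNamespace


def apply_save_workaround(config):
    """Set the two flags from this comment on a trlX-style config object."""
    config.train.save_best = False
    config.train.save_optimizer = False
    return config


# Stand-in for the config built in the example script; the real object comes
# from trlX and has many more fields (assumption: it exposes `.train`).
config = SimpleNamespace(train=SimpleNamespace(save_best=True, save_optimizer=True))
config = apply_save_workaround(config)

# In the example script, the existing trlx.train(...) call would follow here.
print(config.train.save_best, config.train.save_optimizer)
```

This only sidesteps the failing save path; it does not change how the tied weights are serialized.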

youxiho1 added the "bug" label (Something isn't working) on Jan 8, 2024
