
Speeding up computation while using SPMD on large TPU pod #7987

Open
dudulightricks opened this issue Sep 10, 2024 · 3 comments

@dudulightricks

❓ Questions and Help

When running on a vp-128 TPU pod (even when sharding only along the batch dimension), we are seeing very low performance compared to the same pod without SPMD.

Do you have any tips for improving performance? Specific SPMD arguments? Things to keep in mind when using it? Anything would help, because right now performance is lower than the non-SPMD run by a significant factor.
@JackCaoG

@JackCaoG
Collaborator

Do you have a profile (xplane file) you can share? It is hard to guess what's happening without looking at the profile.

@giuliano-97

@JackCaoG I've been trying to fine-tune Gemma-2 9B on v4 / v5 pods with FSDP + SPMD using HF transformers and torch XLA, and I also have the impression that training is slow. Do you have any benchmarks for training LLMs with the same setup?

@JackCaoG
Collaborator

Replied in the other thread.
