
fix eval for moe layer #124

Open
wants to merge 1 commit into main

Conversation

NeosZhang

When I run /exp/dmoe/dmoe_46m_8gpu.sh, I encounter the following error during evaluation.

[screenshot of the error traceback]

With this change, save_load_balancing_loss is bypassed during evaluation.

@mvpatel2000
Contributor

@NeosZhang is there a reason you are calling the batched loss during eval? We explicitly do not store routing stats during evaluation, since doing so would affect the next update after an eval. Instead, the recommended solution is to avoid calling the loss fn here.

We can probably give a friendlier error here, though... CC: @eitanturok
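A hedged sketch of the caller-side workaround described above: skip the load-balancing term entirely during eval rather than changing the layer. The function and loop are illustrative stand-ins, not megablocks' actual training loop:

```python
# Illustrative: add the load-balancing term only while training, since
# routing stats are only accumulated in training mode.
def compute_loss(task_loss, load_balancing_loss_fn, training):
    loss = task_loss
    if training:
        # Only the training path touches the stored routing stats.
        loss = loss + load_balancing_loss_fn()
    return loss

print(compute_loss(1.0, lambda: 0.5, training=True))   # prints 1.5
print(compute_loss(1.0, lambda: 0.5, training=False))  # prints 1.0
```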

@eitanturok
Collaborator

@mvpatel2000, I took a look at the exp/dmoe/dmoe_46m_8gpu.sh script, and it seems we may need to modify the arguments used there to avoid the error.
