NLLB-200 1.3B token dimension compatibility #5521

Open
leqij opened this issue Jul 11, 2024 · 0 comments


leqij commented Jul 11, 2024

Hi, I have downloaded the 1.3B-dist checkpoint from the NLLB section of this repo, and it reports a vocabulary dimension of 256206 tokens in the embedding layer. However, the dictionary.txt file in the repo contains 255997 token entries. Is this a compatibility issue, or are there additional steps I am missing? I realize that resizing the dimension makes the training error go away, but the mismatch could still cause off-by-N problems in the dictionary, since the order of token embeddings matters. I've also noticed that fairseq adds 6 extra tokens for NLLB? I would appreciate it if someone could answer my questions. Have a great day!
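A minimal sketch of how one might compare the on-disk dictionary with the checkpoint's embedding size (file paths here are placeholders, not the real repo layout). By default, fairseq's `Dictionary` prepends four special tokens (`<s>`, `<pad>`, `</s>`, `<unk>`) on top of the entries listed in dictionary.txt, so that offset should be included before comparing against the embedding rows:

```python
def vocab_size_with_specials(dict_path, num_specials=4):
    """Count non-empty lines in a fairseq-style dictionary.txt and add
    the special tokens fairseq prepends at load time (4 by default)."""
    with open(dict_path, encoding="utf-8") as f:
        n_entries = sum(1 for line in f if line.strip())
    return n_entries + num_specials

# Usage sketch (assumed paths; requires torch for the checkpoint side):
#   import torch
#   ckpt = torch.load("checkpoint.pt", map_location="cpu")
#   emb_rows = ckpt["model"]["encoder.embed_tokens.weight"].shape[0]
#   print(vocab_size_with_specials("dictionary.txt"), "vs", emb_rows)
```

If the two numbers still differ after accounting for the specials and any extra language/reserved tokens, the embedding rows beyond the dictionary length would map to no real token, which is exactly the off-by-N risk described above.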
