NLLB-200 1.3B token dimension compatibility #5521

Open
leqij opened this issue Jul 11, 2024 · 0 comments


leqij commented Jul 11, 2024

Hi, I have downloaded the 1.3B-dist checkpoint from the NLLB section of this repo, and it reports a vocabulary dimension of 256206 tokens in the embedding layer. However, the dictionary.txt file in the repo contains 255997 token entries. Is this a compatibility issue, or are there additional steps I am missing? I realize that resizing the dimension makes the training error go away, but the mismatch could still cause off-by-N problems in the dictionary, since the order of token embeddings matters. I've also noticed that fairseq adds 6 extra tokens for NLLB? I would appreciate it if someone could answer my questions. Have a great day!
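A minimal sketch of how one might compare the on-disk dictionary with the checkpoint's embedding size (file paths here are placeholders, not the real repo layout). By default, fairseq's `Dictionary` prepends four special tokens (`<s>`, `<pad>`, `</s>`, `<unk>`) on top of the entries listed in dictionary.txt, so that offset should be included before comparing against the embedding rows:

```python
def vocab_size_with_specials(dict_path, num_specials=4):
    """Count non-empty lines in a fairseq-style dictionary.txt and add
    the special tokens fairseq prepends at load time (4 by default)."""
    with open(dict_path, encoding="utf-8") as f:
        n_entries = sum(1 for line in f if line.strip())
    return n_entries + num_specials

# Usage sketch (assumed paths; requires torch for the checkpoint side):
#   import torch
#   ckpt = torch.load("checkpoint.pt", map_location="cpu")
#   emb_rows = ckpt["model"]["encoder.embed_tokens.weight"].shape[0]
#   print(vocab_size_with_specials("dictionary.txt"), "vs", emb_rows)
```

If the two numbers still differ after accounting for the specials and any extra language/reserved tokens, the embedding rows beyond the dictionary length would map to no real token, which is exactly the off-by-N risk described above.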
