Null vocab_file Issue with mistral v03 based models when using union tokenizer source #394

Open
guillermo-gabrielli-fer opened this issue Aug 9, 2024 · 2 comments

guillermo-gabrielli-fer commented Aug 9, 2024

Environment

Conda environment:
python=3.10
mergekit commit f086664 (latest as of yesterday)
transformers from git @ git+https://github.com/huggingface/transformers 85817d98fb60977c97e3014196a462b732d2ed1a (latest as of yesterday)

The same issue occurs with the transformers version installed by mergekit (I believe 4.44).

Issue

When merging two models based on the Mistral v0.3 base, mergekit saves the base tokenizer to a temporary directory to avoid mutating it ("# HACK: save base tokenizer to temp dir and reload to avoid mutating base_tok"), but then fails to load it back.

Configuration file (these are not the models I was trying originally, but they reproduce the issue):

models:
  - model: mistralai/Mistral-7B-v0.3
  - model: mistralai/Mistral-7B-Instruct-v0.3
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.3
tokenizer:
  source: union
parameters:
  t:
    - value: 0.8
dtype: bfloat16

Originally I was trying to merge the base model with a model that has a custom tokenizer with the same vocabulary size but different tokens (I can link that model if needed). However, I hit the same issue with any Mistral v0.3-based model, so the custom tokenizer doesn't appear to be the cause.

Command:

mergekit-yaml report_issue_mistral.yaml EXAMPLE_MISTRAL_ISSUE/ --out-shard-size 1B --cuda --lazy-unpickle -v

Exception (abridged traceback):

mergekit/mergekit/tokenizer/build.py", line 155, in build_union_tokenizer
    res = transformers.AutoTokenizer.from_pretrained(
[......]
transformers/models/llama/tokenization_llama.py", line 201, in get_spm_processor
    with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
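
For reference, here is my best guess at what that failing step boils down to, as a standalone sketch (the save/reload arguments are guessed from the traceback and from the hack comment in build_union_tokenizer; I haven't verified that this exact snippet reproduces the error outside mergekit):

import tempfile
import transformers

# Load the base tokenizer, then save/reload it through a temp dir the way mergekit's hack does.
base_tok = transformers.AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
with tempfile.TemporaryDirectory() as p:
    base_tok.save_pretrained(p, safe_serialization=True)
    # This reload is where the traceback ends up with vocab_file=None.
    res = transformers.AutoTokenizer.from_pretrained(p, use_fast=True)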

I could get past that error by also saving with legacy_format=True, but then it fails with:

mergekit/mergekit/tokenizer/embed.py", line 62, in execute
    token_configs = dict(**self.tokens) or {}
TypeError: dict() argument after ** must be a mapping, not NoneType

I could get the merge to finish by moving the {} fallback inside the dict() call (see the sketch below), but I'm not sure yet whether the result is correct.
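
The change I mean, as a sketch against the line shown in the traceback (I haven't checked whether the merged tokenizer is actually correct afterwards):

# mergekit/tokenizer/embed.py, line from the traceback above:
token_configs = dict(**self.tokens) or {}
# Moving the {} fallback inside the dict() call lets the merge finish when self.tokens is None:
token_configs = dict(**(self.tokens or {}))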

tracebacks.txt

pip_freeze.txt

ZakariaSakab commented Aug 28, 2024

I faced the same issue; this change might fix it:

# HACK: save base tokenizer to temp dir and reload to avoid mutating base_tok
with tempfile.TemporaryDirectory() as p:
    base_tok.save_pretrained(p, legacy_format=True, safe_serialization=True)
    res = transformers.AutoTokenizer.from_pretrained(
        p, use_fast=True, trust_remote_code=trust_remote_code
    )

ZakariaSakab commented

I also found an issue when trying to quantize the resulting model using llama.cpp: the temp folder in this case is deleted as soon as execution leaves the code block, which results in the loss of the tokenizer vocab file.
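
One way I could imagine working around that, building on the snippet above (just a sketch; out_dir is a hypothetical placeholder for the merge output directory, and I'm assuming the file llama.cpp is missing is the sentencepiece tokenizer.model): copy the vocab file out of the temp dir before the context manager deletes it.

import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as p:
    base_tok.save_pretrained(p, legacy_format=True, safe_serialization=True)
    res = transformers.AutoTokenizer.from_pretrained(
        p, use_fast=True, trust_remote_code=trust_remote_code
    )
    # Keep the sentencepiece vocab around after the temp dir is cleaned up.
    vocab_file = os.path.join(p, "tokenizer.model")
    if os.path.exists(vocab_file):
        shutil.copy(vocab_file, out_dir)  # out_dir: hypothetical merge output directory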
