Null vocab_file Issue with mistral v03 based models when using union tokenizer source #394

Open
guillermo-gabrielli-fer opened this issue Aug 9, 2024 · 2 comments

guillermo-gabrielli-fer commented Aug 9, 2024

Environment

Conda environment:
python=3.10
mergekit commit f086664 (latest as of yesterday)
transformers from git @ git+https://github.com/huggingface/transformers 85817d98fb60977c97e3014196a462b732d2ed1a (latest as of yesterday)

The same issue occurs with the transformers version installed by mergekit (I believe 4.44).

Issue

When merging two models based on the Mistral v0.3 base, mergekit saves the base tokenizer to a temporary directory to avoid mutating it ("# HACK: save base tokenizer to temp dir and reload to avoid mutating base_tok"), but then fails to load it back.

Configuration file (these are not the models I was trying originally, but they reproduce the issue):

models:
  - model: mistralai/Mistral-7B-v0.3
  - model: mistralai/Mistral-7B-Instruct-v0.3
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.3
tokenizer:
  source: union
parameters:
  t:
    - value: 0.8
dtype: bfloat16

Originally I was trying to merge the base model with a model that has a custom tokenizer with the same vocabulary size but different tokens (I can link that model if needed). However, I hit the same issue with any Mistral v0.3-based model, so the custom tokenizer doesn't appear to be the cause.

Command:

mergekit-yaml report_issue_mistral.yaml EXAMPLE_MISTRAL_ISSUE/ --out-shard-size 1B --cuda --lazy-unpickle -v

Exception (abridged traceback):

mergekit/mergekit/tokenizer/build.py", line 155, in build_union_tokenizer
    res = transformers.AutoTokenizer.from_pretrained(
[......]
transformers/models/llama/tokenization_llama.py", line 201, in get_spm_processor
    with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
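
For reference, here is my best guess at what that failing step boils down to, as a standalone sketch (the save/reload arguments are guessed from the traceback and from the hack comment in build_union_tokenizer; I haven't verified that this exact snippet reproduces the error outside mergekit):

import tempfile
import transformers

# Load the base tokenizer, then save/reload it through a temp dir the way mergekit's hack does.
base_tok = transformers.AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
with tempfile.TemporaryDirectory() as p:
    base_tok.save_pretrained(p, safe_serialization=True)
    # This reload is where the traceback ends up with vocab_file=None.
    res = transformers.AutoTokenizer.from_pretrained(p, use_fast=True)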

I could get past that error by also saving with legacy_format=True, but then it fails with:

mergekit/mergekit/tokenizer/embed.py", line 62, in execute
    token_configs = dict(**self.tokens) or {}
TypeError: dict() argument after ** must be a mapping, not NoneType

I could get the merge to finish by moving the {} fallback inside the dict() call (see the sketch below), but I'm not sure yet whether the result is correct.
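
The change I mean, as a sketch against the line shown in the traceback (I haven't checked whether the merged tokenizer is actually correct afterwards):

# mergekit/tokenizer/embed.py, line from the traceback above:
token_configs = dict(**self.tokens) or {}
# Moving the {} fallback inside the dict() call lets the merge finish when self.tokens is None:
token_configs = dict(**(self.tokens or {}))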

tracebacks.txt

pip_freeze.txt

ZakariaSakab commented Aug 28, 2024

I faced the same issue; this change might fix it:

# HACK: save base tokenizer to temp dir and reload to avoid mutating base_tok
with tempfile.TemporaryDirectory() as p:
    base_tok.save_pretrained(p, legacy_format=True, safe_serialization=True)
    res = transformers.AutoTokenizer.from_pretrained(
        p, use_fast=True, trust_remote_code=trust_remote_code
    )

ZakariaSakab commented

I also found an issue when trying to quantize the resulting model using llama.cpp: the temp folder in this case is deleted as soon as execution leaves the code block, which results in the loss of the tokenizer vocab file.
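
One way I could imagine working around that, building on the snippet above (just a sketch; out_dir is a hypothetical placeholder for the merge output directory, and I'm assuming the file llama.cpp is missing is the sentencepiece tokenizer.model): copy the vocab file out of the temp dir before the context manager deletes it.

import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as p:
    base_tok.save_pretrained(p, legacy_format=True, safe_serialization=True)
    res = transformers.AutoTokenizer.from_pretrained(
        p, use_fast=True, trust_remote_code=trust_remote_code
    )
    # Keep the sentencepiece vocab around after the temp dir is cleaned up.
    vocab_file = os.path.join(p, "tokenizer.model")
    if os.path.exists(vocab_file):
        shutil.copy(vocab_file, out_dir)  # out_dir: hypothetical merge output directory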
