
Allow using tokenizer's default chat template or pass custom jinja chat template #1732

Open

chiragjn wants to merge 16 commits into main from cj_tokenizer_default_prompt_template

Conversation

chiragjn
Contributor

@chiragjn chiragjn commented Jul 9, 2024

Closes #1689

Summary of changes:

  1. Adds tokenizer_default as an option for chat_template in the chat_template prompt strategy, allowing use of the chat template from the tokenizer's tokenizer_config.json
  2. Allows falling back to chat templates available in axolotl if the tokenizer does not have a chat template
  3. Allows passing in a custom chat template via a new option, chat_template_jinja, which is only considered when the chat_template option is set to jinja (see the example config sketched below)
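
For illustration, a sketch of how these options might appear in an axolotl config (only the option names come from this PR; the dataset entry is a placeholder borrowed from examples later in this thread):

chat_template: tokenizer_default
# or, to fall back to a built-in template when the tokenizer does not ship one:
# chat_template: tokenizer_default_fallback_chatml
# or, to supply a custom Jinja template:
# chat_template: jinja
# chat_template_jinja: "{{ ... }}"

datasets:
  - path: LDJnr/Puffin
    type: chat_template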

Why?

Many popular models are not trained with the chatml format. As a result, for the model to correctly learn chatml, we have to turn on train_on_inputs, which requires more compute and time. If we can use the model's already-learned chat template, we only need to learn the output tokens.


@chiragjn chiragjn marked this pull request as ready for review July 9, 2024 20:49
@chiragjn
Contributor Author

chiragjn commented Jul 9, 2024

On a separate note, I would also like to explore how we can support passing a jinja template entirely via the config itself, which might be more flexible.

_DEFAULT_FALLBACK_CHATML_TEMPLATE_CHOICE_PREFIX = "tokenizer_default_fallback_"


def chat_templates(user_choice: str, tokenizer=None):
Collaborator

I had similarly included the tokenizer as an argument to the chat template helper initially, but was trying to minimize this module's interdependency with tokenizers. Seems like this is a reasonable workaround for now. Thanks!

@winglian
Collaborator

Thanks! Would you have some time to put together some tests for this?

Summary of changes:

1. Adds `tokenizer_default` as an option for `chat_template` in the
   `chat_template` prompt strategy, allowing use of the chat template
   from the tokenizer's tokenizer_config.json
2. Allows falling back to chat templates available in axolotl if the
   tokenizer does not have a chat template
3. Adds a mistral chat template which supports a system message - taken
   from https://github.com/chujiezheng/chat_templates/blob/main/chat_templates/mistral-instruct.jinja

---

Why?

Many popular models are not trained with the chatml format. As a result,
for the model to correctly learn chatml we have to turn on train_on_inputs,
which requires more compute and time. If we can use the model's
already-learned chat template, we only need to learn the output tokens.

---

Todo:

- Write tests
@chiragjn chiragjn force-pushed the cj_tokenizer_default_prompt_template branch from f2f01a8 to 4e38cea Compare July 12, 2024 03:35
@chiragjn
Contributor Author

Added a few tests

I wonder if it is a good idea to use the chat_template available from the trained model's tokenizer_config.json as the default template?

Yes, that is what this PR aims to do. If we all agree, we can change the default in the config to tokenizer_default_fallback_chatml :)

@ganler

ganler commented Jul 12, 2024

Yes. I am sorry - I just read the PR in more detail and found that it's exactly what I was hoping for (what a coincidence!), so I removed my previous dumb question. Thanks for the contribution!

@ganler

ganler commented Jul 12, 2024

I have one question. Can this option be used together with the sharegpt conversation format?

@winglian
Collaborator

I have one question. Can this option be used together with the sharegpt conversation format?

Unfortunately no, they are different templating libraries.

@Tostino
Contributor

Tostino commented Jul 15, 2024

Thank you for working on this! I needed exactly this... and was going to either integrate my custom chat template into axolotl (which limits my experimenting) or work on getting this feature going. Glad you beat me to it. Doesn't look like it'll conflict much with the changes I made for #1756.

@chiragjn
Contributor Author

Thank you for working on this! I needed exactly this... and was going to either integrate my custom chat template into axolotl (which limits my experimenting) or work on getting this feature going. Glad you beat me to it. Doesn't look like it'll conflict much with the changes I made for #1756.

Amazing, would love to see both getting merged

@chiragjn chiragjn requested a review from winglian July 30, 2024 20:13
@chiragjn
Contributor Author

Added a chat_template_jinja option to provide a Jinja template, as per @fozziethebeat's suggestion
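
For example, a sketch (the template body below is a generic chatml-style Jinja template following the Hugging Face apply_chat_template conventions, not something taken from this PR):

chat_template: jinja
chat_template_jinja: |
  {% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}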

@chiragjn chiragjn changed the title from "Allow using tokenizer's default chat template with fallbacks" to "Allow using tokenizer's default chat template or pass custom jinja chat template" Jul 30, 2024
@chiragjn
Contributor Author

chiragjn commented Aug 6, 2024

Bumping for a review here @winglian

@nyxkrage

Regarding point 4, it seems that the jinja template from that repo is very different from how fastchat/axolotl currently formats the prompt, and from Mistral's own templates:

  • FastChat/Axolotl currently places the system message, with a single newline, inside the first user turn
  • Mistral v0.1 places the system message, with two newlines, inside the first user turn, and with different spacing between the [INST] tokens
  • The Mistral v3 tokenizer places the system message, with two newlines, inside the last user turn

The one in the linked repo puts the system prompt before the first [INST]. I'm unsure if this is an oversight by the creator of that template/repo or deliberately chosen to be different from the official mistral template.

@chiragjn
Contributor Author

chiragjn commented Aug 19, 2024

I can get rid of the mistral template now; this PR allows passing a custom jinja template anyway. I had only added it for convenience.

deliberately chosen to be different from the official mistral template

Most likely deliberate, as older mistral models did not support a system message - so there was no official way to include one for those models.

@chiragjn
Contributor Author

Bumping this again @winglian

@NanoCode012 NanoCode012 self-requested a review August 23, 2024 14:47
Collaborator

@NanoCode012 NanoCode012 left a comment

Hey there. Thank you for the PR. We will try to get this PR in. To start off, I have some points below:

  • We can probably assume chat_template==jinja whenever chat_template_jinja is set, without requiring it to be set separately?
  • Could these new configs be added to docs/config? We may also consider adding an example to docs/dataset-formats/conversation showing how to use this chat_template method, as it was not documented before.
  • Which chat_template is being loaded, the root one or the one within ds_config? I wasn't able to tell clearly from the code. Maybe the docs need to be improved to clarify this (see the sketch below).
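
To make the last point concrete, the question is which of the two settings below takes effect when both are present (the shape mirrors the config examples later in this thread; values are placeholders):

chat_template: chatml                  # root-level setting
datasets:
  - path: LDJnr/Puffin                 # placeholder dataset
    type: chat_template
    chat_template: tokenizer_default   # dataset-level (ds_config) setting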

@NanoCode012
Collaborator

NanoCode012 commented Aug 26, 2024

Please let me know if there's any part you need help with. As a summary, I think the only parts left to be addressed are:

  • Documentation in config.qmd listing all these new configs
  • Documentation in conversation.qmd showing a few examples of this and explaining the new options
  • Addressing the comments above
  • Tests for DPO / ORPO (if not done)

@chiragjn chiragjn force-pushed the cj_tokenizer_default_prompt_template branch from f183b5e to e05fc1f Compare August 26, 2024 22:58
@chiragjn chiragjn force-pushed the cj_tokenizer_default_prompt_template branch from e05fc1f to 8a84408 Compare August 26, 2024 23:00
@chiragjn
Contributor Author

Have addressed the comments and updated the docs to the best of my knowledge. Will test DPO/ORPO.

One thing I am not clear on: what should the behavior be if someone provides different chat templates across different datasets? It causes ambiguity in selecting which template to finally save in the tokenizer. Ideally, no one should do it this way, but there is no validation to protect against it.
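
For example, nothing currently rejects a hypothetical config like the following, where the two datasets ask for different templates and it is unclear which one should end up saved with the tokenizer:

datasets:
  - path: dataset-a                    # hypothetical dataset
    type: chat_template
    chat_template: chatml
  - path: dataset-b                    # hypothetical dataset
    type: chat_template
    chat_template: tokenizer_default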

@NanoCode012
Collaborator

NanoCode012 commented Aug 27, 2024

One thing I am not clear on: what should the behavior be if someone provides different chat templates across different datasets? It causes ambiguity in selecting which template to finally save in the tokenizer. Ideally, no one should do it this way, but there is no validation to protect against it.

That's a good point. I missed this.

In this case, we may need to adjust the hash to include the chat_template (if it exists?).

f"{d.path}:{d.type}:{d.shards}:{d.conversation}{d.split}"

@levguy

levguy commented Aug 28, 2024

Hi @chiragjn ,
I'd like to point out a potential issue that might occur when training DPO with prompt strategy chat_template.default and using the base model's (or custom) chat template.
In many open source models (e.g. Phi-3-medium-128k-instruct), the chat template validates that the first message has the user role, and raises an exception otherwise.
Consider the code here:

        result["chosen"] = tokenizer.apply_chat_template(
            [chosen],
            add_generation_prompt=False,
            chat_template=chat_template_string,
            tokenize=False,
        )

We pass [chosen] here, which is a list containing a single message with the assistant role. Hence the validation mentioned above will raise an exception.

This can be solved by adding a dummy user message, e.g.:

        # prepend a placeholder user turn so templates that require the first
        # message to come from the user do not raise an exception
        dummy_user_message = {"role": "user", "content": "dummy"}
        result["chosen"] = tokenizer.apply_chat_template(
            [dummy_user_message, chosen],
            add_generation_prompt=False,
            chat_template=chat_template_string,
            tokenize=False,
        )

For trimming the dummy message, no additional code is needed, as it will be trimmed by the lines that follow:

        chosen_strip_index = result["chosen"].find(chosen["content"])
        result["chosen"] = result["chosen"][chosen_strip_index:].rstrip()

These lines remove what comes before the actual message content (typically an assistant header / special token).

The same workaround is needed in the rejected case as well (here).

@NanoCode012
Collaborator

Hello @levguy, thank you for the find. Would it be better to separate that into a different PR to limit the scope of this one? I think we would want to get this PR in first.

@chiragjn, may I ask if there is anything blocking this PR now? I believe you have addressed the points I raised. Only the hash of the tokenized dataset remains.

chat_template: Union[
    ChatTemplate,
    Annotated[str, StringConstraints(pattern="^tokenizer_default_fallback_")],
] = ChatTemplate.chatml
Collaborator

@NanoCode012 NanoCode012 Sep 5, 2024

Does this chatml default also need to be changed to tokenizer_default?

I ran with the model below, which has no chat_template in its config, and it defaulted to processing into chatml format.

base_model: mlx-community/gemma-2-2b

datasets:
  - path: LDJnr/Puffin
    type: chat_template

@NanoCode012
Collaborator

I also notice that the loss masking may have some issues for chat_template (likely not due to this PR, but an old issue).

For example, it sometimes does not detect the assistant portion correctly.

# user masked correctly
(-100, 235322) |(-100, 235371) im(-100, 571) _(-100, 235298) start(-100, 2997) |>(-100, 73786) user(-100, 1645) 
(-100, 108) Each(-100, 7639)  problem(-100, 3210)  consists(-100, 13569)  of(-100, 576)  three(-100, 2149)
...
# assistant also masked
 <(-100, 235322) |(-100, 235371) im(-100, 571) _(-100, 235298) start(-100, 2997) |>(-100, 73786) assistant(-100, 105776) 
(-100, 108) If(-100, 2495)  the(-100, 573)  third(-100, 4906)  statement(-100, 6218)  is(-100, 603)  true(-100, 1382)  ((-100, 591) Bananas(-100, 217061)  cost(-100, 2458) 

Config to recreate:

base_model: mlx-community/gemma-2-2b

datasets:
  - path: LDJnr/Puffin
    type: chat_template
    chat_template: chatml
    shards: 10
chat_template: chatml

@chiragjn
Contributor Author

chiragjn commented Sep 5, 2024

Sorry, I did not mean to block this PR.
It is just that I have been busy at my main job. I wanted to address the DPO issues before merging; I'll make some time for this.

@fozziethebeat
Contributor

I'm excited to see this land!


Successfully merging this pull request may close these issues.

Using native chat_template from tokenizer config in chat_template strategy
8 participants