
Allow using tokenizer's default chat template or pass custom jinja chat template #1732

Open

chiragjn wants to merge 16 commits into main from cj_tokenizer_default_prompt_template

Conversation

chiragjn
Contributor

@chiragjn chiragjn commented Jul 9, 2024

Closes #1689

Summary of changes:

  1. Adds tokenizer_default as an option for chat_template in the chat_template prompt strategy, allowing use of the chat template from the tokenizer's tokenizer_config.json
  2. Allows falling back to chat templates available in axolotl if the tokenizer does not have a chat template
  3. Allows passing in a custom chat template via a new option, chat_template_jinja, which is only considered when the chat_template option is set to jinja (see the example config sketched below)
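
For illustration, a sketch of how these options might appear in an axolotl config (only the option names come from this PR; the dataset entry is a placeholder borrowed from examples later in this thread):

chat_template: tokenizer_default
# or, to fall back to a built-in template when the tokenizer does not ship one:
# chat_template: tokenizer_default_fallback_chatml
# or, to supply a custom Jinja template:
# chat_template: jinja
# chat_template_jinja: "{{ ... }}"

datasets:
  - path: LDJnr/Puffin
    type: chat_template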

Why?

Many popular models are not trained with the chatml format. As a result, for the model to correctly learn chatml, we have to turn on train_on_inputs, which requires more compute and time. If we can use the model's already-learned chat template, we only need to learn the output tokens.


@chiragjn chiragjn marked this pull request as ready for review July 9, 2024 20:49
@chiragjn
Contributor Author

chiragjn commented Jul 9, 2024

On a separate note, I would also like to explore how we can support passing a jinja template entirely via the config itself, which might be more flexible.

_DEFAULT_FALLBACK_CHATML_TEMPLATE_CHOICE_PREFIX = "tokenizer_default_fallback_"


def chat_templates(user_choice: str, tokenizer=None):
Collaborator

I had similarly included the tokenizer as an argument to the chat template helper initially, but was trying to minimize this module's interdependency with tokenizers. Seems like this is a reasonable workaround for now. Thanks!

@winglian
Collaborator

Thanks! Would you have some time to put together some tests for this?

Summary of changes:

1. Adds `tokenizer_default` as an option for `chat_template` in the
   `chat_template` prompt strategy, allowing use of the chat template
   from the tokenizer's tokenizer_config.json
2. Allows falling back to chat templates available in axolotl if the
   tokenizer does not have a chat template
3. Adds a mistral chat template which supports a system message - taken
   from https://github.com/chujiezheng/chat_templates/blob/main/chat_templates/mistral-instruct.jinja

---

Why?

Many popular models are not trained with the chatml format. As a result,
for the model to correctly learn chatml we have to turn on train_on_inputs,
which requires more compute and time. If we can use the model's
already-learned chat template, we only need to learn the output tokens.

---

Todo:

- Write tests
@chiragjn chiragjn force-pushed the cj_tokenizer_default_prompt_template branch from f2f01a8 to 4e38cea Compare July 12, 2024 03:35
@chiragjn
Contributor Author

Added a few tests

I wonder if it is a good idea to use the chat_template available from the trained model's tokenizer_config.json as the default template?

Yes, that is what this PR aims to do. If we all agree, we can change the default in the config to tokenizer_default_fallback_chatml :)

@ganler

ganler commented Jul 12, 2024

Yes. I am sorry - I just read the PR in more detail and found that it's exactly what I was hoping for (what a coincidence!), so I removed my previous dumb question. Thanks for the contribution!

@ganler

ganler commented Jul 12, 2024

I have one question. Can this option be used together with the sharegpt conversation format?

@winglian
Collaborator

I have one question. Can this option be used together with the sharegpt conversation format?

Unfortunately no, they are different templating libraries.

@Tostino
Contributor

Tostino commented Jul 15, 2024

Thank you for working on this! I needed exactly this... and was going to either integrate my custom chat template into axolotl (which limits my experimenting) or work on getting this feature going. Glad you beat me to it. Doesn't look like it'll conflict much with the changes I made for #1756.

@chiragjn
Contributor Author

Thank you for working on this! I needed exactly this... and was going to either integrate my custom chat template into axolotl (which limits my experimenting) or work on getting this feature going. Glad you beat me to it. Doesn't look like it'll conflict much with the changes I made for #1756.

Amazing, would love to see both getting merged

@chiragjn chiragjn requested a review from winglian July 30, 2024 20:13
@chiragjn
Contributor Author

Added a chat_template_jinja option to provide a Jinja template, as per @fozziethebeat's suggestion
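
For example, a sketch (the template body below is a generic chatml-style Jinja template following the Hugging Face apply_chat_template conventions, not something taken from this PR):

chat_template: jinja
chat_template_jinja: |
  {% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}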

@chiragjn chiragjn changed the title from "Allow using tokenizer's default chat template with fallbacks" to "Allow using tokenizer's default chat template or pass custom jinja chat template" Jul 30, 2024
@chiragjn
Contributor Author

chiragjn commented Aug 6, 2024

Bumping for a review here @winglian

@nyxkrage

Regarding point 4, it seems that the jinja template from that repo is very different from how fastchat/axolotl currently formats the prompt, and from Mistral's own templates:

  • FastChat/Axolotl currently places the system message, with a single newline, inside the first user turn
  • Mistral v0.1 places the system message, with two newlines, inside the first user turn, and with different spacing between the [INST] tokens
  • The Mistral v3 tokenizer places the system message, with two newlines, inside the last user turn

The one in the linked repo puts the system prompt before the first [INST]. I'm unsure if this is an oversight by the creator of that template/repo or deliberately chosen to be different from the official mistral template.

@chiragjn
Contributor Author

chiragjn commented Aug 19, 2024

I can get rid of the mistral template now; this PR allows passing a custom jinja template anyway. I had only added it for convenience.

deliberately chosen to be different from the official mistral template

Most likely deliberate, as older mistral models did not support a system message - so there was no official way to include one for those models.

@chiragjn
Contributor Author

Bumping this again @winglian

@NanoCode012 NanoCode012 self-requested a review August 23, 2024 14:47
Collaborator

@NanoCode012 NanoCode012 left a comment

Hey there. Thank you for the PR. We will try to get this PR in. To start off, I have some points below:

  • We can probably assume chat_template==jinja whenever chat_template_jinja is set, without requiring it to be set separately?
  • Could these new configs be added to docs/config? We may also consider adding an example to docs/dataset-formats/conversation showing how to use this chat_template method, as it was not documented before.
  • Which chat_template is being loaded, the root one or the one within ds_config? I wasn't able to tell clearly from the code. Maybe the docs need to be improved to clarify this (see the sketch below).
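
To make the last point concrete, the question is which of the two settings below takes effect when both are present (the shape mirrors the config examples later in this thread; values are placeholders):

chat_template: chatml                  # root-level setting
datasets:
  - path: LDJnr/Puffin                 # placeholder dataset
    type: chat_template
    chat_template: tokenizer_default   # dataset-level (ds_config) setting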

@NanoCode012
Collaborator

NanoCode012 commented Aug 26, 2024

Please let me know if there's any part you need help with. As a summary, I think the only parts left to be addressed are:

  • Documentation in config.qmd listing all these new configs
  • Documentation in conversation.qmd showing a few examples of this and explaining the new options
  • Addressing the comments above
  • Tests for DPO / ORPO (if not done)

@chiragjn chiragjn force-pushed the cj_tokenizer_default_prompt_template branch from f183b5e to e05fc1f Compare August 26, 2024 22:58
@chiragjn chiragjn force-pushed the cj_tokenizer_default_prompt_template branch from e05fc1f to 8a84408 Compare August 26, 2024 23:00
@chiragjn
Contributor Author

Have addressed the comments and updated the docs to the best of my knowledge. Will test DPO/ORPO.

One thing I am not clear on: what should the behavior be if someone provides different chat templates across different datasets? It causes ambiguity in selecting which template to finally save in the tokenizer. Ideally, no one should do it this way, but there is no validation to protect against it.
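
For example, nothing currently rejects a hypothetical config like the following, where the two datasets ask for different templates and it is unclear which one should end up saved with the tokenizer:

datasets:
  - path: dataset-a                    # hypothetical dataset
    type: chat_template
    chat_template: chatml
  - path: dataset-b                    # hypothetical dataset
    type: chat_template
    chat_template: tokenizer_default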

@NanoCode012
Collaborator

NanoCode012 commented Aug 27, 2024

One thing I am not clear on: what should the behavior be if someone provides different chat templates across different datasets? It causes ambiguity in selecting which template to finally save in the tokenizer. Ideally, no one should do it this way, but there is no validation to protect against it.

That's a good point. I missed this.

In this case, we may need to adjust the hash to include the chat_template (if it exists?).

f"{d.path}:{d.type}:{d.shards}:{d.conversation}{d.split}"

@levguy

levguy commented Aug 28, 2024

Hi @chiragjn ,
I'd like to point out a potential issue that might occur when training DPO with prompt strategy chat_template.default and using the base model's (or custom) chat template.
In many open source models (e.g. Phi-3-medium-128k-instruct), the chat template validates that the first message has the user role, and raises an exception otherwise.
Consider the code here:

        result["chosen"] = tokenizer.apply_chat_template(
            [chosen],
            add_generation_prompt=False,
            chat_template=chat_template_string,
            tokenize=False,
        )

We pass [chosen] here, which is a list containing a single message with the assistant role. Hence the validation mentioned above will raise an exception.

This can be solved by adding a dummy user message, e.g.:

        # prepend a placeholder user turn so templates that require the first
        # message to come from the user do not raise an exception
        dummy_user_message = {"role": "user", "content": "dummy"}
        result["chosen"] = tokenizer.apply_chat_template(
            [dummy_user_message, chosen],
            add_generation_prompt=False,
            chat_template=chat_template_string,
            tokenize=False,
        )

For trimming the dummy message, no additional code is needed, as it will be trimmed by the lines that follow:

        chosen_strip_index = result["chosen"].find(chosen["content"])
        result["chosen"] = result["chosen"][chosen_strip_index:].rstrip()

These lines remove what comes before the actual message content (typically an assistant header / special token).

The same workaround is needed in the rejected case as well (here).

@NanoCode012
Collaborator

Hello @levguy, thank you for the find. Would it be better to separate that into a different PR to limit the scope of this one? I think we would want to get this PR in first.

@chiragjn, may I ask if there is anything blocking this PR now? I believe you have addressed the points I raised. Only the hash of the tokenized dataset remains.

chat_template: Union[
    ChatTemplate,
    Annotated[str, StringConstraints(pattern="^tokenizer_default_fallback_")],
] = ChatTemplate.chatml
Collaborator

@NanoCode012 NanoCode012 Sep 5, 2024

Does this chatml default also need to be changed to tokenizer_default?

I ran with the model below, which has no chat_template in its config, and it defaulted to processing into chatml format.

base_model: mlx-community/gemma-2-2b

datasets:
  - path: LDJnr/Puffin
    type: chat_template

@NanoCode012
Collaborator

I also notice that the loss masking may have some issues for chat_template (likely not due to this PR, but an old issue).

For example, it sometimes does not detect the assistant portion correctly.

# user masked correctly
(-100, 235322) |(-100, 235371) im(-100, 571) _(-100, 235298) start(-100, 2997) |>(-100, 73786) user(-100, 1645) 
(-100, 108) Each(-100, 7639)  problem(-100, 3210)  consists(-100, 13569)  of(-100, 576)  three(-100, 2149)
...
# assistant also masked
 <(-100, 235322) |(-100, 235371) im(-100, 571) _(-100, 235298) start(-100, 2997) |>(-100, 73786) assistant(-100, 105776) 
(-100, 108) If(-100, 2495)  the(-100, 573)  third(-100, 4906)  statement(-100, 6218)  is(-100, 603)  true(-100, 1382)  ((-100, 591) Bananas(-100, 217061)  cost(-100, 2458) 

Config to recreate:

base_model: mlx-community/gemma-2-2b

datasets:
  - path: LDJnr/Puffin
    type: chat_template
    chat_template: chatml
    shards: 10
chat_template: chatml

@chiragjn
Contributor Author

chiragjn commented Sep 5, 2024

Sorry, I did not mean to block this PR.
It is just that I have been busy at my main job. I wanted to address the DPO issues before merging; I'll make some time for this.

@fozziethebeat
Contributor

I'm excited to see this land!


Successfully merging this pull request may close these issues.

Using native chat_template from tokenizer config in chat_template strategy
8 participants