Allow using tokenizer's default chat template or pass custom jinja chat template #1732
Conversation
On a separate note, I would also like to explore how we can support passing a jinja template entirely via the config itself, which might be more flexible.
src/axolotl/utils/chat_templates.py (outdated):

_DEFAULT_FALLBACK_CHATML_TEMPLATE_CHOICE_PREFIX = "tokenizer_default_fallback_"

def chat_templates(user_choice: str, tokenizer=None):
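From the constant and signature above, a rough sketch of how the lookup might behave (an editor's illustration under those assumptions, not the PR's actual implementation; `registry` stands in for axolotl's built-in template mapping):

```python
def resolve_chat_template(user_choice: str, tokenizer=None, registry: dict = None) -> str:
    """Illustrative resolution logic for the new choices this PR discusses."""
    fallback_prefix = "tokenizer_default_fallback_"
    registry = registry or {}
    if user_choice == "tokenizer_default":
        # Use the template shipped in the tokenizer's config; fail loudly if absent.
        if tokenizer is None or not getattr(tokenizer, "chat_template", None):
            raise ValueError("tokenizer does not define a chat_template")
        return tokenizer.chat_template
    if user_choice.startswith(fallback_prefix):
        # Prefer the tokenizer's own template, otherwise fall back to a named axolotl template.
        if tokenizer is not None and getattr(tokenizer, "chat_template", None):
            return tokenizer.chat_template
        return registry[user_choice[len(fallback_prefix):]]
    # Plain choice: look up one of axolotl's built-in templates.
    return registry[user_choice]
```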
I had initially, similarly, included the tokenizer as an argument to chat_templates, but was trying to minimize this module's interdependency with tokenizers. Seems like this is a reasonable workaround for now. Thanks!
Thanks! Would you have some time to put together some tests for this?
Summary of changes:
1. Adds `tokenizer_default` as an option for `chat_template` in the `chat_template` prompt strategy, which allows using the chat template from the tokenizer's config.json
2. Allows falling back to chat templates available in axolotl if the tokenizer does not have a chat template
3. Adds a mistral chat template which supports a system message, taken from https://github.com/chujiezheng/chat_templates/blob/main/chat_templates/mistral-instruct.jinja

---

Why?
Many popular models are not trained with the chatml format. As a result, for the model to correctly learn chatml we have to turn on train_on_inputs, which requires more compute and time. If we can use the model's already learned chat template, we only need to learn the output tokens.

---

Todo:
- Write tests
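For context on the "why", here is a minimal sketch (an editor's example using the Hugging Face transformers API, not part of this PR) of what using the tokenizer's own template means; the model name is only an example of a model whose tokenizer_config.json ships a chat_template:

```python
from transformers import AutoTokenizer

# Example model; any tokenizer whose config defines a chat_template works the same way.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]

# Renders the conversation with the template the model was actually trained on,
# instead of forcing the data into chatml.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```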
Added a few tests
Yes, that is what this PR aims to do. If we all agree, we can change the default in the config to `tokenizer_default`.
Yes. I am sorry, I just read the PR in more detail and found that's exactly what I was hoping for (what a coincidence!), so I removed my previous dumb question. Thanks for the contribution!
I have one question: can this option be used together with the sharegpt conversation format?
Unfortunately no, they are different templating libraries.
Thank you for working on this! I was needing just this, and was going to either integrate my custom chat template into axolotl (which limits me on experimenting) or work on getting this feature going. Glad you beat me to it. It doesn't look like it'll conflict much with the changes I made for #1756.
Amazing, would love to see both getting merged.
Added
Bumping for a review here @winglian
Regarding point 4, it seems that the jinja template from that repo is very different from how FastChat/axolotl currently formats the prompt, and from Mistral's own templates. FastChat/axolotl currently places the system message, followed by a single newline, inside the first user turn. The one in the linked repo puts the system prompt before the first [INST]. I'm unsure if this is an oversight by the creator of that template/repo or deliberately chosen to be different from the official Mistral template.
I can get rid of the mistral template now; anyway, this PR now allows passing a custom jinja template, and I had only added it for convenience.
Most likely deliberately, as older mistral models did not support a system message, so there was no official way for older models.
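To make the difference being discussed concrete, a rough illustration based only on the descriptions in these comments (not copied from either template; the exact whitespace and tokens may differ):

```python
# How the same system + user turn might render under each convention.

# FastChat/axolotl style described above: system message inside the first user
# turn, separated from the user message by a single newline.
fastchat_style = "<s>[INST] You are a helpful assistant.\nHello! [/INST]"

# Style attributed to the linked repo's mistral-instruct.jinja: system prompt
# placed before the first [INST].
linked_repo_style = "<s>You are a helpful assistant.[INST] Hello! [/INST]"
```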
Bumping this again @winglian
Hey, there. Thank you for the PR. We will try to get this PR in. Starting off, I have some points below:
- We can assume that if `chat_template_jinja` is set, then `chat_template == jinja`, without having to set it separately, probably? (see the sketch after this list)
- Could these new configs be added to `docs/config`? We may also consider adding an example to `docs/dataset-formats/conversation` on how to use this chat_template method, as it was not documented before.
- Which `chat_template` is being loaded, the root one or the one within `ds_config`? I wasn't able to clearly tell from the code. Maybe the docs need to be improved to clarify this.
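A rough sketch of the first point above, i.e. letting `chat_template_jinja` imply `chat_template: jinja`; this is an editor's illustration of the suggested validation, not axolotl's actual config code, and the function name is made up:

```python
def normalize_chat_template_config(cfg: dict) -> dict:
    """Hypothetical normalization of the two new config fields."""
    if cfg.get("chat_template_jinja") and not cfg.get("chat_template"):
        # A raw Jinja template was supplied, so assume chat_template == "jinja".
        cfg["chat_template"] = "jinja"
    if cfg.get("chat_template") == "jinja" and not cfg.get("chat_template_jinja"):
        raise ValueError("chat_template: jinja requires chat_template_jinja to be set")
    return cfg
```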
Please let me know if there's any part you need help with. I think as a summary, the only parts left to be addressed are:
Have addressed comments and updated docs to the best of my knowledge. Will test DPO/ORPO. One thing I am not clear on: what should the behavior be if someone provides different chat templates across different datasets? It causes ambiguity in selecting which template to finally save in the tokenizer. Ideally, no one should do it this way, but there is no validation to protect against this.
That's a good point, I missed this. In this case, we may need to adjust the hash to include the chat_template (if it exists?). See axolotl/src/axolotl/utils/data/sft.py, line 143 at 17af1d7.
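As an illustration of the suggestion above, a minimal sketch of folding the chat template into the prepared-dataset hash (assuming the hash is built from concatenated config values; the real code in src/axolotl/utils/data/sft.py may differ):

```python
from hashlib import md5
from typing import Optional

def prepared_ds_hash(seed: int, sequence_len: int, tokenizer_name: str,
                     chat_template: Optional[str]) -> str:
    # Folding the chat template into the hash means switching templates
    # invalidates any previously cached prepared dataset.
    parts = [str(seed), str(sequence_len), tokenizer_name, chat_template or ""]
    return md5("|".join(parts).encode("utf-8")).hexdigest()
```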
Co-authored-by: NanoCode012 <[email protected]>
Hi @chiragjn,
We pass here … This can be solved by adding a dummy user message, e.g.:
For trimming the dummy message, no additional code is needed, as it will be trimmed by the lines that follow:
These lines remove what comes before the actual message content (typically an …). The same workaround is needed in the …
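As a stand-in for the snippets this comment refers to, a rough sketch of the dummy-user-message idea it describes (the helper name, placeholder text, and trimming logic are illustrative assumptions, not the PR's actual code):

```python
DUMMY_USER_MESSAGE = {"role": "user", "content": "[[dummy_message]]"}

def render_assistant_only(tokenizer, assistant_content: str) -> str:
    # Some chat templates misbehave when rendering a conversation without a user
    # turn, so insert a placeholder user message first.
    messages = [DUMMY_USER_MESSAGE, {"role": "assistant", "content": assistant_content}]
    rendered = tokenizer.apply_chat_template(messages, tokenize=False)
    # Drop everything before the actual assistant content, discarding the dummy turn.
    start = rendered.find(assistant_content)
    return rendered[start:] if start != -1 else rendered
```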
Hello @levguy, thank you for the find. Would it be better to separate that into a different PR to limit the scope of this PR? I think we would want to get this PR in first. @chiragjn, may I ask if there are any parts blocking this current PR now? I believe you have addressed the points I raised. Only the hash of the tokenized dataset remains.
chat_template: Union[
    ChatTemplate,
    Annotated[str, StringConstraints(pattern="^tokenizer_default_fallback_")],
] = ChatTemplate.chatml
Does this `chatml` default also need to be changed to `tokenizer_default`? I ran with the below model that has no chat_template in its config, and it defaulted to processing into `chatml` format.
base_model: mlx-community/gemma-2-2b
datasets:
  - path: LDJnr/Puffin
    type: chat_template
I also notice that the loss masking may have some issues for chat template (likely not due to this PR, but an old issue). For example, it sometimes does not detect the assistant portion correctly.
Config to recreate:
base_model: mlx-community/gemma-2-2b
datasets:
  - path: LDJnr/Puffin
    type: chat_template
    chat_template: chatml
    shards: 10
chat_template: chatml
Sorry, I did not mean to block this PR.
I'm excited to see this land!
Closes #1689
Summary of changes:
- `tokenizer_default` as option for `chat_template` in the `chat_template` prompt strategy that allows using the chat template from the tokenizer's config.json
- `chat_template_jinja`, which is only considered when the `chat_template` option is set to `jinja`
Why?
Many popular models are not trained with the chatml format. As a result, for the model to correctly learn chatml we have to turn on train_on_inputs, which requires more compute and time. If we can use the model's already learned chat template, we only need to learn the output tokens.
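As an illustration of the second change, a minimal sketch (an editor's example using the transformers API, not axolotl's code) of what passing a custom Jinja template amounts to: the user-supplied string replaces the tokenizer's built-in chat_template before rendering. The template string and model name are just examples:

```python
from transformers import AutoTokenizer

custom_jinja = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# Conceptually what chat_template: jinja plus chat_template_jinja does:
# override the tokenizer's template with the user-supplied Jinja string.
tokenizer.chat_template = custom_jinja

print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    tokenize=False,
))
```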