Can a custom dataset/tokenizer class be used without forking the project and manually splicing it in? #452

Answered by mitchellnw
mkaic asked this question in Q&A
Hello! Regarding tokenizers: if you add a new model config under src/open_clip/model_configs and point it to a tokenizer on Hugging Face, that should work (see https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/factory.py#L76).
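
To make the tokenizer route concrete, here is a minimal sketch that writes such a config file. The field names (`embed_dim`, `vision_cfg`, `text_cfg`, `hf_model_name`, `hf_tokenizer_name`) mirror what I recall from the Hugging Face based configs already in the repo, so treat them as assumptions and compare against an existing file in src/open_clip/model_configs before relying on them. The model name `my-custom-model`, the helper `write_model_config`, and the choice of `roberta-base` are purely illustrative.

```python
import json
import tempfile
from pathlib import Path


def write_model_config(config_dir: Path, model_name: str, hf_name: str) -> Path:
    """Write a model config JSON that points the text tower at a Hugging Face tokenizer.

    The keys below are assumed from existing open_clip configs; verify them
    against a real file in src/open_clip/model_configs.
    """
    config = {
        "embed_dim": 512,
        "vision_cfg": {
            "image_size": 224,
            "layers": 12,
            "width": 768,
            "patch_size": 32,
        },
        "text_cfg": {
            "hf_model_name": hf_name,       # Hugging Face text encoder
            "hf_tokenizer_name": hf_name,   # Hugging Face tokenizer to load
        },
    }
    path = config_dir / f"{model_name}.json"
    path.write_text(json.dumps(config, indent=4))
    return path


# Hypothetical usage: a temporary directory stands in for model_configs.
config_dir = Path(tempfile.mkdtemp())
config_path = write_model_config(config_dir, "my-custom-model", "roberta-base")
loaded = json.loads(config_path.read_text())
print(config_path.name, loaded["text_cfg"]["hf_tokenizer_name"])
```

Once a file like this sits in the model_configs directory, the config's file name (without `.json`) is what you would pass as the model name. Whether the config directory can be extended at runtime instead of editing the package depends on your installed version, so check factory.py for that.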

Regarding datasets: if you can convert your dataset to webdataset format or CSV, it is supported via the train data flag. Otherwise, you'll have to add it manually for now.
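
For the CSV route, the conversion can be as small as the sketch below, which dumps (image path, caption) pairs into a delimited file. The column names `filepath` and `title` and the tab separator are what I believe the training script defaults to, but that is an assumption; check the training script's argument parser for the exact flags and defaults. The helper name `write_clip_csv` and the sample rows are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_clip_csv(pairs, out_path, img_key="filepath", caption_key="title", sep="\t"):
    """Write (image_path, caption) pairs to a delimited file with a header row.

    Column names and separator are assumed defaults; adjust them to match
    the options of your training command.
    """
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=sep)
        writer.writerow([img_key, caption_key])
        writer.writerows(pairs)
    return out_path


# Hypothetical sample data standing in for a custom dataset.
pairs = [
    ("images/cat.jpg", "a photo of a cat"),
    ("images/dog.jpg", "a photo of a dog"),
]
out_path = Path(tempfile.mkdtemp()) / "train.csv"
write_clip_csv(pairs, out_path)

# Read the file back to confirm the layout.
with open(out_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))
print(rows[0], len(rows) - 1, "samples")
```

The resulting file path is what you would hand to the train data flag, along with whatever option your version uses to select the CSV dataset type.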

Replies: 1 comment 1 reply

Answer selected by mkaic