Add multi-modal method(s) #1573

Open
trawler0 opened this issue Jul 6, 2024 · 1 comment

Comments


trawler0 commented Jul 6, 2024

Hello guys,
Thanks for this amazing repo, it has been very useful for me.
I wanted to ask whether there is interest in implementing methods like CLIP for image-language pretraining.
I understand that this might not be your current focus and that web-scale pretraining might be out of reach. However, the paper https://arxiv.org/abs/2305.08675 shows that one can reach relatively high zero-shot accuracies with effort roughly comparable to ImageNet pretraining.
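
For reference, the training objective itself is small: CLIP uses a symmetric contrastive loss over paired image/text embeddings. A minimal PyTorch sketch (function name and default temperature are illustrative only, not taken from any particular implementation):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so the dot products below are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix; matching image/text pairs sit on the diagonal.
    logits = image_embeds @ text_embeds.t() / temperature

    # Symmetric cross-entropy: each image should retrieve its own caption and vice versa.
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_img_to_txt = F.cross_entropy(logits, targets)
    loss_txt_to_img = F.cross_entropy(logits.t(), targets)
    return (loss_img_to_txt + loss_txt_to_img) / 2
```

So most of the work is really on the data-loading and text-encoder side rather than the loss.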

guarin (Contributor) commented Jul 8, 2024

Hi!
Multi-modal is definitely something we would like to incorporate. Two main components are still missing for this: data loading for text, and NLP models/tokenizers. For both we have to decide how to support them. This was quite easy for vision because data loading is fairly standardized and the models are in torchvision; for text the landscape is more diverse, so we'll have to compare the libraries first. Please let us know if you have any suggestions/inputs!
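
For illustration, a paired image-caption dataset could look roughly like the sketch below. It assumes Hugging Face transformers for tokenization and torchvision for the image side, purely as one possible option rather than a decided design:

```python
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms
from transformers import AutoTokenizer

class ImageTextDataset(Dataset):
    def __init__(self, samples, tokenizer_name="openai/clip-vit-base-patch32"):
        # `samples` is a list of (image_path, caption) tuples.
        self.samples = samples
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.transform = transforms.Compose([
            transforms.Resize(224),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_path, caption = self.samples[idx]
        image = self.transform(Image.open(image_path).convert("RGB"))
        # Fixed-length padding so a default DataLoader can batch the text tensors.
        tokens = self.tokenizer(
            caption, padding="max_length", truncation=True, return_tensors="pt"
        )
        return image, tokens["input_ids"].squeeze(0), tokens["attention_mask"].squeeze(0)
```

A standard DataLoader can batch these tensors directly; whichever text encoder is chosen would then consume the input_ids and attention_mask.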
