how to use this data toolkit #1

Open
xiaocaijizzz opened this issue Dec 9, 2021 · 2 comments

@xiaocaijizzz

[background]
I want to use my own text-image dataset to generate a binary-format dataset for CogView training ('https://github.com/THUDM/CogView'). That repo mentions that the authors used this cogdata toolkit to preprocess their data.

[question]
Could you please tell me how to organize my raw text-image dataset, and then how to use the cogdata toolkit to generate the target bin file? For example, should I give the two files of a text-image pair the same name, such as 'a dog sits on the ground.txt' and 'a dog sits on the ground.png', or should I use some other layout?

@Sleepychord
Owner

I am too busy to write a tutorial. @yzy-thu, can you help him?

@yzy-thu
Collaborator

yzy-thu commented Dec 11, 2021

@xiaocaijizzz
More detailed documentation is available at 'https://sleepychord.github.io/cogdata/build/html/index.html'.
For example, when creating a dataset you can pass '--data_format TarDataset --data_files path_to_your_tar' or '--data_format ZipDataset --data_files path_to_your_zip'.
The images inside the zip should be named like '1.jpg, 2.jpg, ...'.
For the text, I recommend '--text_format dict --text_files path_to_your_txt'.
The text file should look like: "{'1':'a dog sits on the ground', '2':'cat', ...}", where the keys correspond to the image names.
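
For anyone starting from caption-named pairs as in the question, here is a minimal sketch of how the layout described above could be produced. It is not part of cogdata: the function name 'build_cogdata_inputs' and the output file names are made up for illustration. It also assumes that the text file can be written as JSON and that each key must equal the image's file name without the extension; please check both against the documentation linked above.

```python
import json
import zipfile
from pathlib import Path


def build_cogdata_inputs(pairs, zip_path, text_path):
    """Pack (image_path, caption) pairs into a numbered zip and a caption dict.

    Hypothetical helper, not a cogdata API. Images are stored in the zip
    as '1.<ext>', '2.<ext>', ... and the text file maps '1', '2', ... to
    the captions.
    """
    captions = {}
    with zipfile.ZipFile(zip_path, "w") as archive:
        for index, (image_path, caption) in enumerate(pairs, start=1):
            image_path = Path(image_path)
            key = str(index)
            # Rename the image to its number, keeping the original extension.
            archive.write(image_path, arcname=f"{key}{image_path.suffix}")
            captions[key] = caption
    # Assumption: the 'dict' text format accepts a JSON-encoded mapping.
    Path(text_path).write_text(
        json.dumps(captions, ensure_ascii=False), encoding="utf-8"
    )
    return captions
```

Called with a list such as '[("a dog sits on the ground.png", "a dog sits on the ground")]', it writes one zip for '--data_files' and one text file for '--text_files'.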
