how to use this data toolkit #1

xiaocaijizzz · 2021-12-09T03:14:04Z

[background]
I want to use my own text-image dataset to generate a binary-format dataset for CogView training ('https://github.com/THUDM/CogView'). That repo mentions that the authors used this cogdata toolkit to preprocess their data.

[question]
Could you please tell me how to organize my raw text-image dataset, and how to use the cogdata toolkit to generate the target bin file? For example, should I give each text-image pair the same name, such as 'a dog sits on the ground.txt' and 'a dog sits on the ground.png', or should I use some other layout?

Sleepychord · 2021-12-10T17:00:09Z

I am too busy to write a tutorial. Can you help him, @yzy-thu?

yzy-thu · 2021-12-11T10:07:14Z

@xiaocaijizzz
More detailed documentation is here: 'https://sleepychord.github.io/cogdata/build/html/index.html'.
For example, when creating a dataset you can use '--data_format TarDataset --data_files path_to_your_tar' or '--data_format ZipDataset --data_files path_to_your_zip'.
Images in the zip are named like '1.jpg, 2.jpg, ...'.
Then I recommend using '--text_format dict --text_files path_to_your_txt'.
The text file looks like: "{'1':'a dog sits on the ground', '2':'cat', ...}"
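
If your raw data is a folder of same-named pairs (for example 'a dog sits on the ground.png' next to 'a dog sits on the ground.txt', with the caption stored inside the .txt), the layout described above could be produced with a small script like the one below. This is only a sketch using the Python standard library, not part of cogdata: the function name `pack_pairs` is hypothetical, and it assumes that each dict key should equal the image file name without its extension. It also writes the dict as JSON (double quotes), whereas the example above uses single quotes; which quoting cogdata accepts is not confirmed here, so check the linked documentation.

```python
import json
import zipfile
from pathlib import Path

# Image extensions to pick up; an assumption, extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pack_pairs(src_dir, zip_path, text_path):
    """Pack <name>.<image> / <name>.txt pairs into a zip plus a caption dict.

    Hypothetical helper, not a cogdata API. Images are renamed to
    1.<ext>, 2.<ext>, ... and the captions are keyed by the same numbers.
    """
    src = Path(src_dir)
    captions = {}
    # Sort so the numbering is reproducible between runs.
    images = sorted(p for p in src.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)
    with zipfile.ZipFile(zip_path, "w") as zf:
        index = 0
        for image in images:
            caption_file = image.with_suffix(".txt")
            if not caption_file.exists():
                continue  # skip images that have no caption file
            index += 1
            key = str(index)
            # Store the image under its numeric name, keeping its extension.
            zf.write(image, arcname=key + image.suffix.lower())
            captions[key] = caption_file.read_text(encoding="utf-8").strip()
    # Assumption: JSON quoting is acceptable for '--text_format dict'.
    Path(text_path).write_text(
        json.dumps(captions, ensure_ascii=False), encoding="utf-8"
    )
    return captions
```

Calling `pack_pairs("raw_pairs", "images.zip", "captions.txt")` (all three paths are placeholders) would give you one zip to pass to '--data_files' and one text file to pass to '--text_files'.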
