reading wikiart captions from jsonl files #28

Open
ajie6666 opened this issue Sep 6, 2024 · 7 comments

Comments

ajie6666 commented Sep 6, 2024

Hello, I want to read the captions for the wikiart portion of the latest JSONL file. I am using the following code, but it fails to read them.
##################################################
import json
import jsonlines
import pprint

with open('./json_files/StyleGallery.jsonl') as file:
    for line in jsonlines.Reader(file):
        if "img_file" in line:
            pprint.pprint(line["img_file"])

Contributor

Jeoyal commented Sep 6, 2024

import json

with open("./StyleGallery.jsonl", 'r') as f:
    datas = f.readlines()
    for data in datas:
        data = json.loads(data)
        print(data['content_prompt'])

Author

ajie6666 commented Sep 6, 2024

Thanks for your reply. This outputs the tags for every entry, so how can I tell whether an entry belongs to the wikiart dataset? MultiGen-20M and JourneyDB entries also have "content_prompt".

Contributor

Jeoyal commented Sep 6, 2024

The word "wikiart" should appear in data["image_file"].
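
A minimal sketch of that filter, assuming each line of the file is a JSON object with a "content_prompt" and an image path field. The exact key name is an assumption here ("image_file" or "img_file"), so the hypothetical helper wikiart_prompts tries both. The sample records are made up for illustration and are not taken from StyleGallery.jsonl.

```python
import json

# Hypothetical sample lines; the real StyleGallery.jsonl fields may differ.
SAMPLE_LINES = [
    '{"img_file": "wikiart/a.jpg", "content_prompt": "a portrait"}',
    '{"img_file": "JourneyDB/b.jpg", "content_prompt": "a city"}',
    '{"image_file": "data/wikiart/c.jpg", "content_prompt": "a landscape"}',
]


def wikiart_prompts(lines, path_keys=("image_file", "img_file")):
    """Yield content_prompt for records whose image path mentions 'wikiart'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Use the first path key that is present; fall back to an empty path.
        path = next((record[k] for k in path_keys if k in record), "")
        if "wikiart" in path.lower():
            yield record.get("content_prompt")


prompts = list(wikiart_prompts(SAMPLE_LINES))
print(prompts)
```

With a real file, the same helper could be fed the open file object instead of SAMPLE_LINES, since iterating a text file yields one line at a time.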

Author

ajie6666 commented Sep 6, 2024

I see what you mean, I tried the following code:
##################################################
import json
import pprint

with open("./json_files/StyleGallery.jsonl", 'r') as f:
    datas = f.readlines()
    for data in datas:
        data = json.loads(data)
        if "wikiart" in data['image_file']:
            pprint.pprint(data['content_prompt'])
###############################################
But I'm getting an error: KeyError: 'image_file'
So I changed "image_file" to "img_file":
###############################################
import json
import pprint

with open("./json_files/StyleGallery.jsonl", 'r') as f:
    datas = f.readlines()
    for data in datas:
        data = json.loads(data)
        if "wikiart" in data['img_file']:
            pprint.pprint(data['content_prompt'])
#################################################
I also got an error :
Traceback (most recent call last):
  File "read_jsonfile.py", line 30, in <module>
    data = json.loads(data)
  File "/opt/conda/envs/styleshot/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/envs/styleshot/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/envs/styleshot/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 806 (char 805)
#############################################################
T-T
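
For reference, "Invalid control character" is what Python's json module raises when a raw control character (such as a literal tab) sits unescaped inside a JSON string. One possible way to diagnose or work around it is sketched below. It assumes the problem is a stray control character within a single line; the helper name load_jsonl_tolerant and the sample records are hypothetical.

```python
import json


def load_jsonl_tolerant(lines):
    """Parse JSONL lines; return (records, line numbers that failed to parse)."""
    records, bad = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            # strict=False allows raw control characters inside JSON strings.
            records.append(json.loads(line, strict=False))
        except json.JSONDecodeError:
            bad.append(lineno)  # remember which line is malformed
    return records, bad


# Hypothetical sample data, not taken from StyleGallery.jsonl.
sample = [
    '{"img_file": "wikiart/a.jpg", "content_prompt": "a\tportrait"}',  # literal tab in string
    '{"img_file": "wikiart/b.jpg", "content_prompt": "ok"}',
    '{"img_file": "broken',  # truncated record
]
records, bad = load_jsonl_tolerant(sample)
print(len(records), bad)
```

This only tolerates control characters that stay within one line. If a record contains a raw newline, reading line by line splits it in two and the file itself has to be fixed.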

Contributor

Jeoyal commented Sep 6, 2024

Hi, I found something wrong in our StyleGallery.jsonl. I will upload a corrected version soon.

Contributor

Jeoyal commented Sep 6, 2024

It might take two hours.

Contributor

Jeoyal commented Sep 6, 2024

Hi, I have updated it here.
