Joins as described by Croissant Format Specification are not supported by mlcroissant python library. #683

Open
AdrianUrbanski opened this issue Jun 7, 2024 · 1 comment

@AdrianUrbanski

Hello! In the end I managed to create a working JSON file, but I am still opening this issue to document possible inconsistencies in the Croissant Format Specification.

Our dataset consists of images and two corresponding masks, all in different directories contained in a single zip file.
I decided that the best way to specify the records is to create three separate record sets that can be joined using references. Images and masks share the same filename, which can be used to match them. I managed to extract both the images and their filenames successfully.
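
To illustrate the join described above, here is a minimal sketch in plain Python. It does not use mlcroissant at all; the record layout, the key names (`filename`, `content`) and the function `join_on_filename` are assumptions made up for this example.

```python
def join_on_filename(images, masks_a, masks_b):
    """Inner-join three lists of records on their 'filename' key.

    Hypothetical helper: each record is assumed to be a dict with
    'filename' and 'content' keys.
    """
    a_by_name = {record["filename"]: record for record in masks_a}
    b_by_name = {record["filename"]: record for record in masks_b}
    joined = []
    for image in images:
        name = image["filename"]
        # Keep only images that have both masks.
        if name in a_by_name and name in b_by_name:
            joined.append(
                {
                    "filename": name,
                    "image": image["content"],
                    "mask_a": a_by_name[name]["content"],
                    "mask_b": b_by_name[name]["content"],
                }
            )
    return joined


# Toy data standing in for the three record sets.
images = [
    {"filename": "001.png", "content": b"image-1"},
    {"filename": "002.png", "content": b"image-2"},
]
masks_a = [
    {"filename": "001.png", "content": b"mask-a-1"},
    {"filename": "002.png", "content": b"mask-a-2"},
]
masks_b = [
    {"filename": "001.png", "content": b"mask-b-1"},
]

rows = join_on_filename(images, masks_a, masks_b)
```

With this toy data only `001.png` appears in all three lists, so `rows` holds a single joined record. This is the behaviour the references between record sets are meant to express.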

My problem is that simply including references="images/filename" or references="#{images/filename}" in the fields corresponding to the masks' filenames raises AttributeError: 'str' object has no attribute 'uuid'. Is it possible to convert filename strings into objects that have the "uuid" attribute using the mlcroissant Python API? I did not find any reference documentation for the API aside from the notebooks in recipes, which unfortunately do not cover the references functionality.

I then tried manually adding the keys and references to the JSON generated by the Python library, according to the specification found here by:

I also tried including "references": {"@id": "images/filename"}, as suggested in the specification here. This in turn raises AttributeError: '_MISSING_TYPE' object has no attribute 'uuid'.

What worked in the end was adding "references": {"field": {"@id": "images/filename"}}, which was suggested in #651, but is not specified anywhere in the Croissant Format Specification.
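
For comparison, the two JSON shapes mentioned above can be written side by side. This is only a sketch that builds the fragments as Python dicts with the standard `json` module; it does not call mlcroissant. The field id `masks/filename`, the `@type` and `dataType` values, and the helper `mask_filename_field` are assumptions added for illustration. Only the two `references` values come from this issue.

```python
import json

# Join target: the filename field of the images record set (id from the issue).
TARGET_FIELD_ID = "images/filename"


def mask_filename_field(references):
    """Return a hypothetical field entry for a mask record set's filename."""
    return {
        "@type": "cr:Field",       # assumed, not taken from the issue
        "@id": "masks/filename",   # hypothetical id
        "dataType": "sc:Text",     # assumed
        "references": references,
    }


# Shape reported above to fail with:
# AttributeError: '_MISSING_TYPE' object has no attribute 'uuid'
failing = mask_filename_field({"@id": TARGET_FIELD_ID})

# Shape reported above to work (suggested in #651):
# the target id is wrapped in a "field" object.
working = mask_filename_field({"field": {"@id": TARGET_FIELD_ID}})

print(json.dumps(working, indent=2))
```

The only difference between the two is the extra "field" level inside "references".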

@gsaluja9

I stumbled onto this issue via a search engine. I can confirm that I ran into this exact situation.
None of the options in the docs describe this way of specifying the references property.

I am just getting started with the Croissant library and am trying to create a Croissant JSON file for a toy(ish) dataset to explore the possibilities.
