[NeurIPS] Data type extension (Video and multichannel time series ) #690

Open
pollytur opened this issue Jun 11, 2024 · 3 comments

pollytur commented Jun 11, 2024

It seems that videos are not supported in the data types (here it's only the Image object, while rdflib.namespace.SDO has a separate VideoObject).

Also, I have a custom data type: technically, it is a multidimensional time series (n channels $\times$ timepoints), so rdflib.namespace.SDO.ListItem would probably be the best fit for it. For now, is it fine to use AudioObject for it?

Thanks a lot in advance!


Related to #371

@pierrot0
Contributor

Thanks for reaching out!

Is the bug about the Croissant spec or about the mlcroissant Python library?

One should be able to describe a dataset containing videos using Croissant, similar to what is done in https://github.com/mlcommons/croissant/blob/main/datasets/1.0/audio_test/metadata.json (replacing sc:AudioObject with sc:VideoObject and audio/mpeg with video/mpeg, for example).
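As a rough illustration of that substitution, here is a minimal Python sketch using only the standard library. The `audio_fragment` dict is a hypothetical, heavily trimmed stand-in for an audio-style Croissant file (it is not a copy of the linked metadata.json), and `audio_to_video` is a helper name invented for this example.

```python
import json

# The two substitutions suggested above.
REPLACEMENTS = {
    "sc:AudioObject": "sc:VideoObject",
    "audio/mpeg": "video/mpeg",
}


def audio_to_video(node):
    """Return a copy of a JSON-like structure with audio types swapped for video types."""
    if isinstance(node, dict):
        return {key: audio_to_video(value) for key, value in node.items()}
    if isinstance(node, list):
        return [audio_to_video(item) for item in node]
    if isinstance(node, str):
        # Only exact matches are replaced; other strings pass through unchanged.
        return REPLACEMENTS.get(node, node)
    return node


# Hypothetical fragment: names, ids, and the glob pattern are placeholders.
audio_fragment = {
    "distribution": [
        {
            "@type": "cr:FileSet",
            "@id": "files",
            "encodingFormat": "audio/mpeg",
            "includes": "data/*.mp3",
        }
    ],
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "name": "records",
            "field": [
                {"@type": "cr:Field", "name": "content", "dataType": "sc:AudioObject"}
            ],
        }
    ],
}

video_fragment = audio_to_video(audio_fragment)
print(json.dumps(video_fragment, indent=2))
```

File patterns such as `includes` are not touched by this helper and would still need to be updated by hand to match the actual video files.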

It is possible, however, that libraries (including the mlcroissant library) do not support videos at the moment.

As in #696, when the data is stored in a file format that is not supported, we advise creating a Croissant dataset that specifies the dataset-level information and the resources, while omitting RecordSets that contain data stored in files with an unsupported format.

This would unblock you and would allow tools that can work with only such metadata to already support your dataset (e.g., index the dataset, download the raw data), while also giving the Croissant contributors a signal about which formats to support first, in the spec and/or the various implementations.
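A resource-only description of that kind might look roughly like the sketch below, written as a Python dict and dumped to JSON. Everything specific in it is an assumption made for illustration: the dataset name, URLs, archive layout, and file pattern are placeholders, and the `@context` block is left out for brevity (it would need to be copied from an existing Croissant 1.0 example before validating).

```python
import json

# Hypothetical dataset-level metadata with resources only.
metadata = {
    "@type": "sc:Dataset",
    "name": "my_video_dataset",  # placeholder name
    "description": "Videos plus multichannel time series recordings.",
    "url": "https://example.com/my_video_dataset",  # placeholder URL
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "archive",
            "name": "archive",
            "contentUrl": "https://example.com/my_video_dataset/videos.zip",
            "encodingFormat": "application/zip",
            "sha256": "<fill in>",  # placeholder checksum
        },
        {
            "@type": "cr:FileSet",
            "@id": "videos",
            "name": "videos",
            "containedIn": {"@id": "archive"},
            "encodingFormat": "video/mpeg",
            "includes": "*.mpg",  # placeholder pattern
        },
    ],
    # Deliberately no "recordSet" key: the video contents are not described
    # as records because the format may not be supported by tools yet.
}

print(json.dumps(metadata, indent=2))
```

Tools that only need the dataset-level fields and the `distribution` entries can work from this, and RecordSets can be added later once the format is supported.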

Please let us know if there are problems with defining such an incomplete Croissant definition and we will look into it.

@pierrot0
Contributor

OK, I see that the checker raises an error in case of an unknown MIME type. We should extend that list and add a flag to allow unknown MIME types; we'll try to add that shortly.

@pierrot0
Contributor

OK, so I ran validation (e.g., mlcroissant validate --jsonld ../../datasets/1.0/titanic/metadata.json) on a Croissant config containing an unknown encoding format, and it did not raise an error. In particular, the following branch was not triggered, nor was any other error raised:

    else:
        raise ValueError(
            f"Unsupported encoding format for file: {encoding_format}"
        )

Looking at the code, it seems to me that validation should work fine.
Do you have a command line that would reproduce a failure to validate a Croissant file due to an unknown encodingFormat?
