[NeurIPS] How to "ingest" multiple datasets made of .xz files as data/samples and space-separated .txt files as ground truth #693

4ndr3aR · 2024-06-11T16:51:22Z

Hi! I think I have a problem similar to #679. Our dataset contains .xz files as data/samples (just point-cloud text files saved through np.savetxt(fd, points, fmt='%.5e'), then xz-compressed for efficiency) and .txt files as GT (using spaces as separators). The first line of each GT file is a sort of header: a single int giving the number of lines that must be read from the file. It is followed by an arbitrary number of lines containing 7 or 8 columns, again space-separated.

I honestly don't see an "easy" way to represent all this in Croissant in a meaningful way, or at least I can't understand how to proceed. Since it's a tool designed to "ingest" ML datasets, I would have expected a vocabulary closer to the discipline (dataset, subset, sample, ground truth, etc.).

I've uploaded two single file_objects, one as GT and one as data/sample. The interface then asks me for the names of the fields and lets me specify a regular expression (which is actually a good idea, e.g. to grab the header/number of lines), but it gives no feedback about what is really happening or what would happen with a given input.

I think I'll give up for the moment. The idea is good, but the tool doesn't seem usable yet, at least not for non-standard cases like our dataset.
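For reference, the two file layouts described above can be read in plain Python roughly as follows. This is only a sketch of the format, not a Croissant description: the function names `load_points` and `load_ground_truth` are hypothetical, and it assumes the GT columns are all numeric and that lines beyond the count given in the header are ignored.

```python
import lzma

import numpy as np


def load_points(xz_path):
    """Read an xz-compressed text point cloud (one point per line)."""
    # lzma.open in text mode yields decoded lines that np.loadtxt can parse.
    with lzma.open(xz_path, "rt") as fd:
        return np.loadtxt(fd, ndmin=2)


def load_ground_truth(txt_path):
    """Read a GT file: first line is a row count, then 7- or 8-column rows."""
    with open(txt_path) as fd:
        n_rows = int(fd.readline().strip())
        rows = []
        for i in range(n_rows):
            # Assumption: every column is numeric and space-separated.
            values = [float(v) for v in fd.readline().split()]
            if len(values) not in (7, 8):
                raise ValueError(f"row {i}: expected 7 or 8 columns, got {len(values)}")
            rows.append(values)
    # Rows may have different lengths, so a list of lists is returned
    # rather than a rectangular array.
    return rows
```

The variable row width (7 or 8 columns) and the leading count line are the two parts that do not map onto a plain CSV-style record description, which is why the rows are kept as a ragged list here.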

pierrot0 self-assigned this Jun 12, 2024
