Feature request: loading a record set as a pandas dataframe #706

Open
ogrisel opened this issue Jun 26, 2024 · 4 comments

Comments


ogrisel commented Jun 26, 2024

If the file object is a CSV, TSV, or Parquet file, mlcroissant already uses pandas in its internals. However, I could not find any public API to fetch a record set as a pandas dataframe.

After a bit of tweaking, the closest thing I could achieve with the public API was:

```python
import pandas as pd
import mlcroissant as mlc

dataset_url = "..."
record_set_name = "..."

dataset = mlc.Dataset(dataset_url)
df = pd.DataFrame.from_records(list(dataset.records(record_set_name)))
```

but it seems incredibly inefficient for many reasons:

  • we need to allocate a temporary list because the Records iterable has no `__len__` attribute: this means we allocate a lot of memory to temporarily store all those records as a list of dicts of Python objects before they can be loaded efficiently into the pandas dataframe;
  • the records iterable generates many temporary Python scalar objects (`str`, `int`, `float`, ...) that are garbage-collected once consumed by `pd.DataFrame.from_records`: this causes a lot of unnecessary overhead from the Python GC housekeeping many small objects for no good reason;
  • the intermediate Python objects in the records do not preserve the original dtype information (`int32` vs `int64` vs `uint8`..., or nominal/ordinal categorical dtypes), so the resulting dataframe might lose important side information for downstream tasks. Some of this information (e.g. categorical dtype info) might be present in the dataset's `.metadata` attribute, but retyping the dataframe columns from it requires extra effort and is yet another source of inefficiency.

All of those problems would vanish if there were a way to access the underlying internal pandas dataframe whenever a given record set is backed by a single file object read by pandas.
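To make the dtype-loss and memory concerns above concrete, here is a minimal sketch in plain pandas (no mlcroissant API involved; the CSV content and dtypes are made up for illustration). It contrasts parsing a file directly, where dtype information can be preserved, with the record-by-record workaround, where pandas has to re-infer dtypes from Python objects:

```python
import io
import pandas as pd

# Made-up CSV standing in for a record set backed by a single file object.
csv_data = "label,value\na,1\nb,2\na,3\n"

# Direct path: pandas parses the file itself and dtypes can be set up front.
df_direct = pd.read_csv(
    io.StringIO(csv_data),
    dtype={"label": "category", "value": "int32"},
)

# Workaround path: every record is first materialized as a Python dict,
# so pandas falls back to inferred dtypes (object / int64).
records = [
    {"label": "a", "value": 1},
    {"label": "b", "value": 2},
    {"label": "a", "value": 3},
]
df_records = pd.DataFrame.from_records(records)

print(df_direct.dtypes)   # label: category, value: int32
print(df_records.dtypes)  # label: object,   value: int64
```

The record-based path also allocates one dict and several boxed scalars per row before the dataframe even exists, which is exactly the GC overhead described in the second bullet.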

@ogrisel ogrisel changed the title Feature request: loading a records as a pandas dataframe Feature request: loading a record set as a pandas dataframe Jun 26, 2024
@marcenacp
Contributor

@ogrisel Thanks for creating the issue! It's a great feature.

The API doesn't exist yet. I agree it could easily work for small datasets (i.e. backed by a single file) without joins.


ogrisel commented Jun 26, 2024

Even for datasets with multiple record sets, it would be nice to allow the user to retrieve each of them as a dataframe and let them use pandas to compute merges or aggregations as they want.
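For example, the user-side workflow could look like the following sketch, where the two record-set dataframes are faked with literal data (the record-set names and columns are hypothetical, not from any real dataset):

```python
import pandas as pd

# Hypothetical record sets, e.g. an "images" set and an "annotations" set,
# each retrieved as its own dataframe.
images = pd.DataFrame(
    {"image_id": [1, 2, 3], "path": ["a.png", "b.png", "c.png"]}
)
annotations = pd.DataFrame(
    {"image_id": [1, 1, 3], "label": ["cat", "dog", "cat"]}
)

# The user performs the join themselves with ordinary pandas operations.
merged = images.merge(annotations, on="image_id", how="left")
print(merged)
```

This keeps mlcroissant out of the join logic entirely: each record set only needs a single-file-backed dataframe accessor, and pandas handles the rest.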


ogrisel commented Jun 26, 2024

I agree, though, that for record sets backed by multiple file objects this would be more challenging or impossible to achieve.


shreyanmitra commented Jul 29, 2024

@marcenacp Any updates on this yet? I am working on a personal project that would be easier to implement with this feature. Otherwise, I'll have to write it myself, which I don't want to do if a solution already exists. :)
