Feature request: loading a record set as a pandas dataframe #706

Open
ogrisel opened this issue Jun 26, 2024 · 4 comments

Comments


ogrisel commented Jun 26, 2024

If the file object is a CSV, TSV, or Parquet file, mlcroissant already uses pandas in its internals. However, I could not find any public API to fetch a record set as a pandas dataframe.

After a bit of tweaking, the closest thing I could achieve with the public API was:

```python
import pandas as pd
import mlcroissant as mlc

dataset_url = "..."
record_set_name = "..."

dataset = mlc.Dataset(dataset_url)
df = pd.DataFrame.from_records(list(dataset.records(record_set_name)))
```

but it seems incredibly inefficient for many reasons:

  • we need to allocate a temporary list because the Records iterable has no `__len__` attribute: this means we allocate a lot of memory to temporarily store all those records as a list of dicts of Python objects before they can be loaded efficiently into the pandas dataframe;
  • the records iterable generates many temporary Python scalar objects (`str`, `int`, `float`, ...) that are garbage-collected once consumed by `pd.DataFrame.from_records`: this causes a lot of unnecessary overhead from the Python GC housekeeping many small objects for no good reason;
  • the intermediate Python objects in the records do not preserve the original dtype information (`int32` vs `int64` vs `uint8`..., or nominal/ordinal categorical dtypes), so the resulting dataframe might lose important side information for downstream tasks. Some of this information (e.g. categorical dtype info) might be present in the dataset's `.metadata` attribute, but retyping the dataframe columns from it requires extra effort and is yet another source of inefficiency.

All of those problems would vanish if there were a way to access the underlying internal pandas dataframe whenever a given record set is backed by a single file object read by pandas.
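To make the dtype-loss and memory concerns above concrete, here is a minimal sketch in plain pandas (no mlcroissant API involved; the CSV content and dtypes are made up for illustration). It contrasts parsing a file directly, where dtype information can be preserved, with the record-by-record workaround, where pandas has to re-infer dtypes from Python objects:

```python
import io
import pandas as pd

# Made-up CSV standing in for a record set backed by a single file object.
csv_data = "label,value\na,1\nb,2\na,3\n"

# Direct path: pandas parses the file itself and dtypes can be set up front.
df_direct = pd.read_csv(
    io.StringIO(csv_data),
    dtype={"label": "category", "value": "int32"},
)

# Workaround path: every record is first materialized as a Python dict,
# so pandas falls back to inferred dtypes (object / int64).
records = [
    {"label": "a", "value": 1},
    {"label": "b", "value": 2},
    {"label": "a", "value": 3},
]
df_records = pd.DataFrame.from_records(records)

print(df_direct.dtypes)   # label: category, value: int32
print(df_records.dtypes)  # label: object,   value: int64
```

The record-based path also allocates one dict and several boxed scalars per row before the dataframe even exists, which is exactly the GC overhead described in the second bullet.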

@ogrisel ogrisel changed the title Feature request: loading a records as a pandas dataframe Feature request: loading a record set as a pandas dataframe Jun 26, 2024
@marcenacp
Contributor

@ogrisel Thanks for creating the issue! It's a great feature.

The API doesn't exist yet. I agree it could easily work for small datasets (i.e. backed by a single file) without joins.


ogrisel commented Jun 26, 2024

Even for datasets with multiple record sets, it would be nice to allow the user to retrieve each of them as a dataframe and let them use pandas to compute merges or aggregations as they want.
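For example, the user-side workflow could look like the following sketch, where the two record-set dataframes are faked with literal data (the record-set names and columns are hypothetical, not from any real dataset):

```python
import pandas as pd

# Hypothetical record sets, e.g. an "images" set and an "annotations" set,
# each retrieved as its own dataframe.
images = pd.DataFrame(
    {"image_id": [1, 2, 3], "path": ["a.png", "b.png", "c.png"]}
)
annotations = pd.DataFrame(
    {"image_id": [1, 1, 3], "label": ["cat", "dog", "cat"]}
)

# The user performs the join themselves with ordinary pandas operations.
merged = images.merge(annotations, on="image_id", how="left")
print(merged)
```

This keeps mlcroissant out of the join logic entirely: each record set only needs a single-file-backed dataframe accessor, and pandas handles the rest.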


ogrisel commented Jun 26, 2024

I agree, though, that for record sets backed by multiple file objects this would be more challenging or impossible to achieve.


shreyanmitra commented Jul 29, 2024

@marcenacp Any updates on this yet? I am working on a personal project that would be easier to implement with this feature. Otherwise, I'll have to write it myself, which I don't want to do if a solution already exists. :)
