Feature request: loading a record set as a pandas dataframe #706
Comments
@ogrisel Thanks for creating the issue! It would be a great feature. The API doesn't exist yet; I agree it could easily work for small datasets (i.e. backed by one file) without joins.

Even for datasets with multiple record sets, it would be nice to let the user retrieve each of them as a dataframe and then use pandas to compute merges or aggregations as they want.

I agree, though, that for record sets backed by multiple file objects this would be more challenging / not possible to achieve.

@marcenacp Any updates on this yet? I am working on a personal project that would be easier to implement with this feature. Otherwise, I'll have to write it myself, which I don't want to do if a solution already exists. :)
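The per-record-set workflow suggested above can be sketched with plain pandas. The two dataframes below stand in for record sets that have already been retrieved (the retrieval API itself does not exist yet, which is the point of this issue); the merge and aggregation steps are ordinary pandas:

```python
import pandas as pd

# Stand-ins for two record sets of the same dataset, already loaded
# as dataframes (the loading API is hypothetical).
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ada", "bob", "eve"]})
events = pd.DataFrame({"user_id": [1, 1, 3], "event": ["click", "buy", "click"]})

# Join the record sets with pandas, as a user would after retrieval.
joined = users.merge(events, on="user_id", how="inner")

# Aggregate: number of events per user.
counts = joined.groupby("name")["event"].count()
```

This keeps mlcroissant's responsibility minimal: it only needs to hand each record set over as a dataframe, and all join/aggregation semantics stay in pandas.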
If the file object is a CSV, TSV or parquet file, mlcroissant is already using pandas in its internals. However, I could not find any public API to fetch a record set as a pandas dataframe.

After a bit of tweaking, the closest thing I could achieve with the public API was to materialize the records into a list and pass it to `pd.DataFrame.from_records`, but it seems incredibly inefficient for many reasons:

- the `Records` iterable has no `__len__` attribute: this means that we allocate a lot of memory to temporarily store all those records as a list of dicts of Python objects before being able to load them efficiently into the pandas dataframe;
- all values are converted to Python objects (`str`, `int`, `float`, ...) in the process and are then garbage collected once consumed by `pd.DataFrame.from_records`: this causes a lot of unnecessary overhead via Python GC housekeeping of many small objects for no good reason;
- the original dtype information is lost (`int32` vs `int64` vs `uint8` ..., or nominal or ordinal categorical dtypes), hence the resulting dataframe might lose important side information for the downstream tasks. Some of this information (e.g. categorical dtype info) might be present in the `.metadata` attribute of the dataset, but retyping the dataframe columns from it requires extra effort and is yet another cause of inefficiency.

All of those problems would vanish if there were a way to access the underlying internal pandas dataframe whenever a given record set is only backed by a single file object read by pandas.
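A minimal sketch of the list-of-dicts round trip described above. A generator of plain dicts stands in for the mlcroissant `Records` iterable (in real use it would come from something like `mlc.Dataset(...).records(...)`); the sketch shows both the forced materialization and the dtype loss:

```python
import pandas as pd

def records():
    # Stand-in for the mlcroissant Records iterable: yields one plain
    # Python dict per row, values already converted to Python objects.
    for i in range(5):
        yield {"id": i, "score": float(i) / 2}

# The iterable has no __len__, so it must first be materialized into a
# full list of dicts before pandas can consume it.
rows = list(records())
df = pd.DataFrame.from_records(rows)

# Any source dtype information is gone: pandas infers default dtypes
# from the Python objects (e.g. int64/float64), regardless of whether
# the source file stored int32, uint8, or categorical columns.
```

Exposing the internal dataframe directly would skip both the intermediate list of dicts and the dtype re-inference step.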