Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use columntable in datavaluerows for col sources #279

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

davidanthoff
Copy link
Collaborator

@quinnj, the idea here would be that https://github.com/JuliaData/DataFrames.jl/blob/b1fdb7621fa38c381dad928317ec1de562d09e71/src/other/tables.jl#L93 could be rewritten as

IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) = Tables.datavaluerows(df)

But I'm not entirely sure this is right :) I think this PR only makes sense if the idea is that for any table that is primarily column based, one should first create a columntable before one calls the datavaluerows function. Is that so? I don't really understand the underlying mechanics there.

The benefit of this PR would be that now the instructions for any Tables.jl source that wants to also enable TableTraits.jl integration would be a bit simpler, i.e. the instructions would be to just always add

IteratorInterfaceExtensions.getiterator(x::MyType) = Tables.datavaluerows(x)
IteratorInterfaceExtensions.isiterable(::MyType) = true
TableTraits.isiterabletable(::MyType) = true

and that would work regardless of whether the source internally stores things as columns or rows.

@davidanthoff
Copy link
Collaborator Author

Actually, just out of curiosity: why is the call to Tables.columntable needed in https://github.com/JuliaData/DataFrames.jl/blob/b1fdb7621fa38c381dad928317ec1de562d09e71/src/other/tables.jl#L94 in the currently shipping version?

@codecov
Copy link

codecov bot commented Apr 6, 2022

Codecov Report

Merging #279 (3ddc401) into main (138c5be) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #279   +/-   ##
=======================================
  Coverage   94.88%   94.88%           
=======================================
  Files           7        7           
  Lines         665      665           
=======================================
  Hits          631      631           
  Misses         34       34           
Impacted Files Coverage Δ
src/tofromdatavalues.jl 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 138c5be...3ddc401. Read the comment docs.

@bkamins
Copy link
Member

bkamins commented Apr 6, 2022

why is the call to Tables.columntable needed

I have not implemented it, but it turns type-unstable to type-stable container.

@quinnj
Copy link
Member

quinnj commented Apr 9, 2022

Actually, just out of curiosity: why is the call to Tables.columntable needed in https://github.com/JuliaData/DataFrames.jl/blob/b1fdb7621fa38c381dad928317ec1de562d09e71/src/other/tables.jl#L94 in the currently shipping version?

This is necessary for performance in Query.jl, where QueryOperators expect type-stable rows. At one point, we had the case where we were calling NamedTuple(::DataFrameRow) and it was too expensive and hurt performance in Query framework. We then had the discussion that Query.jl is really only geared towards narrow tables that can return something fully type-stable from getiterator (i.e. concrete eltype), so hence the change to call Tables.columntable.

src/tofromdatavalues.jl Outdated Show resolved Hide resolved
@davidanthoff
Copy link
Collaborator Author

@quinnj one more question: in the case where Tables.columnaccess(x)==false, this PR now would essentially amount to

    r = Tables.rows(x)
    s = Tables.schema(r)
    s === nothing && error("Schemaless sources cannot be passed to datavaluerows.")
    return DataValueRowIterator{datavaluenamedtuple(s), typeof(s), typeof(r)}(r)

but rereading your comment above, that would probably be type-instable, right?

The scenario that I have in the back of my mind is row wise reading from CSV.jl in combination with Query.jl. It would be nice if something like

CSV.Rows(filename) |> @filter(...) |> @select(...) |> DataFrame

would just work out of the box. My thinking had been that if we merge this PR here and then add:

IteratorInterfaceExtensions.getiterator(x::CSV.Rows) = Tables.datavaluerows(x)
IteratorInterfaceExtensions.isiterable(::CSV.Rows) = true
TableTraits.isiterabletable(::CSV.Rows) = true

it would all work, but now I'm thinking that we might end up with a type instable situation, right?

Is there some equivalent of Tables.columntable that we could use in the row iterating case that doesn't materialize all the columns, but just provides an automatic type stable named tuple iterator around a row wise source like CSV.Rows?

@quinnj
Copy link
Member

quinnj commented Apr 14, 2022

Is there some equivalent of Tables.columntable that we could use in the row iterating case that doesn't materialize all the columns, but just provides an automatic type stable named tuple iterator around a row wise source like CSV.Rows?

I think for the rows case, we just want to wrap them as-is with IteratorInterfaceExtensions.getiterator(x::CSV.Rows) = Tables.datavaluerows(x) as you specified above; the point is that DataValueRowIterator is type-stable and turns any Tables.jl-compatible "row" object into a fully-typed NamedTuple w/ DataValue for missing. So I think this is PR is fine as-is. We could probably do some performance testing to confirm, but I believe that would be essentially the same perf testing I did originally when adding this functionality to Tables.jl to cover the Query.jl cases for convenience. Really the main value of this PR is allow column-based sources to more automatically be type-stable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants