Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: In-memory sparse array interchange #840

Open
hameerabbasi opened this issue Sep 6, 2024 · 15 comments
Open

RFC: In-memory sparse array interchange #840

hameerabbasi opened this issue Sep 6, 2024 · 15 comments
Labels
API extension Adds new functions or objects to the API. Needs Discussion Needs further discussion. RFC Request for comments. Feature requests and proposed changes. topic: DLPack DLPack.

Comments

@hameerabbasi
Copy link
Contributor

hameerabbasi commented Sep 6, 2024

Motivation

The sparse (also called PyData/Sparse) development team have been working on integration efforts with the ecosystem, most notably with SciPy, scikit-learn and others, with CuPy, PyTorch, JAX and TensorFlow also on the radar. One of the challenges we were facing was the lack of (possibly zero-copy) interchange between the different sparse array implementations. We believe this may be a pain point for many sparse array implementations moving forward.

This mirrors an issue seen for dense arrays previously, where the DLPack protocol was the one of the first things to be standardised. We're hoping to achieve community consensus for a similar problem.

Luckily, all sparse array formats (with the possible exception of DOK) are usually collections of dense arrays underneath. In addition, this problem has been solved for on-disk arrays before by the binsparse specification. @willow-ahrens is a co-author of that spec, and is also a collaborator for the sparse work.

Proposal

We propose introducing two new methods to the array-API compliant sparse array objects (such as those in sparse), which are described below.

__binsparse__

Returns a 2-tuple (binsparse_descriptor, constituent_arrays).

The first item is a dict equivalent to a parsed JSON binsparse descriptor of an array.

The second item is a list of __dlpack__ compatible arrays, which are the constituent arrays of the sparse array.

Introduction of from_binsparse function.

If a library supports sparse arrays, its from_binsparse method should support accepting (when possible, zero-copy) versions of objects that follow this __binsparse__ protocol, and have an equivalent sparse format within the library.

Psuedocode implementation

Here's a psuedocode example using two libraries, xp1 and xp2, both supporting sparse arrays:

# In library code:
xp2_sparray = xp2.from_binsparse(xp1_sparray, ...)

# This psuedocode impl is common between `xp1` and `xp2`
def from_binsparse(x: object, /, *, device: device | None = None, copy: bool | None = None) -> array:
    binsparse_impl = getattr(x, "__binsparse__", None)
    if binsparse_impl is None:
        raise TypeError(...)
    
    binsparse_descriptor, constituent_arrays = binsparse_impl()
    # Will raise an error if the format/descriptor is unsupported.
    sparse_type = _type_from_binsparse_descriptor(binsparse_descriptor)
    # Some reordering may need to happen somewhere to match internal order of constituent arrays
    my_constituent_arrays = tuple(
        from_dlpack(arr, device=device, copy=copy) for arr in constituent_arrays
    )
    return sparse_type.from_strided_arrays(my_constituent_arrays, shape=...)

Parallel implementation in sparse: pydata/sparse#764

Alternative solutions

There are formats for on-disk sparse-array interchange [1] [2]; but none for in-memory interchange. binsparse is the one that comes closest to offering in-memory interchange.

Pinging possibly interested parties:

Updated on 2024.10.09 as agreed in #840 (comment).

@kgryte kgryte added RFC Request for comments. Feature requests and proposed changes. API extension Adds new functions or objects to the API. topic: DLPack DLPack. Needs Discussion Needs further discussion. labels Sep 6, 2024
@data-apis data-apis deleted a comment Sep 6, 2024
@pearu
Copy link

pearu commented Sep 6, 2024

A couple of quick notes.

First, dlpack and binsparse are self-contained specifications where dlpack provides a protocol for sharing strided arrays between different array libraries and binsparse provides a unified description of various sparse formats. This proposal tries to glue these together by implicitly extending DLPack protocol but that contradicts with the semantics of DLPack that "describes the memory layout of dense, strided, n-dimensional arrays". In addition, the Python Array API standard specifies that DLPack should not support sparse arrays.

My suggestion is to cook up a sparse array interchange protocol that may use dlpack and binsparse (as these are obviously relevant pieces of this problem) but in a cleaner way.

For instance, binsparse specifies that the binsparse-compatible object provides a 2-tuple (<sparse array descriptor>, <list of sparse array data as strided arrays>). So, the appropriate place to deploy the dlpack protocol is to require that the arrays in the <list of sparse array data as strided arrays> must support the dlpack protocol.

For instance 2, introduce "Python specification of binsparse" (similar to one in dlpack) that consists of

  1. a from_binsparse function that accepts objects that implement the binsparse protocol per 2. below
  2. a __binsparse__ method on the objects representing sparse arrays that will return a 2-tuple (<sparse array descriptor>, <list of sparse array data as strided arrays>) which is used by the from_binsparse function of the consumer library.

Also, the specification should include the C-side interface as well.

What do you think?

@hameerabbasi
Copy link
Contributor Author

While I'm on board with the proposed __binsparse__ protocol, I'm of the opinion that sparse arrays should have as little user-facing API differences as possible as compared to strided arrays, i.e. using sparse arrays should be as frictionless from the Array API side as using strided arrays. Allocating a different function based on sparse and dense arrays (from_dlpack vs from_binsparse) seems counter to that.

I'm not opposed to a C-side specification, I'd welcome that so that the benefits can extend beyond the Python ecosystem. We do have to be mindful that the binsparse specification requires key-value maps, and we would need to maintain their relative order. Perhaps an iterable of 2-tuples makes more sense in this case.

@pearu
Copy link

pearu commented Sep 7, 2024

While I'm on board with the proposed __binsparse__ protocol, I'm of the opinion that sparse arrays should have as little user-facing API differences as possible as compared to strided arrays, i.e. using sparse arrays should be as frictionless from the Array API side as using strided arrays.

Absolutely. I presume that users of sparse arrays are aware of advantages and disadvantages of using a particular sparse format over other sparse formats or strided arrays. Using a particular sparse format is a choice of optimization method that must be supported by user-facing API.

Allocating a different function based on sparse and dense arrays (from_dlpack vs from_binsparse) seems counter to that.

from_dlpack is specifically defined for strided arrays, both by the Python Array API and dlpack. So, I find abusing the from_dlpack function for sparse arrays more confusing than providing a new API function which would follow the same design pattern that dlpack introduced: the pair (from_foo, __foo__) provides a minimal API for sharing data between provider and consumer libraries that support a protocol foo.

Btw, if one considers strided and sparse arrays semantically equivalent (as in PyTorch, for instance), a more intuitive approach would be to use asarray instead of from_dlpack as a common API function for constructing strided or sparse arrays from provider library objects. Although, Python Array API does not support that and existing implementations (PyTorch, JAX, ...) do not mix the functionalities of asarray and from_dlpack. I don't suggest to do that here either.

@hameerabbasi
Copy link
Contributor Author

hameerabbasi commented Sep 9, 2024

Btw, if one considers strided and sparse arrays semantically equivalent (as in PyTorch, for instance), a more intuitive approach would be to use asarray instead of from_dlpack as a common API function for constructing strided or sparse arrays from provider library objects.

While I agree with a common API; the disadvantage I find in this approach is that since from_dlpack requires the producer to pass ownership to the consumer; asarray requires one to maintain the ownership of the consumed array. This, in turn, means that there's no way to have a zero-copy transfer using asarray using the from_dlpack function; i.e. one would have to do something like def asarray(x, ...): return from_dlpack(copy(x), ...) somewhere to maintain ownership semantics, incurring a copy. See #840 (comment)

My intention was to have an API that essentially could support zero-copy interchange across libraries, regardless of whether the consumed array was strided, sparse or something else.

@pearu
Copy link

pearu commented Sep 9, 2024

While I agree with a common API; the disadvantage I find in this approach is that since from_dlpack requires the producer to pass ownership to the consumer;

This is incorrect. When using from_dlpack(x), the producer keeps owning the memory of x. See https://dmlc.github.io/dlpack/latest/python_spec.html#semantics.

@hameerabbasi
Copy link
Contributor Author

While I agree with a common API; the disadvantage I find in this approach is that since from_dlpack requires the producer to pass ownership to the consumer;

This is incorrect. When using from_dlpack(x), the producer keeps owning the memory of x. See https://dmlc.github.io/dlpack/latest/python_spec.html#semantics.

Ah, in that case; yes, asarray can use from_dlpack and a possible from_binsparse, and still be zero-copy. Thanks for the correction; asarray seems like the right format-agnostic API for this.

@hameerabbasi
Copy link
Contributor Author

I've updated the issue with @pearu's feedback.

@willow-ahrens
Copy link

I'd like to additionally CC the binsparse contributors so they're aware of this effort.
@BenBrock
@eriknw
@ivirshup
@jim22k
@DrTimothyAldenDavis

@kgryte
Copy link
Contributor

kgryte commented Sep 16, 2024

@hameerabbasi Are there any other alternative sparse interchange protocols that we should be considering, or is binsparse the only game in town? Given that this RFC proposes standardizing a particular approach, I'd be interested in at least covering our bases and having an understanding of other prior art.

@hameerabbasi
Copy link
Contributor Author

hameerabbasi commented Sep 16, 2024

There are other on-disk/text formats: 1 2

Both of these are not designed for in-memory interchange of data -- and have no support for multiple sparse formats. I'll add these to the issue.

@willow-ahrens
Copy link

willow-ahrens commented Sep 18, 2024

There's also the FROSTT format, which stores tensors in a minimal text encoding and has the same limitations of the above.

http://frostt.io/

@rgommers
Copy link
Member

rgommers commented Sep 19, 2024

The relevant issue on the binsparse repo for an in-memory representation is GraphBLAS/binsparse-specification#4. That also mentions that Arrow has support for 2-D CSR/CSC/CSF/COO (no n-D, no chunking).

There's of course a bit of a chicken-and-egg problem here: it's hard to standardize anything that doesn't exist yet, but library implementors want to move ahead and implement something that is likely to be standardized later to avoid fragmentation or bc-breaking changes.

So here is what I think what should happen:

  • This issue should work out what the desired Python API will be for sparse in-memory interchange (the proposed analogue to DLPack with a dunder method like __binsparse__ sounds about right; whether more is needed a la __dlpack_device__ is TBD I think).
  • The actual binsparse in-memory spec should be worked out on GraphBLAS/binsparse-specification#4.
    • Seems more appropriate to discuss there rather than here, since that's the project that aims to define this and has more history and I assume will have additional constraints that aren't relevant for this repo, like aligment between the on-disk and in-memory metadata.
    • DLPack seems like a more appropriate choice than Arrow from the perspective of array libraries to me, due to broader support of relevant devices and dtypes, and being already in the array API standard and supported by all relevant array libraries. I can imagine that binsparse could decide to support both though, in case there's a real-world need for Arrow support.
  • Libraries should then go ahead and implement that.
  • Once we have multiple implementations that seem to work well, we move the Python API worked out in this issue into the standard.

@pearu
Copy link

pearu commented Sep 19, 2024

PyTorch has sparse formats that binsparse specifications does not specify as pre-defined formats. For instance, there are BSR/BSC (blocked CSR/CSC) formats, hybrid sparse formats (COO/CSR/CSC/BSR/BSC with values being strided tensors), sparse formats with batch dimensions (CSR/CSC/BSR/BSC with indices being multidimensional), and batch/hybrid sparse formats (combinations of hybrid and batched sparse formats).

So, for PyTorch, the usefulness of the in-memory sparse array interchange format based on binsparse specification will depend on how easy is to define these blocked/hybrid/batch sparse formats using binsparse custom format support, or better yet, on adding blocked/hybrid/batch sparse formats as pre-defined to binsparse specification.

@pearu
Copy link

pearu commented Sep 19, 2024

CC: @amjames (for torch.sparse)

@hameerabbasi
Copy link
Contributor Author

hameerabbasi commented Sep 19, 2024

So, for PyTorch, the usefulness of the in-memory sparse array interchange format based on binsparse specification will depend on how easy is to define these blocked/hybrid/batch sparse formats using binsparse custom format support, or better yet, on adding blocked/hybrid/batch sparse formats as pre-defined to binsparse specification.

Since __binsparse__ is meant to be a mainly cross-library format, the usefulness is limited also by other implementations that support blocked/hybrid/batch formats. If PyTorch is the only implementation of these formats, then adding it to binsparse makes little sense, as there won't be another library to do the exchange with.

Incidentally, I do agree with the notion of blocked formats being supported. They are supported in the MLIR sparse_tensor dialect, and in SciPy as well. The cleanest way to support these seems to be an index map: A mapping from the 4-D virtual indices to 2-D indices using the operators +, -, *, //, % only, although they aren't supported yet. This also leads to supporting the diagonal format. I'll raise an issue on the binsparse repo and link it here.

xref GraphBLAS/binsparse-specification#56

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API extension Adds new functions or objects to the API. Needs Discussion Needs further discussion. RFC Request for comments. Feature requests and proposed changes. topic: DLPack DLPack.
Projects
Status: Stage 1
Development

No branches or pull requests

5 participants