
RFC: A General Recipe for Generic Rules and Natural Tangents (hopefully...) #449

Open
wants to merge 36 commits into base: main

Conversation

@willtebbutt (Member) commented Aug 28, 2021

This PR is a proposal I've been working on to address the issues discussed (at length) in #441 .

To understand the PR, please consult notes.md first, and examples.jl once you've done that.

In particular, this proposal aims to achieve a clean separation between the world of structural tangents, which work well with AD, and natural tangents, which humans often prefer -- the advantages of such a setup are laid out in notes.md.

This is not as much of a beast as it initially seems, as most of the LOC are contained in notes.md and examples.jl, neither of which would ever be merged. Also, much of the code presently in src/destructure.jl will be moved to the tests before merging.

@willtebbutt willtebbutt changed the title A General Recipe for Generic Rules and Natural Tangents A General Recipe for Generic Rules and Natural Tangents (hopefully...) Aug 28, 2021
@willtebbutt willtebbutt changed the title A General Recipe for Generic Rules and Natural Tangents (hopefully...) RFC: A General Recipe for Generic Rules and Natural Tangents (hopefully...) Aug 28, 2021
@codecov-commenter commented Aug 28, 2021

Codecov Report

Merging #449 (20f7727) into master (8218c2c) will decrease coverage by 80.47%.
The diff coverage is 61.15%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #449       +/-   ##
===========================================
- Coverage   92.85%   12.38%   -80.48%     
===========================================
  Files          14       15        +1     
  Lines         784      945      +161     
===========================================
- Hits          728      117      -611     
- Misses         56      828      +772     
Impacted Files                   Coverage Δ
src/ChainRulesCore.jl            100.00% <ø> (ø)
src/rule_definition_tools.jl     1.13% <0.00%> (-94.84%) ⬇️
src/destructure.jl               71.42% <71.42%> (ø)
src/config.jl                    0.00% <0.00%> (-100.00%) ⬇️
src/debug_mode.jl                0.00% <0.00%> (-100.00%) ⬇️
src/accumulation.jl              0.00% <0.00%> (-97.23%) ⬇️
src/projection.jl                0.95% <0.00%> (-97.12%) ⬇️
src/differential_arithmetic.jl   0.00% <0.00%> (-96.34%) ⬇️
src/rules.jl                     0.00% <0.00%> (-90.91%) ⬇️
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8218c2c...20f7727.

@willtebbutt (Member Author)

Tidied up examples.jl and added interesting example with Fill -- there is some freedom in the choice of implementation of (::Restructure) for Fill, for reasons discussed in examples.jl.

@willtebbutt (Member Author)

Added SArray example at the end of examples.jl.

@willtebbutt (Member Author)

Added a WoodburyPDMat example (see PDMatsExtras.jl). I'm struggling to get the @opt_out macro to work for me -- @mzgubic @oxinabox any thoughts on the culprit? (See the end of examples.jl.)

@willtebbutt (Member Author)

Turns out that the above problem was related to my not specifying that the rule in question was restricted to a CommutativeMulNumber. The example in question now works.

I've also added some examples with Kronecker.jl.

@willtebbutt (Member Author) commented Sep 8, 2021

Added extra tests for ScaledMatrix. Interestingly, it appears that the abstractions don't always work in this case -- it seems to be because ScaledMatrix is over-parametrised.

For under-parametrised matrices I'm fairly confident that we can think of the Jacobian of restructure as a left-inverse of the Jacobian of destructure (a left-inverse should always exist, I believe). This doesn't work for over-parametrised types because the Jacobian of destructure maps from a larger number of dims to a smaller number of dims, so doesn't admit a left-inverse (to see this, think about the rank of a matrix product of the form (high_dim, low_dim) * (low_dim, high_dim) vs the rank of the identity matrix of size (high_dim, high_dim)).

Not sure if there's a way to fix this. Doesn't seem to be an issue when either

  1. the pullback of restructure isn't needed, because the function doesn't output a ScaledMatrix, or
  2. the function is composed with another function which doesn't output a ScaledMatrix. Possibly this is essentially the same as the above?

Not sure what the geometric explanation of these last two points is. Presumably there's a nice one.

edit: possibly we can understand what's going on here by considering the pseudo-inverse of the Jacobian of destructure.

Assume that the Jacobian of restructure, J*, is a pseudo-inverse of the Jacobian of destructure, J. If J has dimensions (high_dim, low_dim), where high_dim > low_dim, then J* is a left-inverse for J.

Conversely, if J has dimensions (low_dim, high_dim) then there doesn't exist a left-inverse, but the pseudo-inverse should exist, and will satisfy v' J* J J* == v' J* and w' J J* J == w' J. I've confirmed that my implementation of the pullback of destructure for ScaledMatrix satisfies these, but doesn't satisfy v' J* J == v', i.e. it is a pseudo-inverse, but not a left-inverse. This would also explain why, if you have a function like

f(x::ScaledMatrix) = my_sum(my_scale(x))

where both my_sum and my_scale utilise natural pullbacks and my_scale just uses a scalar, you don't see any problems. I imagine that problems would appear if I used a more general transformation, but in that case the output probably wouldn't be a ScaledMatrix, so maybe it would be fine?
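The pseudo-inverse identities quoted above are easy to check numerically with LinearAlgebra. This is just an illustrative sketch with an arbitrary wide Jacobian, not anything from this PR:

```julia
using LinearAlgebra

# A wide Jacobian J with (low_dim, high_dim) = (2, 5), as for an over-parametrised type.
J = [1.0 0.0 2.0 0.0 1.0;
     0.0 1.0 0.0 3.0 0.0]
Jstar = pinv(J)  # the Moore-Penrose pseudo-inverse

# The two identities above hold...
@assert Jstar * J * Jstar ≈ Jstar
@assert J * Jstar * J ≈ J
# ...but J* is not a left-inverse when J is wide:
@assert !(Jstar * J ≈ I(5))
```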

edit2: There are a couple of ways that we could use this information to guide rule-implementers. My inclination is just to suggest that type-authors avoid implementing pullback_of_restructure entirely if their type is over-parametrised, and accept that they'll get access to most, but not all, generic rrules. The ones they typically won't get access to are those for which they have specialised methods anyway (otherwise it's hard to see how their array type would be produced), and they can just opt out of these and ensure that they are differentiable.

edit3: a consequence of the left-inverse Jacobian stuff (for under-parametrised matrices) is that the pullback of restructure is simply a right-inverse for the pullback of destructure, and the pushforward of restructure is a left-inverse for the pushforward of destructure. i.e.

pushforward_restructure(pushforward_destructure(x)) == x
pullback_destructure(pullback_restructure(x)) == x

This gives a really nice way to derive what you need for restructure without thinking about how to actually implement restructure itself -- once you've found the pushforward / pullback for destructure, just find their left- and right-inverses respectively. Moreover, if these inverses don't exist, don't try to implement them, and accept that there are some kinds of rrules that you just can't use.
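For an under-parametrised type like Diagonal, the second identity is easy to check concretely. The function names below follow the prose in this thread, not any actual API:

```julia
using LinearAlgebra

# Sketch for Diagonal: destructure == collect, so its pullback extracts the diagonal
# of a dense cotangent; the pullback of restructure embeds a diagonal cotangent
# back into a dense matrix. (Function names are illustrative.)
pullback_destructure(dA::AbstractMatrix) = diag(dA)
pullback_restructure(ddiag::AbstractVector) = Matrix(Diagonal(ddiag))

v = [1.0, 2.0, 3.0]
# pullback_destructure(pullback_restructure(x)) == x, as stated above:
@assert pullback_destructure(pullback_restructure(v)) == v
```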

edit4: I've managed to show that, if you want a sensible definition of the natural pullback for the identity function, the pullback of restructure must be a right-inverse of the pullback of destructure. However, I've not managed to determine whether any right-inverse will do; I'd like to know if that's the case.

@oxinabox (Member)

I continue to want to submit a workshop paper about this, to get feedback.
I have thus put together an intro and background.


In languages featuring polymorphism, the representation of the derivative of types that represent matrices poses a challenge.
Languages such as Julia allow both a primitive dense matrix, implemented as Matrix (i.e. Array{T,2}), which subtypes AbstractMatrix, and rich user-extensible types implemented by structures (classes), which also subtype AbstractMatrix.
This poses a problem, as such types intrinsically have at least two possible representations of their (co-)tangents.
The first is the structural tangent, based on their being structs, which stores a tangent for each field.
The second is the natural tangent, an AbstractMatrix which reflects the nature of the type as already representing a vector space.
As we will discuss, both of these representations are useful, and one cannot reasonably require a system to support just one.
The challenge is thus how to convert between these two representations.
The core insight of this paper is that the conversion operation is exactly the pullback of the collect operation that turns one of these rich struct-based AbstractMatrixes into a primitive dense Matrix.

We can define the idea of an AbstractMatrix subtype very simply, as something implementing the following interface:

  • eachindex(A), returning all indices of the matrix A.
  • getindex(A, i), returning the element of A at index i.

From this simple interface, all other expected AbstractMatrix functionality can be implemented, though additional specializations might be added for speed or numerical accuracy, depending on the algorithm in question.

The collect operation, which converts any AbstractMatrix subtype into a dense Matrix, can be implemented by cycling through all indices with eachindex and setting the matching index of a dense Matrix to what getindex on the AbstractMatrix returns. Many AbstractMatrix types implement more specialized, faster methods that produce the same result.
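As a sketch, the interface and the generic collect loop might look like this (the type name MinimalDiagonal is hypothetical, for illustration only):

```julia
# A hypothetical minimal AbstractMatrix subtype: stores only its diagonal.
struct MinimalDiagonal{T} <: AbstractMatrix{T}
    diag::Vector{T}
end

Base.size(A::MinimalDiagonal) = (length(A.diag), length(A.diag))
Base.getindex(A::MinimalDiagonal, i::Int, j::Int) = i == j ? A.diag[i] : zero(eltype(A))

# Generic collect: cycle through all indices, copying each element into a dense Matrix.
function collect_dense(A::AbstractMatrix)
    out = Matrix{eltype(A)}(undef, size(A))
    for i in eachindex(IndexCartesian(), A)
        out[i] = A[i]
    end
    return out
end

A = MinimalDiagonal([1.0, 2.0])
@assert collect_dense(A) == collect(A) == [1.0 0.0; 0.0 2.0]
```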

Background: Polymorphic AbstractMatrix subtypes

These struct-based AbstractMatrix types are very useful.
They can massively reduce the memory that must be stored.
For example:

  • A OneHot matrix, which represents a column vector with only one nonzero element, which is set to 1.
    • It only has to store one index and the size.
    • getindex returns 0 for all indices except the stored one.
  • Diagonal, which represents a diagonal matrix.
    • It only has to store the vector of diagonal entries.
    • getindex returns 0 when asked for off-diagonal elements.
  • Identity, which represents an identity matrix.
    • It only has to store the size.
    • getindex returns 1 on the diagonal and 0 off it.
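For instance, the OneHot example might be sketched as follows (the name OneHotVector is hypothetical, for illustration):

```julia
# Hypothetical one-hot vector: stores only one index and the length.
struct OneHotVector <: AbstractVector{Int}
    idx::Int
    len::Int
end

Base.size(v::OneHotVector) = (v.len,)
Base.getindex(v::OneHotVector, i::Int) = Int(i == v.idx)

v = OneHotVector(2, 4)
@assert collect(v) == [0, 1, 0, 0]
@assert sizeof(v) == 2 * sizeof(Int)  # only two integers stored, regardless of length
```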

They can be used to implement matrices with structural sparsity, allowing algorithms with improved time complexity.
For example, for matrix multiplication:

  • *(::Matrix, ::Matrix):
    • multiply rows by columns and sum.
    • $O(n^3)$ time
  • *(::Matrix, ::Diagonal) or *(::Diagonal, ::Matrix):
    • column-wise/row-wise scaling.
    • $O(n^2)$ time.
  • *(::Matrix, ::OneHot):
    • row-wise slicing.
    • $O(n)$ time.
  • *(::Identity, ::Matrix) or *(::Matrix, ::Identity):
    • no change.
    • $O(1)$ time.
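The O(n) one-hot case above can be sketched as a specialised method (the OneHotVector type is hypothetical, as before):

```julia
# Hypothetical one-hot vector type.
struct OneHotVector <: AbstractVector{Int}
    idx::Int
    len::Int
end
Base.size(v::OneHotVector) = (v.len,)
Base.getindex(v::OneHotVector, i::Int) = Int(i == v.idx)

# Specialised *(::Matrix, ::OneHotVector): just slice out one column, O(n)
# instead of the generic O(n^2) matrix-vector product.
Base.:*(A::Matrix, v::OneHotVector) = A[:, v.idx]

A = [1.0 2.0; 3.0 4.0]
@assert A * OneHotVector(2, 2) == [2.0, 4.0]
@assert A * OneHotVector(2, 2) == A * collect(OneHotVector(2, 2))  # agrees with the generic path
```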

For example, for the sum of all elements:

  • sum(::Matrix):
    • must iterate every element.
    • $O(n^2)$ time
  • sum(::Diagonal)
    • sum only the stored elements from the diagonal
    • $O(n)$ time.
  • sum(::OneHot)
    • return 1
    • $O(1)$ time.
  • sum(::Identity)
    • return the size of either axes
    • $O(1)$ time.
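A sketch of the O(1) identity case (the IdentityMatrix type is hypothetical, for illustration):

```julia
# Hypothetical identity-matrix type: stores only its size.
struct IdentityMatrix <: AbstractMatrix{Int}
    n::Int
end
Base.size(A::IdentityMatrix) = (A.n, A.n)
Base.getindex(A::IdentityMatrix, i::Int, j::Int) = Int(i == j)

# O(1) specialisation: the sum of an n×n identity matrix is n.
Base.sum(A::IdentityMatrix) = A.n

@assert sum(IdentityMatrix(5)) == 5
@assert sum(collect(IdentityMatrix(5))) == 5  # matches the generic O(n^2) path
```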

Many other examples exist: from block-sparse matrices where blocks are zero, to symmetric and skew-symmetric matrices, to fill matrices always containing the same element.
Yet another use of these polymorphic matrix types is to carry metadata, such as axis names {CITE TensorsConsideredHarmful}.
These rich, extensible matrix types are very useful for scientific computing.

Background: When do we want each tangent type

Structural tangents have two key uses, beyond being a generally useful representation: getfield, and constructors.

The pullback of the getfield operation must return a structural tangent: in particular, one that has propagated the cotangent for the result to the accessed field.
getfield is not an array operation; it is an operation on structs.
It thus has no predetermined representation as a natural tangent.
getfield is a crucial operation to be able to differentiate through, as almost all AbstractMatrix subtypes will use it in their implementations of other operations, including getindex.
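This behaviour of the getfield pullback can be sketched as follows, modelling the structural tangent as a NamedTuple mirroring the fields (types and names hypothetical):

```julia
# Hypothetical struct-based matrix type.
struct MinimalDiagonal{T}
    diag::Vector{T}
end

# The pullback of getfield(A, field) places the incoming cotangent dy in the slot
# for the accessed field, with no contribution to the other fields.
function getfield_pullback(::MinimalDiagonal, field::Symbol, dy)
    return (diag = field === :diag ? dy : nothing,)
end

@assert getfield_pullback(MinimalDiagonal([1.0, 2.0]), :diag, [10.0, 20.0]) ==
        (diag = [10.0, 20.0],)
```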

The constructor's pullback is naturally implemented in terms of the structural tangent.
Here we consider the "default constructor", which takes as its inputs the values for each field and returns an object of the type (rather than an overloaded constructor, which might not, but which will eventually call the "default constructor").
The pullback for the constructor needs to take the cotangent for the whole object and break it up into tangents for each of the fields that were input.
This is trivially done with a structural tangent, as it carries a tangent for each field.
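A sketch of this splitting, again modelling the structural tangent as a NamedTuple mirroring the fields (the type and names are hypothetical):

```julia
# Hypothetical two-field type built by its default constructor Scaled(scale, data).
struct Scaled{T}
    scale::T
    data::Vector{T}
end

# The constructor's pullback reads one cotangent per input field off the
# structural cotangent of the whole object.
constructor_pullback(dself) = (dself.scale, dself.data)

dself = (scale = 2.0, data = [1.0, 2.0])
@assert constructor_pullback(dself) == (2.0, [1.0, 2.0])
```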

However, a key limitation of the structural tangent is that it does not implement getindex.
This is in contrast to the natural tangent, which, being another AbstractArray type, does.
Automatic differentiation, despite some claims, cannot handle all code correctly or efficiently in practice.
For some algorithms a manually specified rule must be used, or can be used to give a substantial speed-up.
Three examples are:

  • Derivatives through optimization problems, where the implicit function theorem is used to determine the derivative.
  • Derivatives through differential equation solves, which use a variety of differential equation sensitivity methods; these often involve solving an extended set of equations for both the result and its derivative.
  • Fast Fourier transforms, where the pullback is another kind of fast Fourier transform, and so can also make use of carefully optimized libraries.

All these examples require a cotangent type that is natural and supports the operations an array does.

notes.md (resolved)
notes.md (outdated, resolved)

`destructure` is quite straightforward to define -- essentially equivalent to `collect`. I'm confident that this is always going to be simple to define, because `collect` is always easy to define.

`Restructure(C)(C_dense)` is a bit trickier. It's the function which takes an `Array` `C_dense` and transforms it into `C`. This feels like a slightly odd thing to do, since we already have `C`, but it's necessary to already know what `C` is in order to construct this function in general -- for example, over-parametrised matrices require this (see the `ScaledMatrix` example in the tests / examples). I'm _reasonably_ confident that this is always going to be possible to define, but I might have missed something.
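For a simple under-parametrised type like Diagonal, the two maps might look as follows. This is only a sketch whose names mirror the prose above, not the PR's actual code:

```julia
using LinearAlgebra

destructure(A::Diagonal) = collect(A)  # Diagonal -> dense Matrix, like `collect`

# Restructure(C) closes over the known structure of C and rebuilds a Diagonal
# from a dense matrix that is assumed to be (numerically) diagonal.
struct Restructure{T}
    template::T
end
(r::Restructure{<:Diagonal})(C_dense::AbstractMatrix) = Diagonal(diag(C_dense))

C = Diagonal([1.0, 2.0])
@assert Restructure(C)(destructure(C)) == C  # round-trips
```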
Member:

I don't know how to do this for e.g. Woodbury, but I think you mentioned you thought about this and know how to do it?

Similarly, two different structured matrices might map to the same dense matrix. In that case I suppose it doesn't matter which structured matrix we get for the correctness?

Member Author:

I don't know how to do this for e.g. Woodbury, but I think you mentioned you thought about this and know how to do it?

The more I think about it, the less convinced I am that I know how to do this. I think this section on Restructure needs re-working / updating to reflect the increased level of uncertainty I have about how easy this is to do. For example, the ScaledMatrix example I mention is less straightforward than I originally thought.

Similarly, two different structured matrices might map to the same dense matrix. In that case I suppose it doesn't matter which structured matrix we get for the correctness?

Yeah, I think that's totally fine. But is this a destructure thing rather than Restructure?

notes.md (outdated, resolved)
@mcabbott (Member) commented Nov 3, 2021

This business of how to map between "natural" and "structural" representations was bugging me, but I now think it's actually very simple:

Think of the forward pass as maps W(fields) -> S(subspace) -> M(matrices). The first is roughly the constructor, S is a submanifold of the space of all matrices, and the second map just embeds that. The pullback of this whole chain always exists; it's what is called pullback_of_destructure here. It takes a cotangent in T_x^*M and gives one in T_w^*W, i.e. a Tangent. And it's easy to make AD do this for us: it's what Zygote.pullback(collect, Struct(fields...)) will do if you avoid enough generic rules.

The pullback of only the second map also always exists; it's a linear map from T_x^*M to T_x^*S. The result of this is what we mean by a natural representation. There is no freedom. There is also no guarantee that this will be something you can represent by the same struct. (What the struct can represent might not even be a vector space, e.g. rotation matrices.)

The apparent ambiguity described here in how to pick the natural tangent for Fill is just because one of the candidates isn't in T_x^*S. The ambiguity in ScaledVector is just because there's a fixed vector to describe, for which this struct allows various objects which are ==; but in this case there's no advantage to wrapping that vector in this struct: S is the full space.

In general pullback_of_destructure maps into a subspace of T_w^*W. There's a bijection from this subspace to the natural representation. But I don't think this has to be cheap: for instance the one for a lazy kron seems to need \. This bijection can be extended to a map from the whole of T_w^*W to T_x^*S. I think the unique way to do that is to demand that it be unaffected by the perpendicular directions. For Symmetric, this means demanding that it not depend on the lower triangle of the Tangent{Symmetric}(...).data (which is always zero anyway if this comes from pullback_of_destructure).

In this PR's description, I remain a bit confused by what the map Restructure does. What defines it, or what properties constrain it? Does it act on the manifold or the cotangent space? But pullback_of_restructure looks like it might be the map from T_w^*W to T_x^*S.

@willtebbutt (Member Author) commented Nov 4, 2021

The pullback of this whole chain always exists, it's what is called pullback_of_destructure here. It takes a cotangent in T_x^*M and gives one in T_w^*W, i.e. a Tangent. And it's easy to make AD do this for us, it's what Zygote.pullback(collect, Struct(fields...)) will do if you avoid enough generic rules.

Exactly. Roughly speaking, I had been thinking about this as

  • pushforward map of collect tells us how to convert structural tangent -> natural tangent
  • pullback map of collect tells us how to convert natural cotangent -> structural cotangent

I think it's basically fine to think about collect and destructure as being the same thing. It's just that we might want (the pushforward map associated with) destructure to produce a CuArray sometimes, rather than an Array, which we don't get to do if we use collect.

I'm struggling a little to see exactly what you mean by the intermediate step though. In particular, how are you thinking about S(subspace)? In my mental model, only W(fields) and M(matrices) exist, so I think we're currently thinking about this slightly differently.

In this PR's description, I remain a bit confused by what the map Restructure does. What defines it, or what properties constrain it? Does it act on the manifold or the cotangent space? But pullback_of_restructure looks like it might be the map from T_w^*W to T_x^*S.

It acts on the manifold. Honestly though, I'm really not sure that it always exists, and I'm still not 100% sure what its semantics are.
The intention was to make restructure say "I know that the original forwards pass produces a Diagonal. So the modified forwards pass, in which we use Matrixes, will have to produce a Matrix which is diagonal, so I'm free to convert it into a Diagonal".

It's probably easiest to ignore this for now, because it's a bit of a pain and I'm not really even sure how to derive this function in general (it could be something ugly and nonlinear, and I'm not entirely confident that it's uniquely defined). I'm confident that what's left is unambiguous, even if we're potentially missing a bit of functionality, so let's stick with that. We can revisit restructuring later if we like.

@willtebbutt (Member Author)

(Wrote this before I saw your doc on slack -- will try to read today)

@mcabbott (Member) commented Nov 4, 2021

to produce a CuArray sometimes

I guess we never actually want to produce this, for converting natural to structural. We want just pullback(collect, ...)[2] without making the [1] at all. So we could just call pullback(getindex, A, i)[2](dA[i]) a bunch of times. The two obstacles right now are (a) that this will hit a generic rule, and (b) that it'll allocate about N^2 for a length N array.

For (a) I wonder a bit if we want a call-site opt-out mechanism. Like rrule_via_ad(cfg, getindex, A, i; ignore=MyType) or something? Or would all types on which we want to perform natural-to-structural conversion globally opt out of the getindex rule? (In Zygote right now, it's an @adjoint rule not an rrule, but that could change.)

For (b), I guess we need in-place accumulation, either by something like FluxML/Zygote.jl#981 or by rrule(::typeof(getindex), ::Array, ...) making an InplaceThunk. Or, perhaps you could work around this and write a function which takes x::MyType, makes an all-zero Tangent{MyType}, and then writes into that. It knows it's going to call getindex everywhere, so there's no need to handle the result of each separately... it's a bit like map.

And, obstacle (c), it's difficult to see this working well at all for CuArrays. I guess that's like most generic fallbacks. A hand-written uncollect would often involve array-level operations and thus work fine.
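A sketch of such a hand-written uncollect, accumulating the per-index getindex pullbacks into one pre-allocated structural tangent (types and names hypothetical, with the structural tangent modelled as a NamedTuple):

```julia
# Hypothetical struct-based matrix type.
struct MinimalDiagonal{T}
    diag::Vector{T}
end

# uncollect: fold the pullback of getindex(A, i, j) at every index into a single
# pre-allocated structural tangent, avoiding one allocation per index.
function uncollect(A::MinimalDiagonal, dA_dense::AbstractMatrix)
    ddiag = zeros(eltype(dA_dense), length(A.diag))
    for i in axes(dA_dense, 1), j in axes(dA_dense, 2)
        # The pullback of getindex(A, i, j) only touches the `diag` field when i == j.
        if i == j
            ddiag[i] += dA_dense[i, j]
        end
    end
    return (diag = ddiag,)
end

@assert uncollect(MinimalDiagonal([0.0, 0.0]), [1.0 2.0; 3.0 4.0]) == (diag = [1.0, 4.0],)
```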

5 participants