RFC: A General Recipe for Generic Rules and Natural Tangents (hopefully...) #449

willtebbutt · 2021-08-28T21:56:18Z

This PR is a proposal I've been working on to address the issues discussed (at length) in #441 .

To understand the PR, please consult notes.md first, and examples.jl once you've done that.

In particular, this proposal aims to achieve a clean separation between the world of structural tangents, which work well with AD, and natural tangents, which humans often prefer -- the advantages of such a set up are laid out in notes.md.

This is not as much of a beast as it initially seems, as most of the LOC are contained in notes.md and examples.jl, neither of which would ever be merged. Also, much of the code presently in src/destructure.jl will be moved to the tests before merging.

codecov-commenter · 2021-08-28T21:58:08Z

Codecov Report

Merging #449 (20f7727) into master (8218c2c) will decrease coverage by 80.47%.
The diff coverage is 61.15%.

@@             Coverage Diff             @@
##           master     #449       +/-   ##
===========================================
- Coverage   92.85%   12.38%   -80.48%     
===========================================
  Files          14       15        +1     
  Lines         784      945      +161     
===========================================
- Hits          728      117      -611     
- Misses         56      828      +772

Impacted Files	Coverage Δ
src/ChainRulesCore.jl	`100.00% <ø> (ø)`
src/rule_definition_tools.jl	`1.13% <0.00%> (-94.84%)`	⬇️
src/destructure.jl	`71.42% <71.42%> (ø)`
src/config.jl	`0.00% <0.00%> (-100.00%)`	⬇️
src/debug_mode.jl	`0.00% <0.00%> (-100.00%)`	⬇️
src/accumulation.jl	`0.00% <0.00%> (-97.23%)`	⬇️
src/projection.jl	`0.95% <0.00%> (-97.12%)`	⬇️
src/differential_arithmetic.jl	`0.00% <0.00%> (-96.34%)`	⬇️
src/rules.jl	`0.00% <0.00%> (-90.91%)`	⬇️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8218c2c...20f7727. Read the comment docs.

willtebbutt · 2021-08-29T11:05:19Z

Tidied up examples.jl and added interesting example with Fill -- there is some freedom in the choice of implementation of (::Restructure) for Fill, for reasons discussed in examples.jl.

willtebbutt · 2021-08-29T11:24:54Z

Added SArray example at the end of examples.jl.

willtebbutt · 2021-08-31T15:53:42Z

Added a WoodburyPDMat example (see PDMatsExtras.jl) . I'm struggling to get the @opt_out macro to work for me -- @mzgubic @oxinabox any thoughts on the culprit? (See the end of examples.jl)

willtebbutt · 2021-09-05T13:01:42Z

Turns out that the above problem was related to me not specifying that the rule in question was restricted to a CommutativeMulNumber. The example in question now works.

I've also added some examples with Kronecker.jl.

willtebbutt · 2021-09-08T10:12:54Z

Adding extra tests for ScaledMatrix. Interestingly, it appears that the abstractions don't always work in this case -- seems to be that it's because ScaledMatrix is over-parametrised.

For under-parametrised matrices I'm fairly confident that we can think of the Jacobian of restructure as a left-inverse of the Jacobian of destructure (left-inverse should always exist I believe). This doesn't work for over-parametrised types because the Jacobian of destructure maps from a larger number of dims to a smaller number of dims, so doesn't admit a left-inverse (to see this, think about the rank of a matrix product of the form (high_dim, low_dim) * (low_dim, high_dim) vs the rank of the identity matrix of size (high_dim, high_dim)).

Not sure if there's a way to fix this. Doesn't seem to be an issue when either

the pullback of restructure isn't needed, because the function doesn't output a ScaledMatrix, or
the function is composed with another function which doesn't output a ScaledMatrix. Possibly this is essentially the same as the above?

Not sure what the geometric explanation of these last two points is. Presumably there's a nice one.

edit: possibly we can understand what's going on here by considering the pseudo-inverse of the Jacobian of destructure.

Assume that the Jacobian of restructure, J* is a pseudo-inverse for the Jacobian of destructure, J. If the J has dimensions (high_dim, low_dim) where high_dim > low_dim, then J* is a left-inverse for J.

Conversely, if J has dimensions (low_dim, high_dim) then there doesn't exist a left-inverse, but the pseudo-inverse should exist, and will satisfy v' J* J J* == v' J* and w' J J* J == w' J . I've confirmed that my implementation the pullback of destructure for ScaledMatrix satisfies these, but doesn't satisfy v' J* J == v'. i.e. it is a pseudo-inverse, but is not a left-inverse. This would also explain why if you have a function like

f(x::ScaledMatrix) = my_sum(my_scale(x))

where both my_sum and my_scale utilise natural pullbacks and my_scale just uses a scalar, so you don't see any problems. I imagine if I used a more general transformation that problems would appear, but in that case probably the output wouldn't be a ScaledMatrix, so maybe it would be fine?

edit2: There are a couple of ways that we could utilise this information for guiding rule-implementers. My inclination is just to suggest that type-authors avoid implementing pullback_of_restructure entirely if their type is overparametrised, and accept that they'll get access to most, but not all, generic rrules. The ones that they typically won't get access to are those for which they have specialised methods for anyway (otherwise it's hard to see how they would get their array type would be produced), and they can just opt out and ensure that these are differentiable.

edit3: a consequence of the left-inverse Jacoabian stuff (for under-parametrised matrices), is that the pullback of restructure is simply a right-inverse for the pullback of destructure, and the pushforward of restructure is a left-inverse for the pushforward of the pushforward of destructure. i.e

pushforward_restructure(pushforward_destructure(x)) == x
pullback_destructure(pullback_restructure(x)) == x

This gives a really nice way to derive what you need for restructure without thinking about how to actually implement restructure itself -- once you've found the pushforward / pullback for destructure, just find their left- and right-inverses respectively. Moreover, if these inverses don't exist, don't try to implement them, and accept that there are some kinds of rrules that you just can't use.

edit4: I've managed to show that, if you want a sensible definition of the natural pullback for the identity function, the pullback of restructure must be a right-inverse of the pullback of destructure. However, I've not managed to determine whether any right inverse will do. Would like to know if that's the case.

oxinabox · 2021-09-12T15:11:59Z

I continue to want to submit a workshop paper about this, to get feedback.
I have thus put together an intro and background.

In languages featuring polymorphism the the representation of the derivative of types that represent matrixes poses a challange.
Languages such as Julia allow both a primative dense matrix as implemented as an Matrix (i.e. Array{T,2}) which subtype AbstractMatrix, and rich user extensible types as implemented by structures (classes) which also subtype AbstractMatrix.
This poses a problem as they now intrinstically have at least two possible representations of their (co-)tangents.
Firstly as structual tangent, based on them being structs which stores the tangents to each field.
Secondly, an natural tangent which is an AbstractMatrix which reflects the nature of the type as already respensting a vector space.
As we will discuss both of these representations are useful, and one can not reasonably require a system to just support one.
The challange thus exists, how to convert between these two representations.
The core insight of this paper is that the conversion operation is exactly the pullback of the collect operation that changes on of these rich struct-based AbstractMatrixs into a primative dense Matrix.

We can very simply define the idea of a AbstractMatrix subtype as something implementing the following interface:

eachindex(A) returning all indexes within the matrix A.
getindex(A,i) returning the element of A at index i

From this simple interface, all other expected functionality AbstractMatrix can be implemented.
Though additional specialization might be added to allow for improvements in speed or numerical accuracy, depending on the algorithm in question.

The collect operation which converts any AbstractMatrix subtype into a dense Matrix, can be implemented by cycling though all indices with eachindex and setting the matching index in a dense Matrix to match what getindex on the AbstractMatrix returns.
Though many AbstractMatrix types implement more specialized and faster methods to get the same result.

Background: Polymorphic AbstractMatrix subtype

These struct-based AbstractMatrix types are very useful.
They can massively reduce the memory that must be stored.
For example:

OneHot Matrix which represents a column-vector with only one nonzero element, which is set to 1.
- It only has to store 1 index and the size.
- getindex returns 0 for all indexes except the stored one.
Diagonal which represents a diagonal matrix
- It only has to store a vector on the diagonal.
- getindex returns 0 if asked for off diagonal ements
Identity which represents an identity matrix
- It has to store only the size.
- getindex returns 1 for on the diagonal and 0 if not.

They can be used to implement matrixes with structural sparisty allowing for algorithms with improved time complexity.
For example for matrix multiplication:

*(::Matrix, ::Matrix):
- multiply rows by columns and sum.
- $O(n^3)$ time
*(::Matrix, ::Diagonal) or *(::Diagonal, ::Dense):
- column-wise/row-wise scaling.
- $O(n^2)$ time.
*(::Matrix, ::OneHot):
- row-wise slicing.
- $O(n)$ time.
*(::Identity, ::Matrix) or *(::Matrix, ::Identity):
- no change.
- $O(1)$ time.

For example for sum of all elements:

sum(::Matrix):
- must iterate every element.
- $O(n^2)$ time
sum(::Diagonal)
- sum only the stored elements from the diagonal
- $O(n)$ time.
sum(::OneHot)
- return 1
- $O(1)$ time.
sum(::Identity)
- return the size of either axes
- $O(1)$ time.

Many other examples exist.
From block-sparse matrixes where blocks are zero, to symmetric and skew symmetric matrix, to fill matrixes always containing the same element.
Yet another use of these polymorphic matrix types is to carry metadata, such are axes names {CITE TensorsConsideredHarmful}.
These rich extensible matrix types are very useful for scientific computing.

Background: When do we want each tangent type

Structual tangents have two key uses, in addition to generally being a useful representation.
getfield, and constructors.

The pullback of the getfield operation must, return a structual tangent: in particular it is one that has propagated the co-tangent for the result, to be for that particular field.
getfield is not an array operation, it is an operation on structs.
It thus has no predetermined result for how it would be represented in a natural tangent.
getfield is a crucial operation to be able to differentiate though, as almost all AbstractMatrix subtypes will use it in their implementation of other operations, including getindex.

The constructor's pullback is naturally implemented in terms of the structural tangent.
Here we consider the "default construtor" which takes as it's inputs the values for each field, and returns a object of the type (rather than a overloaded constructor that might not, but which will eventally call the "default constructor").
The pullback for the constructor needs to take the co-tangent for the whole object and break it up into the tangents for each of the fields which it was input.
Which is trivially done on a structural tangent as it carried a tangent for each field.

However, a key limitation of the structural tangent is that it does not implement getindex.
This is incontrast to the natural tangent which being another AbstractArray type does.
Automatic Differentiation, dispite some claims, can not handle all code correctly or efficiently in practice.
For some algorithms a manually specified rule must be used, or can be used to give substantial speed-up.
Three examples are:

Derivatives though optimization problems, where the implict function theorem is used to determine the derivative
Derivatives through differentential equation solves, which use a variety of differential equation sensitivity methods; which often involve solving an exended set of equations for both the result and it's derivative.
Fast Fourier transforms, where the pullback is another kind of fast fourier transform, and so can also make use of the carefully optimized libraries.

All these examples require a co-tangent type that is natural and supports the operations an array does.

notes.md

mzgubic · 2021-09-14T11:52:13Z

notes.md

+
+`destructure` is quite straightforward to define -- essentially equivalent to `collect`. I'm confident that this is always going to be simple to define, because `collect` is always easy to define.
+
+`Restructure(C)(C_dense)` is a bit trickier. It's the function which takes an `Array` `C_dense` and transforms it into `C`. This feels like a slightly odd thing to do, since we already have `C`, but it's necessary to already know what `C` is in order to construct this function in general -- for example, over-parametrised matrices require this (see the `ScaledMatrix` example in the tests / examples). I'm _reasonably_ confident that this is always going to be possible to define, but I might have missed something.


I don't know how to do this for e.g. Woodbury, but I think you mentioned you thought about this and know how to do it?

Similarly, two different structured matrices might map to the same dense matrix. In that case I suppose it doesn't matter which structured matrix we get for the correctness?

I don't know how to do this for e.g. Woodbury, but I think you mentioned you thought about this and know how to do it?

The more I think about it, the less convinced I am that I know how to do this. I think this section on Restructure needs re-working / updating to reflect the increased level of uncertainty I have about how easy this is to do. For example, the ScaledMatrix example I mention is less straightforward than I originally thought.

Similarly, two different structured matrices might map to the same dense matrix. In that case I suppose it doesn't matter which structured matrix we get for the correctness?

Yeah, I think that's totally fine. But is this a destructure thing rather than Restructure?

notes.md

mcabbott · 2021-11-03T23:51:17Z

This business of how to map between "natural" and "structural" representations was bugging me, but I now think it's actually very simple:

Think of the forward pass as maps W(fields) -> S(subspace) -> M(matrices). The first is roughly the contructor, S is submanifold of the space of all matrices, and the second map just embeds that. The pullback of this whole chain always exists, it's what is called pullback_of_destructure here. It takes a cotangent in T_x^*M and gives one in T_w^*W, i.e. a Tangent. And it's easy to make AD do this for us, it's what Zygote.pullback(collect, Struct(fields...)) will do if you avoid enough generic rules.

The pullback of only the second map also always exists; it's a linear map from T_x^*M to T_x^*S. The result of this is what we mean by a natural representation. There is no freedom. There is also no guarantee that that this will be something you can represent by the same struct. (What that can represent might not even be a vector space, e.g. rotation matrices.)

The apparent ambiguity described here in how to pick the natural for Fill is just because one of them isn't in T_x^*S. The ambiguity in ScaledVector is just becuase there's a fixed vector to describe, for which this struct allows various objects which are ==; but in this case there's no advantage to wrapping that vector in this struct: S is the full space.

In general pullback_of_destructure maps into a subspace of T_w^*W. There's a bijection from this subspace to the natural representation. But I don't think this has to be cheap. For instance the one for a lazy kron seems to need \. This bijection can be extended to a map from the whole T_w^*W to T_x^*S. I think the unique way to do that is to demand it be unaffected by the perpendicular directions. For Symmetric, this means demanding that it not depend on the lower-triangle of the Tangent{Symmetric}(...).data (which is always zero anyway if this comes from pullback_of_destructure).

In this PR's description, I remain a bit confused by what the map Restructure does. What defines it, or what properties constrain it? Does it act on the manifold or the cotangent space? But pullback_of_restructure looks like it might be the map from T_w^*W to T_x^*S.

willtebbutt · 2021-11-04T09:48:35Z

The pullback of this whole chain always exists, it's what is called pullback_of_destructure here. It takes a cotangent in T_x^*M and gives one in T_w^*W, i.e. a Tangent. And it's easy to make AD do this for us, it's what Zygote.pullback(collect, Struct(fields...)) will do if you avoid enough generic rules.

Exactly. Roughly speaking, I had been thinking about this as

pushforward map of collect tells us how to convert structural tangent -> natural tangent
pullback map of collect tells us how to convert natural cotangent -> structural cotangent

I think it's basically fine to think about collect and destructure as being the same thing. It's just that we might want (the pushforward map associated with)destructure to produce a CuArray sometimes, rather than an Array, which we don't get to do if we use collect.

I'm struggling a little to see exactly what you mean by the intermediate step though. In particular, how are you thinking about S(subspace)? In my mental model, only W(fields) and M(matrices) exist, so I think we're currently thinking about this slightly differently.

In this PR's description, I remain a bit confused by what the map Restructure does. What defines it, or what properties constrain it? Does it act on the manifold or the cotangent space? But pullback_of_restructure looks like it might be the map from T_w^*W to T_x^*S.

It acts on the manifold. Honestly though, I'm really not sure that it always exists, and I'm still not 100% sure what its semantics are.
The intention was to make restructure say "I know that the original forwards pass produces Diagonal. So the modified forwards-pass, in which we use Matrixs will have to produce a Matrix which is diagonal, so I'm free to convert it into a Diagonal".

It's probably easiest to ignore this for now, because it's a bit of a pain and I'm not really even sure how to derive this function in general (it could be something ugly and nonlinear, and I'm not entirely confident that it's uniquely defined). I'm confident that what's left is unambiguous, even if we're potentially missing a bit of functionality, so let's stick with that. We can revisit restructuring later if we like.

willtebbutt · 2021-11-04T09:55:53Z

(Wrote this before I saw your doc on slack -- will try to read today)

mcabbott · 2021-11-04T14:10:54Z

to produce a CuArray sometimes

I guess we never actually want to produce this, for converting natural to structural. We want just pullback(collect, ...)[2] without making the [1] at all. So we could just call pullback(getindex, A, i)[2](dA[i]) a bunch of times. The two obstacles right now are (a) that this will hit a generic rule, and (b) that it'll allocate about N^2 for a length N array.

For (a) I wonder a bit if we want a call-site opt-out mechanism. Like rrule_via_ad(cfg, getindex, A, i; ignore=MyType) or something? Or would all types on which we want to perform natural-to-structural globally opt out of the getindex rule? (In Zygote right now, it's an @adjoint rule not an rrule, but that could change.)

For (b), I guess we need in-place accumulation, either by something like FluxML/Zygote.jl#981 or by rrule(::getindex, ::Array, ...) making an InplaceThunk. Or, perhaps you could work around this and write a function which takes x::MyType, makes an all-zero Tangent{MyType}, and then writes into that. It knows it's going to call getindex everywhere, no need to handle the result of each separately... it's a bit like map.

And, obstacle (c), it's difficult to see this working well at all for CuArrays. I guess that's like most generic fallbacks. A hand-written uncollect would often involve array-level operations and thus work fine.

willtebbutt added 14 commits August 22, 2021 10:15

Upgrade FiniteDifferences

2d70966

Initial implementation

f530868

Some more work

31f67d3

Add testset comment

4b554c5

Examples and notes

a51abe9

Some work

8205ce6

Tidy up examples

bccece2

Tidy up PR notes

569f064

Provide optimised pullback implementations

f321a97

Change pullback implementation specifications

8a06561

Add extra methods to avoid config

72fc725

Fix typo

fbaaf86

Tweak comment in examples

313590d

Update notes

321fceb

willtebbutt changed the title ~~A General Recipe for Generic Rules and Natural Tangents~~ A General Recipe for Generic Rules and Natural Tangents (hopefully...) Aug 28, 2021

willtebbutt changed the title ~~A General Recipe for Generic Rules and Natural Tangents (hopefully...)~~ RFC: A General Recipe for Generic Rules and Natural Tangents (hopefully...) Aug 28, 2021

Tweak notes

a15f469

willtebbutt added 11 commits August 28, 2021 22:59

Fix typo:

22efec3

Tweak notes

2b0853e

Tweak notes

73a5fb7

Tweak notes

ddb3237

Clarify notes

34e43d6

Clarify notes

c773045

Clarify notes

eaaeba7

Tweak notes

d94514d

Tweak notes

d571ad1

Tweak examples comments

d87d5c4

Tidy up + add Fill example

a7eb01c

Add SArray example

e691b6a

willtebbutt added 3 commits August 29, 2021 12:45

Add note on Symmetric restructure

d0b0eb0

Add UpperTriangular example with my_mul

d6eef62

Add WoodburyPDMat example

a4ce7d2

willtebbutt added 3 commits September 1, 2021 23:25

Note that natural tangent addition is fine

a880752

Add another examples comment

6cada13

Add Kronecker example and fix WoodburPDMat example

8f06364

Add note on side-effects

5647388

willtebbutt mentioned this pull request Sep 6, 2021

Making Testing More Automatic JuliaDiff/ChainRulesTestUtils.jl#213

Open

mzgubic reviewed Sep 14, 2021

View reviewed changes

willtebbutt added 2 commits September 14, 2021 17:01

Update notes.md

6d9d01f

Update notes.md

20f7727

oxinabox mentioned this pull request Sep 15, 2021

Allow ProjectTo on non-differential types? #442

Closed

AlexRobson mentioned this pull request Sep 21, 2021

WIP: Add ChainRules invenia/PDMatsExtras.jl#21

Closed

ToucheSir mentioned this pull request Oct 12, 2021

Does functor have the right semantics for Flux? FluxML/Functors.jl#49

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: A General Recipe for Generic Rules and Natural Tangents (hopefully...) #449

RFC: A General Recipe for Generic Rules and Natural Tangents (hopefully...) #449

willtebbutt commented Aug 28, 2021 •

edited

Loading

codecov-commenter commented Aug 28, 2021 •

edited

Loading

willtebbutt commented Aug 29, 2021

willtebbutt commented Aug 29, 2021

willtebbutt commented Aug 31, 2021

willtebbutt commented Sep 5, 2021

willtebbutt commented Sep 8, 2021 •

edited

Loading

oxinabox commented Sep 12, 2021

mzgubic Sep 14, 2021

willtebbutt Sep 14, 2021

mcabbott commented Nov 3, 2021

willtebbutt commented Nov 4, 2021 •

edited

Loading

willtebbutt commented Nov 4, 2021

mcabbott commented Nov 4, 2021


		`destructure` is quite straightforward to define -- essentially equivalent to `collect`. I'm confident that this is always going to be simple to define, because `collect` is always easy to define.

		`Restructure(C)(C_dense)` is a bit trickier. It's the function which takes an `Array` `C_dense` and transforms it into `C`. This feels like a slightly odd thing to do, since we already have `C`, but it's necessary to already know what `C` is in order to construct this function in general -- for example, over-parametrised matrices require this (see the `ScaledMatrix` example in the tests / examples). I'm _reasonably_ confident that this is always going to be possible to define, but I might have missed something.

RFC: A General Recipe for Generic Rules and Natural Tangents (hopefully...) #449

Are you sure you want to change the base?

RFC: A General Recipe for Generic Rules and Natural Tangents (hopefully...) #449

Conversation

willtebbutt commented Aug 28, 2021 • edited Loading

codecov-commenter commented Aug 28, 2021 • edited Loading

Codecov Report

willtebbutt commented Aug 29, 2021

willtebbutt commented Aug 29, 2021

willtebbutt commented Aug 31, 2021

willtebbutt commented Sep 5, 2021

willtebbutt commented Sep 8, 2021 • edited Loading

oxinabox commented Sep 12, 2021

Background: Polymorphic AbstractMatrix subtype

Background: When do we want each tangent type

mzgubic Sep 14, 2021

Choose a reason for hiding this comment

willtebbutt Sep 14, 2021

Choose a reason for hiding this comment

mcabbott commented Nov 3, 2021

willtebbutt commented Nov 4, 2021 • edited Loading

willtebbutt commented Nov 4, 2021

mcabbott commented Nov 4, 2021

willtebbutt commented Aug 28, 2021 •

edited

Loading

codecov-commenter commented Aug 28, 2021 •

edited

Loading

willtebbutt commented Sep 8, 2021 •

edited

Loading

willtebbutt commented Nov 4, 2021 •

edited

Loading