
Unexpected behavior with np.nan only initialized vectors #138

Closed
a-monsch opened this issue Aug 27, 2021 · 3 comments · Fixed by #139

Comments

a-monsch commented Aug 27, 2021

I am currently using this package together with pandas for study exercises in particle physics and have encountered some unexpected behavior. To keep my data structure homogeneous, I fill missing values (particles) with np.nan. Consequently, it often happens that a vector obj/array/... is initialized entirely with np.nan. This works perfectly well, in particular for property queries, except for pseudorapidity, rapidity, and related quantities.

A small example:

import vector
import numpy as np

# initially, this part is stored inside a pandas DataFrame
v = np.array([[4.0, 3.0, 2.0, 1.0], [np.nan, np.nan, np.nan, np.nan]]).view(
    [("E", float), ("px", float), ("py", float), ("pz", float)]
).view(vector.MomentumNumpy4D)


# this creates a [True, False] mask; as expected 
mask1 = np.abs(v.rapidity) < 2.5   

# This creates a [True, True] mask, since the second value is not np.nan but 0.0
mask2 = np.abs(v.pseudorapidity) < 2.5  

My expectation would be that any quantity of a vector initialized entirely with np.nan should likewise return np.nan, not an actual number.

Is this behavior intended, especially in the case described above?

P.S.: I know that vector.awk makes it possible to work with inhomogeneous data structures, but our workgroup decided against that in favor of an approach using pandas.
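As an interim workaround (a plain-NumPy sketch, not part of vector's API), the nan rows can be masked explicitly before applying the cut:

```python
import numpy as np

# structured array standing in for the MomentumNumpy4D example above
arr = np.array(
    [(4.0, 3.0, 2.0, 1.0), (np.nan, np.nan, np.nan, np.nan)],
    dtype=[("E", float), ("px", float), ("py", float), ("pz", float)],
)

# pseudorapidity computed directly; an all-nan row propagates nan here
with np.errstate(invalid="ignore"):
    eta = np.arctanh(
        arr["pz"] / np.sqrt(arr["px"] ** 2 + arr["py"] ** 2 + arr["pz"] ** 2)
    )

# combining the cut with an explicit nan check yields the expected mask
mask = ~np.isnan(eta) & (np.abs(eta) < 2.5)
print(mask)  # [ True False]
```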

jpivarski (Member) commented

The "zero vs nan" decisions are determined partly by the coordinate system—if you're in cylindrical coordinates, for example, a nan in phi won't affect a calculation of (pseudo)rapidity—and partly by what ROOT's ROOT::Math::LorentzVector does.

Since you're in Cartesian coordinates, the relevant calculator for pseudorapidity is

def xy_z(lib, x, y, z):
    return lib.nan_to_num(lib.arctanh(z / lib.sqrt(x ** 2 + y ** 2 + z ** 2)), nan=0.0)

and yes, that deliberately replaces nan with zero.
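Stepping through it in plain NumPy shows where the nan gets swallowed:

```python
import numpy as np

# for an all-nan vector, the arctanh argument is nan, so the raw result is nan...
x = y = z = np.nan
raw = np.arctanh(z / np.sqrt(x ** 2 + y ** 2 + z ** 2))

# ...and nan_to_num maps it to 0.0, which is why the reported mask is [True, True]
print(np.isnan(raw))                # True
print(np.nan_to_num(raw, nan=0.0))  # 0.0
```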

Looking again at ROOT's implementation, ours is not quite right, since they would have an all-nan vector return nan. ROOT either calculates it from rho and z:

https://github.com/root-project/root/blob/30ab704cbe645d2b2f36cb071905e462771df386/math/genvector/inc/Math/GenVector/eta.h#L47-L77

or from theta:

https://github.com/root-project/root/blob/30ab704cbe645d2b2f36cb071905e462771df386/math/genvector/inc/Math/GenVector/eta.h#L84-L96

What we were trying to avoid with the nan_to_num was the case in which 0/0 would return nan. Empirically, we can see that ROOT makes that substitution:

>>> ROOT.Math.XYZVector(0, 0, 0).Eta()
0.0
>>> ROOT.Math.PxPyPzEVector(0, 0, 0, 0).Eta()
0.0

Let's explore this more:

>>> ROOT.Math.XYZVector(0, 0, np.nan).Eta()
nan
>>> ROOT.Math.XYZVector(0, np.nan, 0).Eta()
0.0
>>> ROOT.Math.XYZVector(np.nan, 0, 0).Eta()
0.0
>>> ROOT.Math.PxPyPzEVector(0, 0, 0, np.nan).Eta()
0.0
>>> ROOT.Math.PxPyPzEVector(0, 0, np.nan, 0).Eta()
nan
>>> ROOT.Math.PxPyPzEVector(0, np.nan, 0, 0).Eta()
0.0
>>> ROOT.Math.PxPyPzEVector(np.nan, 0, 0, 0).Eta()
0.0

Only a nan in the z coordinate results in the expression being nan.

We have an additional constraint that our formulas need to be pure expressions (i.e. no "if" statements) since they need to be vectorizable in all current and future backends. I suppose we could get the "z is nan ⇒ output is nan" behavior back by multiplying our expression

def xy_z(lib, x, y, z):
    return lib.nan_to_num(lib.arctanh(z / lib.sqrt(x ** 2 + y ** 2 + z ** 2)), nan=0.0)

by lib.absolute(lib.sign(z)):

>>> z = 0; np.absolute(np.sign(z))
0
>>> z = 1e-16; np.absolute(np.sign(z))
1.0
>>> z = -1e-16; np.absolute(np.sign(z))
1.0
>>> z = 1e16; np.absolute(np.sign(z))
1.0
>>> z = -1e16; np.absolute(np.sign(z))
1.0
>>> z = np.nan; np.absolute(np.sign(z))
nan
>>> z = np.inf; np.absolute(np.sign(z))
1.0
>>> z = -np.inf; np.absolute(np.sign(z))
1.0

There might be other cases where this is needed; we'd need to do a general survey. I think that's what #36 is.

a-monsch (Author) commented

Thank you very much for the detailed and fast answer.

Judging by #36, you always compare against ROOT. Originally that was not even my point—it makes no sense to calculate anything at all from a vector that contains only np.nan—but it is nice to see that ROOT actually behaves as expected, although it only takes the z and phi components into account, as you pointed out.

The lib.absolute(lib.sign(z)) solution would fix the problem. Will it be implemented in this or a similar form in one of the next versions, or will it take longer, given that the TODO list in #36 for the 4D vectors has not been addressed yet, as far as I can see?

jpivarski linked a pull request Aug 30, 2021 that will close this issue

jpivarski (Member) commented

I'm implementing the lib.absolute(lib.sign(z)) solution now. PR #36 is waylaid because the author had to work on other things. We won't get complete parity with ROOT until those tests are added, but we can fix this one that you noticed now: it will be implemented in PR #139.
