Reuse the analysis in SpSV and SpSM #14
Conversation
François and Michel, I updated the file. Note that we must allocate two buffers instead of one now.
Thank you for taking care of this. A workaround might be calling the analysis phase again right after calling the function.
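For illustration, a minimal sketch of that workaround, assuming SpSM descriptors and a pre-allocated buffer along the lines of the code in this PR (all identifiers below are placeholders, not the actual names):

```julia
# Hedged sketch of the workaround: after the numeric refactorization has
# changed the matrix values, re-run the SpSM analysis so the descriptor's
# internal state (including its buffer pointer) is refreshed before the
# next solve. All identifiers are assumed placeholders.
CUSPARSE.cusparseSpSM_analysis(
    CUSPARSE.handle(), transa, 'N', Ref{T}(one(T)), descL, descB, descC, T, algo, spsm_info, buffer_L,
)
```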
I also did a quick benchmark on the instances we are interested in:

```julia
using LinearAlgebra
using CUDA
using CUDA.CUSPARSE
using MatrixMarket
using CUSOLVERRF

Gx = mmread("Gx_case9241pegase.mtx")
Gu = mmread("Gu_case9241pegase.mtx")
nx = size(Gx, 1)
Gd = CuSparseMatrixCSR(Gx)

nrhs = 256
X = zeros(nx, nrhs)
X .= Gu[:, 1:nrhs]
Xd = CuMatrix(X)
Yd = CUDA.zeros(Float64, nx, nrhs)

rf = CUSOLVERRF.RFLU(Gd; symbolic=:KLU, nrhs=nrhs)
CUDA.@time ldiv!(Yd, rf, Xd)
CUDA.@time ldiv!(Yd, rf', Xd)
```
Using CUDA runtime 11.8, we get:
Using CUDA runtime 12.3, we have with this PR:
so we should expect a factor of 3 in performance using the new CUDA API.
If we call the analysis phase again after the function
Did you permute the versions 11.8 and 12.3, François, or is it really 3 times slower with the new API?
I would suggest doing that until NVIDIA provides a function to update the SpSM descriptor. Would you be OK with doing that?
I just permuted the versions 11.8 and 12.3; it has nothing to do with this PR. If I detail the timings, I observe that the slowdown comes from the new function.
Yes, I'm on the train to Argonne right now, but I can do that after our meeting this afternoon.
I highly suspect that this factor of 3 comes from the in-place version of the new API. I suppose they added an unoptimized in-place version so as not to break code that was using the old API.
Sure, I don't think we are in a rush to solve this issue.
Interesting, that would make sense indeed. That would advocate for developing an out-of-place triangular solve in CUSOLVERRF.
@frapac

```julia
using LinearAlgebra
using CUDA
using CUDA.CUSPARSE
using MatrixMarket
using CUSOLVERRF

Gx = mmread("Gx_case_ACTIVSg70k.mtx")
Gu = mmread("Gu_case_ACTIVSg70k.mtx")
nx = size(Gx, 1)
Gd = CuSparseMatrixCSR(Gx)

nrhs = 256
X = zeros(nx, nrhs)
X .= Gu[:, 1:nrhs]
Xd = CuMatrix(X)
Yd = CUDA.zeros(Float64, nx, nrhs)

CUDA.@profile ldiv!(Xd, UpperTriangular(Gd), Yd)
```

CUDA v11.8:

CUDA v12.3:
This looks good to me. Before merging this, we need to discuss where to update the analysis for CuSparseSM. I am not sure if it is better to do that in src/backsolve.jl (as implemented right now) or directly in src/interface.jl when calling the function lu! (see comment below).
```julia
CUSPARSE.cusparseSpSV_analysis(
    CUSPARSE.handle(), transa, Ref{T}(alpha), descL, descX, descX, T, algo, spsv_L, buffer_L,
)
CUSPARSE.cusparseSpSV_analysis(
```
I am not sure I understand why we need two buffers. Are buffer_L and buffer_U used implicitly in cusparseSpSV_solve?
Yes, they are used implicitly by cusparseSpSV_solve. cusparseSpSV_analysis stores the buffer pointer in the SpSV descriptor, which is why each descriptor needs its own buffer.
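For reference, a minimal sketch of the matching solve phase under the same names as the excerpt above: cusparseSpSV_solve takes no buffer argument, so the buffer recorded during analysis must stay alive for as long as the descriptor is used.

```julia
# Hedged sketch: the solve reads the buffer pointer that the analysis phase
# stored inside spsv_L; buffer_L must therefore not be freed or reused
# while spsv_L is still in use.
CUSPARSE.cusparseSpSV_solve(
    CUSPARSE.handle(), transa, Ref{T}(alpha), descL, descX, descX, T, algo, spsv_L,
)
```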
src/backsolve.jl (Outdated)
```julia
chktrans(transa)

# CUDA 12 is required for inplace computation in cusparseSpSM
@assert CUDA.runtime_version() ≥ v"12.0"
```
Maybe update this to v"12.3", as that version is required later?
👍
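For reference, a minimal version of the suggested change:

```julia
# Guard with the version bumped as suggested in the review, so the check
# matches the runtime actually required by the code below.
@assert CUDA.runtime_version() ≥ v"12.3"
```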
src/backsolve.jl (Outdated)
```julia
    [(s.descU, s.infoU), (s.descL, s.infoL)]
end

operations = s.transa == 'N' ? [(s.descL, s.infoL), (s.descU, s.infoU)] : [(s.descU, s.infoU), (s.descL, s.infoL)]
```
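For context, a hedged sketch of how these (matrix descriptor, SpSM info) pairs might be consumed downstream (descB, algo, and T are assumed placeholders): for A = LU, solving Ax = b applies L first and then U, while the transposed system reverses the order, which is what the ternary above encodes.

```julia
# Hedged sketch, not the PR's exact code: run the two triangular solves in
# the order selected above; the in-place call passes descB as both input
# and output.
for (desc, info) in operations
    CUSPARSE.cusparseSpSM_solve(
        CUSPARSE.handle(), s.transa, 'N', Ref{T}(one(T)), desc, descB, descB, T, algo, info,
    )
end
```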
I guess the interface remains the same as before for CuSparseSM. How about using cusparseSpSM_solve in that function, and updating the analysis separately in the RFLU interface?
https://github.com/exanauts/CUSOLVERRF.jl/blob/master/src/interface.jl#L85
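A hedged sketch of that proposal (function and field names are assumptions for illustration, not the package's actual API): the numeric refactorization entry point would refresh the SpSM analysis, leaving the backsolve to run only the solve phase.

```julia
# Hypothetical sketch of the proposed split; rf_refactor! and spsm_analysis!
# are placeholder names, not CUSOLVERRF's documented API.
using LinearAlgebra, CUDA.CUSPARSE, CUSOLVERRF

function LinearAlgebra.lu!(rf::CUSOLVERRF.RFLU, A::CuSparseMatrixCSR)
    rf_refactor!(rf, A)   # refresh the numerical values of the factors
    spsm_analysis!(rf)    # hypothetical: re-run the SpSM analysis here,
                          # so the backsolve only performs the solve phase
    return rf
end
```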
Do you call SpSM multiple times without updating the rf structure? In that case we can reuse the buffer, but otherwise we gain nothing by moving the computation of the SpSM analysis.
Yes, we call SpSM multiple times inside a for loop (on large instances we cannot instantiate a buffer large enough).
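For illustration, a hedged sketch of that pattern, reusing the names from the benchmarks above (B is an assumed CPU matrix of right-hand sides whose column count is a multiple of nrhs):

```julia
# Sketch of the loop described above: rf is not updated between iterations,
# so the SpSM analysis performed at setup is reused by every solve.
for cols in Iterators.partition(1:size(B, 2), nrhs)
    copyto!(Xd, B[:, cols])   # upload the next block of right-hand sides
    ldiv!(Yd, rf, Xd)         # triangular solves reuse the existing analysis
end
```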
I got an answer from NVIDIA:
This PR looks good to merge once we have addressed the update of the analysis for CuSparseSM.
It should be good now, François.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
close #11
@michel2323 @frapac