
Reuse the analysis in SpSV and SpSM #14

Merged (12 commits) on Nov 15, 2023

Conversation

@amontoison (Member Author) commented Nov 9, 2023

François and Michel, I updated the file backsolve.jl so that we can reuse the analysis for multiple triangular solves.
It works well in all cases as long as we don't update the sparse triangular matrices and only update the right-hand side(s).
If the sparse triangular matrices are also updated, which is the case here because we overwrite the matrix that contains the factors of the LU decomposition, we need to update the SpSV and SpSM descriptors.
NVIDIA provides a routine to do that, but only for the SpSV descriptor...
I will send an email to Yuki about it now.

Note that we now must allocate two buffers instead of one: one dedicated to the forward sweep and another to the backward sweep.
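For illustration, a minimal sketch of the two analysis calls, one per sweep, following the low-level call pattern used in backsolve.jl. The names transa, alpha, descL, descU, descX, algo, spsv_L, spsv_U, buffer_L, and buffer_U are assumptions standing in for the fields set up in the file, not the exact variables of this PR:

# Sketch only: each sweep gets its own analysis call and its own buffer, so both
# descriptors can be reused for subsequent triangular solves.
CUSPARSE.cusparseSpSV_analysis(
    CUSPARSE.handle(), transa, Ref{T}(alpha), descL, descX, descX, T, algo, spsv_L, buffer_L,
)
CUSPARSE.cusparseSpSV_analysis(
    CUSPARSE.handle(), transa, Ref{T}(alpha), descU, descX, descX, T, algo, spsv_U, buffer_U,
)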

@amontoison (Member Author):

JuliaGPU/CUDA.jl#2159

@amontoison amontoison changed the title from "Spsv spsm" to "Reuse the analysis in SpSV and SpSM" on Nov 9, 2023
@frapac (Member) commented Nov 9, 2023

Thank you for taking care of this.
The modifications look good to me. Are the tests passing locally on your side? I am facing the same issue as the one reported by the CI on my local computer: apparently, the results returned by RFLU are wrong if we update the values of the factorized matrix A internally. I guess this is expected, according to your comment.

A workaround might be to call the analysis phase again right after calling the function rf_refactor!. Do you think that would work?

@frapac (Member) commented Nov 9, 2023

I also did a quick benchmark on the instances we are interested in:

using LinearAlgebra
using CUDA
using CUDA.CUSPARSE

using MatrixMarket
using CUSOLVERRF

Gx = mmread("Gx_case9241pegase.mtx")
Gu = mmread("Gu_case9241pegase.mtx")
nx = size(Gx, 1)

Gd = CuSparseMatrixCSR(Gx)

nrhs = 256
X = zeros(nx, nrhs)
X .= Gu[:, 1:nrhs]

Xd = CuMatrix(X)
Yd = CUDA.zeros(Float64, nx, nrhs)

rf = CUSOLVERRF.RFLU(Gd; symbolic=:KLU, nrhs=nrhs)
CUDA.@time ldiv!(Yd, rf, Xd)
CUDA.@time ldiv!(Yd, rf', Xd)

Using CUDA runtime 11.8, we get:

  0.011515 seconds (121 CPU allocations: 3.078 KiB) (2 GPU allocations: 2.109 KiB, 0.18% memmgmt time)                                                                                                                                         
  0.013541 seconds (122 CPU allocations: 3.359 KiB) (2 GPU allocations: 2.109 KiB, 0.18% memmgmt time)   

Using CUDA runtime 12.3 with this PR, we have:

  0.034388 seconds (112 CPU allocations: 2.828 KiB) (2 GPU allocations: 2.109 KiB, 0.04% memmgmt time)
  0.036004 seconds (113 CPU allocations: 3.125 KiB) (2 GPU allocations: 2.109 KiB, 0.06% memmgmt time)

so we should expect a factor of 3 difference in performance with the new CUDA API.

@amontoison (Member Author):

If we call the analysis phase again after the function rf_refactor!, it will work.
Do you suggest that we should just update SpSV for now?

@amontoison (Member Author) commented Nov 9, 2023

> I also did a quick benchmark on the instances we are interested in: [...] so we should expect a factor of 3 difference in performance with the new CUDA API.

Did you permute the versions 11.8 and 12.3, François, or is it really 3 times slower with the new API?

@frapac (Member) commented Nov 9, 2023

> If we call the analysis phase again after the function rf_refactor!, it will work.
> Do you suggest that we should just update SpSV for now?

I would suggest doing that until NVIDIA provides a function to update the SpSM descriptor. Would you be OK with doing that?

> Did you permute the versions 11.8 and 12.3, François, or is it really 3 times slower with the new API?

I just permuted the versions 11.8 and 12.3; it has nothing to do with this PR. If I detail the timings, I observe that the new function cusparseSpSM_solve is indeed 3x slower than the previous one.

@amontoison (Member Author) commented Nov 9, 2023

> > If we call the analysis phase again after the function rf_refactor!, it will work.
> > Do you suggest that we should just update SpSV for now?
>
> I would suggest doing that until NVIDIA provides a function to update the SpSM descriptor. Would you be OK with doing that?

Yes, I'm on the train to Argonne right now, but I can do that after our meeting this afternoon.

> > Did you permute the versions 11.8 and 12.3, François, or is it really 3 times slower with the new API?
>
> I just permuted the versions 11.8 and 12.3; it has nothing to do with this PR. If I detail the timings, I observe that the new function cusparseSpSM_solve is indeed 3x slower than the previous one.

I highly suspect that this factor of 3 comes from the in-place version of the new API.
We can do more operations in parallel with an out-of-place version, and the new API was probably designed for that.

I suppose they added an unoptimized in-place version so as not to break code that was using the old API.
That would explain why it was not added before CUDA 12.0.

@frapac (Member) commented Nov 9, 2023

Sure, I don't think we are in a rush to solve this issue.

> I suppose they added an unoptimized in-place version so as not to break code that was using the old API. That would explain why it was not added before CUDA 12.0.

Interesting, that would make sense indeed. That would advocate for developing an out-of-place triangular solve in CUSOLVERRF.

@amontoison (Member Author):

@frapac
I checked tonight, and we get the same performance whether the operation is in-place or not with the new API.
I forgot that we allocate a buffer, so in-place or not makes no difference in that case: we can do the same number of operations in parallel.
I also tested on a bigger problem, and we don't see the factor of three that you observed.

using LinearAlgebra
using CUDA
using CUDA.CUSPARSE

using MatrixMarket
using CUSOLVERRF

Gx = mmread("Gx_case_ACTIVSg70k.mtx")
Gu = mmread("Gu_case_ACTIVSg70k.mtx")
nx = size(Gx, 1)

Gd = CuSparseMatrixCSR(Gx)

nrhs = 256
X = zeros(nx, nrhs)
X .= Gu[:, 1:nrhs]

Xd = CuMatrix(X)
Yd = CUDA.zeros(Float64, nx, nrhs)
CUDA.@profile ldiv!(Xd, UpperTriangular(Gd), Yd)

CUDA v11.8:

Profiler ran for 3.52 ms, capturing 130 events.

Host-side activity: calling CUDA APIs took 1.53 ms (43.34% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   22.68% │ 798.94 µs │    11 │  72.63 µs │   6.68 µs │  418.9 µs │ cudaMemcpyAsync         │
│   13.12% │ 462.29 µs │     1 │ 462.29 µs │ 462.29 µs │ 462.29 µs │ cuMemcpyDtoDAsync       │
│    2.47% │  87.02 µs │    12 │   7.25 µs │   4.53 µs │  16.21 µs │ cudaLaunchKernel        │
│    1.94% │  68.43 µs │    11 │   6.22 µs │    3.1 µs │  24.08 µs │ cudaMemsetAsync         │
│    0.91% │  32.19 µs │     1 │  32.19 µs │  32.19 µs │  32.19 µs │ cuMemAllocFromPoolAsync │
│    0.54% │  19.07 µs │     8 │   2.38 µs │   1.91 µs │   5.48 µs │ cudaStreamSynchronize   │
│    0.44% │   15.5 µs │     3 │   5.17 µs │   1.91 µs │   7.63 µs │ cudaMallocAsync         │
│    0.30% │  10.73 µs │     1 │  10.73 µs │  10.73 µs │  10.73 µs │ cuMemFreeAsync          │
│    0.27% │   9.54 µs │     3 │   3.18 µs │   2.15 µs │   5.25 µs │ cudaFree                │
│    0.20% │   6.91 µs │     1 │   6.91 µs │   6.91 µs │   6.91 µs │ cudaDeviceSynchronize   │
│    0.17% │   5.96 µs │    16 │ 372.53 ns │    0.0 ns │   2.62 µs │ cudaGetLastError        │
│    0.01% │ 476.84 ns │     2 │ 238.42 ns │    0.0 ns │ 476.84 ns │ cuDeviceGet             │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 2.82 ms (80.08% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                                                                                               │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   36.41% │   1.28 ms │     1 │   1.28 ms │   1.28 ms │   1.28 ms │ _Z46csrsm2_solve_upper_nontranspose_byLevel_kernelIdLi7EEviiiPKT_PKiS4_PS0_iijPiS6_S2_S0_iS6_S6_i                  │
│   19.10% │ 672.82 µs │     2 │ 336.41 µs │ 333.55 µs │ 339.27 µs │ _Z45cusparse_transpose_readWrite_alignment_kernelIdLi1ELb0ELi6ELi4ELi4EEv23cusparseTransposeParamsIT_EPKS1_PS1_S4_ │
│    9.89% │ 348.57 µs │     2 │ 174.28 µs │   3.81 µs │ 344.75 µs │ [copy device to device memory]                                                                                     │
│    8.59% │ 302.55 µs │     6 │  50.43 µs │  41.72 µs │  54.84 µs │ _Z32stable_sort_by_key_domino_phase1ILi256ELi4EEviiiPiS0_S0_S0_S0_S0_                                              │
│    4.80% │ 169.28 µs │     1 │ 169.28 µs │ 169.28 µs │ 169.28 µs │ _Z41csrsm2_analysis_upper_nontranspose_kernelILi5ELi3EEviPKiS1_PiiS2_S2_S2_i                                       │
│    0.63% │  22.17 µs │    11 │   2.02 µs │   1.43 µs │   4.05 µs │ [set device memory]                                                                                                │
│    0.36% │  12.64 µs │     8 │   1.58 µs │   1.43 µs │   1.67 µs │ [copy device to pageable memory]                                                                                   │
│    0.11% │   3.81 µs │     1 │   3.81 µs │   3.81 µs │   3.81 µs │ _Z10set_identyILi128EEviPi                                                                                         │
│    0.11% │   3.81 µs │     1 │   3.81 µs │   3.81 µs │   3.81 µs │ _Z28stable_sort_by_key_stop_coreILi256ELi4EEviPiS0_                                                                │
│    0.08% │   2.86 µs │     2 │   1.43 µs │   1.43 µs │   1.43 µs │ [copy pageable to device memory]                                                                                   │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

CUDA v12.3:

Profiler ran for 3.39 ms, capturing 125 events.

Host-side activity: calling CUDA APIs took 715.26 µs (21.08% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                                   │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────────────┤
│   10.70% │ 363.11 µs │     4 │  90.78 µs │   4.05 µs │ 347.61 µs │ cudaMemsetAsync                                        │
│    5.54% │ 187.87 µs │     1 │ 187.87 µs │ 187.87 µs │ 187.87 µs │ cudaMemcpyAsync                                        │
│    2.64% │  89.65 µs │    12 │   7.47 µs │   4.53 µs │  16.93 µs │ cudaLaunchKernel                                       │
│    1.03% │  34.81 µs │     1 │  34.81 µs │  34.81 µs │  34.81 µs │ cuMemAllocFromPoolAsync                                │
│    0.27% │    9.3 µs │     1 │    9.3 µs │    9.3 µs │    9.3 µs │ cudaLaunchCooperativeKernel                            │
│    0.22% │   7.63 µs │     1 │   7.63 µs │   7.63 µs │   7.63 µs │ cuMemFreeAsync                                         │
│    0.16% │   5.48 µs │    48 │ 114.24 ns │    0.0 ns │ 476.84 ns │ cudaGetLastError                                       │
│    0.12% │   4.05 µs │     2 │   2.03 µs │ 476.84 ns │   3.58 µs │ cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.11% │   3.81 µs │     8 │ 476.84 ns │ 238.42 ns │   2.15 µs │ cudaGetDevice                                          │
│    0.08% │   2.86 µs │     2 │   1.43 µs │ 238.42 ns │   2.62 µs │ cudaDeviceGetAttribute                                 │
│    0.06% │   1.91 µs │     8 │ 238.42 ns │    0.0 ns │ 476.84 ns │ cudaPeekAtLastError                                    │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 2.83 ms (83.33% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                                                                                                                                  ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   54.52% │   1.85 ms │     1 │   1.85 ms │   1.85 ms │   1.85 ms │ _ZN8cusparse14spsm_v1_kernelIidLi8ELi1ELi4ELi16ELi32EEEvT_S1_llliPT0_PKS2_PKS1_S7_S5_S5_PjS8_S8_NS_12KernelCoeffsIS2_EEijj            ⋯
│   20.81% │  706.2 µs │     2 │  353.1 µs │ 333.55 µs │ 372.65 µs │ _Z45cusparse_transpose_readWrite_alignment_kernelIdiLi1ELb0ELi6ELi4ELi4EEv23cusparseTransposeParamsIT_EPKS1_PS1_S4_                   ⋯
│    5.19% │ 175.95 µs │     1 │ 175.95 µs │ 175.95 µs │ 175.95 µs │ _ZN8cusparse18find_colors_kernelIiLi256ELi4ELb0EEEvT_PKS1_S3_PS1_S4_PjS5_S5_S5_S5_S5_i18cusparseFillMode_t                            ⋯
│    1.10% │  37.19 µs │     1 │  37.19 µs │  37.19 µs │  37.19 µs │ _ZN8cusparse21create_perm_nt_kernelIidLi128ELi16EEEvT_PKS1_S3_PKT0_PjPS1_S8_PS4_S9_i18cusparseFillMode_t18cusparseDiagType_tb         ⋯
│    0.47% │  15.97 µs │     1 │  15.97 µs │  15.97 µs │  15.97 µs │ _ZN8cusparse28do_triangular_pattern_kernelIiLi128ELi4EEEvT_PKS1_S3_PS1_S4_i18cusparseFillMode_t                                       ⋯
│    0.41% │  13.83 µs │     1 │  13.83 µs │  13.83 µs │  13.83 µs │ _ZN8cusparse21number_of_deps_kernelIiLi128ELb0ELi4EEEvT_PKS1_S3_PjPS1_i18cusparseFillMode_t                                           ⋯
│    0.29% │   9.78 µs │     2 │   4.89 µs │   4.77 µs │   5.01 µs │ _ZN3cub53CUB_200101_500_520_600_610_700_750_800_860_890_900_NS16DeviceScanKernelINS0_16DeviceScanPolicyIiE9Policy600EPKiPiNS0_13ScanT ⋯
│    0.20% │   6.91 µs │     4 │   1.73 µs │ 953.67 ns │   2.38 µs │ [set device memory]                                                                                                                   ⋯
│    0.11% │   3.81 µs │     1 │   3.81 µs │   3.81 µs │   3.81 µs │ _ZN8cusparse23gather_row_color_kernelIiLi128EEEvT_PjS2_PS1_                                                                           ⋯
│    0.10% │   3.34 µs │     2 │   1.67 µs │   1.67 µs │   1.67 µs │ _ZN3cub53CUB_200101_500_520_600_610_700_750_800_860_890_900_NS20DeviceScanInitKernelINS0_13ScanTileStateIiLb1EEEEEvT_i                ⋯
│    0.08% │   2.62 µs │     1 │   2.62 µs │   2.62 µs │   2.62 µs │ _ZN8cusparse16streamcpy_kernelIiLi128EEEvT_PS1_S2_S1_                                                                                 ⋯
│    0.06% │   1.91 µs │     1 │   1.91 µs │   1.91 µs │   1.91 µs │ [copy device to pageable memory]                                                                                                      ⋯
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

@frapac (Member) left a comment:


This looks good to me. Before merging this, we need to discuss where to update the analysis for CuSparseSM. I am not sure if it is better to do that in src/backsolve.jl (as implemented right now) or directly in src/interface.jl when calling the function lu! (see comment below).

CUSPARSE.cusparseSpSV_analysis(
    CUSPARSE.handle(), transa, Ref{T}(alpha), descL, descX, descX, T, algo, spsv_L, buffer_L,
)
CUSPARSE.cusparseSpSV_analysis(
frapac (Member):

I am not sure I understand why we need two buffers. Are buffer_L and buffer_U used implicitly in cusparseSpSV_solve?

amontoison (Member Author):

Yes, they are used implicitly by cusparseSpSV_solve: cusparseSpSV_analysis stores the buffer pointer in the SpSV descriptor.
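For example, a subsequent solve can then be issued without passing the buffer again; the sketch below reuses the assumed names from the analysis snippet earlier in this thread (the buffer does not appear in the call because cusparseSpSV_solve retrieves the pointer recorded in the SpSV descriptor during the analysis):

# Sketch only: the buffer registered by cusparseSpSV_analysis is used implicitly here.
CUSPARSE.cusparseSpSV_solve(
    CUSPARSE.handle(), transa, Ref{T}(alpha), descL, descX, descX, T, algo, spsv_L,
)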

src/backsolve.jl Outdated
chktrans(transa)

# CUDA 12 is required for inplace computation in cusparseSpSM
@assert CUDA.runtime_version() ≥ v"12.0"
frapac (Member):

Maybe update this to v"12.3", since that version is required later?
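In code, the suggested change would simply bump the version check, keeping the existing assertion style (the code comment is adjusted accordingly and is only a sketch):

# CUDA 12.3 is required for the functionality used later in this file
@assert CUDA.runtime_version() ≥ v"12.3"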

amontoison (Member Author):

👍

src/backsolve.jl Outdated
[(s.descU, s.infoU), (s.descL, s.infoL)]
end

operations = s.transa == 'N' ? [(s.descL, s.infoL), (s.descU, s.infoU)] : [(s.descU, s.infoU), (s.descL, s.infoL)]
frapac (Member):

I guess the interface remains the same as before for CuSparseSM. How about using cusparseSpSM_solve in that function, and updating the analysis separately in the RFLU interface?
https://github.com/exanauts/CUSOLVERRF.jl/blob/master/src/interface.jl#L85

amontoison (Member Author) commented Nov 11, 2023:

Do you call SpSM multiple times without updating the rf structure?
In that case we can reuse the buffer, but otherwise we gain nothing by moving the computation of the SpSM analysis.

frapac (Member):

Yes, we call SpSM multiple times inside a for loop (on large instances we cannot allocate a buffer large enough).
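As an illustration of that usage pattern, here is a hedged sketch (the names B, Y, Xblk, and Yblk are hypothetical; it assumes B and Y are CuMatrix{Float64} of size (nx, m) with m a multiple of nrhs, matching the block width the RFLU object was created with):

# Sketch only: solve for the right-hand sides block by block, reusing the same
# factorization object `rf` (and hence the same analysis and buffers) at each iteration.
Xblk = CUDA.zeros(Float64, nx, nrhs)   # work buffers with the width rf expects
Yblk = CUDA.zeros(Float64, nx, nrhs)
for k in 1:div(size(B, 2), nrhs)
    cols = (k - 1) * nrhs + 1 : k * nrhs
    Xblk .= view(B, :, cols)           # load the current block of right-hand sides
    ldiv!(Yblk, rf, Xblk)              # triangular solves reuse the stored analysis
    view(Y, :, cols) .= Yblk           # store the corresponding solutions
end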

@amontoison (Member Author) commented Nov 14, 2023

I got an answer from NVIDIA:

> Yes. cusparseSpSV_updateMatrix was added to support the HPCG benchmark. We plan to support the "updateMatrix" routine for SpSM in future releases (probably CUDA 12.4.x).
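For reference, a hedged sketch of how the existing SpSV routine is used (this assumes CUDA.jl exposes the low-level wrapper; spsv_L and newvals_L are placeholder names for an SpSV descriptor and the updated nonzero values of the corresponding factor):

# Sketch only: refresh the values recorded in an SpSV descriptor after the matrix
# coefficients change, without redoing the analysis phase.
CUSPARSE.cusparseSpSV_updateMatrix(
    CUSPARSE.handle(), spsv_L, newvals_L, CUSPARSE.CUSPARSE_SPSV_UPDATE_GENERAL,
)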

@frapac (Member) commented Nov 14, 2023

This PR looks good to merge once we have addressed updating SpSM after the matrix has been refactorized following a change in the coefficients.

@amontoison (Member Author):

It should be good now, François.

@frapac (Member) left a comment:


Thank you!

@frapac frapac merged commit d406df3 into exanauts:master Nov 15, 2023
2 checks passed
@amontoison amontoison deleted the spsv_spsm branch November 30, 2023 09:14
@amontoison (Member Author):

@frapac

> Hi Alexis, we are glad to let you know that our CUSPARSE engineering team has implemented this request for cusparseSpSM_updateMatrix. I've verified it initially on our test framework. This will target the CUDA 12.4 release.
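Assuming the new routine mirrors its SpSV counterpart, a hedged guess at the announced CUDA 12.4 API (spsm_U and newvals_U are placeholder names for an SpSM descriptor and the updated nonzero values of the corresponding factor):

# Sketch only: an SpSM analogue of cusparseSpSV_updateMatrix, as announced above.
CUSPARSE.cusparseSpSM_updateMatrix(
    CUSPARSE.handle(), spsm_U, newvals_U, CUSPARSE.CUSPARSE_SPSM_UPDATE_GENERAL,
)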


Successfully merging this pull request may close these issues.

Performance regression from CUDA 11.8 to CUDA 12.1
2 participants