
Installation of the Library #54

Open
i3s93 opened this issue Jul 25, 2024 · 23 comments

@i3s93

i3s93 commented Jul 25, 2024

I would like to use tools from this library in one of my projects, but I'm having some difficulties with the installation process on a Linux cluster.

I have extracted the cuDSS shared object files and set the library path to them, following the directions given here. After installing CUDSS.jl, I tried to execute the following test:

using CUDA, CUDA.CUSPARSE, CUDSS, LinearAlgebra, SparseArrays
A = CuSparseMatrixCSR(sprand(100, 100, 0.1))
solver = CudssSolver(A, "G", 'F')

On the third line, I receive the following error message:

ERROR: UndefVarError: `libcudss` not defined

I'm not sure what I am doing wrong. I have also tried setting the environment variable JULIA_CUDSS_LIBRARY_PATH which is used to set the path for libcudss. Something is not being set properly. I'm using CUDA.jl (v5.4.3) and CUDSS.jl (v0.3.1) on Julia v1.9, if that helps.
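For reference, here is roughly how I set the variables (the extraction path below is a placeholder, not my actual path):

```shell
# Placeholder path; substitute wherever the cuDSS tarball was extracted.
export CUDSS_DIR="$HOME/cudss"
# Variable read by CUDSS.jl to locate libcudss:
export JULIA_CUDSS_LIBRARY_PATH="$CUDSS_DIR/lib"
# Make the shared object visible to the dynamic linker as well:
export LD_LIBRARY_PATH="$CUDSS_DIR/lib:${LD_LIBRARY_PATH:-}"
```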

@amontoison
Member

amontoison commented Jul 25, 2024

@i3s93 You don't need to install anything related to the source code of cuDSS.
We have an artifact system (CUDSS_jll.jl) that automatically downloads and installs cuDSS for users.

You just need:

julia> ]
pkg> add CUDSS

It's explained in the README.md but I should add a note that it also installs the shared library.
You should be able to run any Julia example after that.

@i3s93
Author

i3s93 commented Jul 25, 2024

Thank you @amontoison for your rapid response. I actually started with the base installation in the README.md, but encountered the same error message. That is why I tried to manually set the path, but neither approach worked for me. Here is what I see on my end when I execute the code from my previous comment:

ERROR: UndefVarError: `libcudss` not defined
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:218 [inlined]
  [2] macro expansion
    @ ~/.julia/packages/CUDSS/2E89a/src/libcudss.jl:245 [inlined]
  [3] #31
    @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:35 [inlined]
  [4] retry_reclaim(f::CUDSS.var"#31#32"{Base.RefValue{Ptr{CUDSS.cudssMatrix}}, Int64, Int64, Int32, CuArray{Int32, 1, CUDA.DeviceMemory}, CuPtr{Nothing}, CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{Float64, 1, CUDA.DeviceMemory}, DataType, DataType, String, Char, Char}, retry_if::CUDSS.var"#retry_if#49")
    @ CUDA ~/.julia/packages/CUDA/Tl08O/src/memory.jl:434
  [5] check
    @ ~/.julia/packages/CUDSS/2E89a/src/error.jl:45 [inlined]
  [6] cudssMatrixCreateCsr
    @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:34 [inlined]
  [7] CudssMatrix(A::CuSparseMatrixCSR{Float64, Int32}, structure::String, view::Char; index::Char)
    @ CUDSS ~/.julia/packages/CUDSS/2E89a/src/helpers.jl:81
  [8] CudssMatrix
    @ ~/.julia/packages/CUDSS/2E89a/src/helpers.jl:78 [inlined]
  [9] _
    @ ~/.julia/packages/CUDSS/2E89a/src/interfaces.jl:40 [inlined]
 [10] CudssSolver(A::CuSparseMatrixCSR{Float64, Int32}, structure::String, view::Char)
    @ CUDSS ~/.julia/packages/CUDSS/2E89a/src/interfaces.jl:39
 [11] top-level scope
    @ REPL[3]:1

@amontoison
Member

amontoison commented Jul 25, 2024

Can you remove the environment variable JULIA_CUDSS_LIBRARY_PATH and try to recompile CUDSS.jl with:

force_recompile(package_name::String) = Base.compilecache(Base.identify_package(package_name))
force_recompile("CUDSS")
using CUDSS

@amontoison
Member

If it's still not working, what is your NVIDIA GPU and operating system / architecture?

@i3s93
Author

i3s93 commented Jul 25, 2024

I tried your solution, but I'm still seeing the same problem. I'm running with an NVIDIA A100 GPU with an AMD EPYC 7763 processor. The operating system is SUSE Linux Enterprise Server 15 SP4.

@amontoison
Member

Did you install CUDSS.jl on a node with a GPU initially?
I would try forcing Julia to reinstall the artifacts with:

rm -rf ~/.julia/artifacts/*

@amontoison
Member

amontoison commented Jul 26, 2024

Can you also display the output of:

julia> CUDSS_jll.host_platform
Linux x86_64 {cuda=none, cuda_local=false, cxxstring_abi=cxx11, julia_version=1.10.4, libc=glibc, libgfortran_version=5.0.0, libstdcxx_version=3.4.30}

On my laptop I don't have an NVIDIA GPU so the shared library of cuDSS is not installed.

Are the NVIDIA drivers installed on your computer?

@i3s93
Author

i3s93 commented Jul 26, 2024

Okay, I have removed the artifacts as you have suggested. When I installed the package, I was on a node with the A100. Here is the output you requested:

julia> CUDSS_jll.host_platform
Linux x86_64 {cuda=12.2, cuda_local=true, cxxstring_abi=cxx11, julia_version=1.9.4, libc=glibc, libgfortran_version=5.0.0, libstdcxx_version=3.4.30}

I still see the same error message.

@i3s93
Author

i3s93 commented Jul 26, 2024

Just to follow up, I was able to install and run the code from the package locally on a laptop with an NVIDIA GPU. So far, I have only been able to see this issue when I try to install the package on a remote cluster. I will reach out to the system administrators and see if something on their end is disrupting the installation.

@carstenbauer

Are you using a module on the cluster to get Julia? (I.e. module load ...) If so, can you post the output of module show ...?

It seems that you're trying to use a local CUDA. Assuming that wasn't your intention or your own doing, it might be a global preference set when you load a Julia module.

Btw, which cluster is this?

@i3s93
Author

i3s93 commented Jul 28, 2024

@carstenbauer: This is on Perlmutter, if that helps. Here is the output of module list

Currently Loaded Modules:
  1) craype-x86-milan     3) craype-network-ofi                      5) PrgEnv-gnu/8.5.0   7) cray-libsci/23.12.5   9) craype/2.7.30    11) perftools-base/23.12.0  13) craype-accel-nvidia80  15) julia/1.9.4
  2) libfabric/1.15.2.0   4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta   6) cray-dsmml/0.2.2   8) cray-mpich/8.1.28    10) gcc-native/12.3  12) cpe/23.12               14) gpu/1.0                16) cudatoolkit/12.2 (g)

  Where:
   g:  built for GPU

I can run any of my Julia CUDA codes fine without the CUDA modules, so the CUDA Toolkit is not necessary. I see the same error regardless of whether or not this module is loaded.

@carstenbauer

carstenbauer commented Jul 29, 2024

@i3s93 I just tested this on Perlmutter.

If I use the julia module (module load julia) I can reproduce your error message.

However, if I

  • unset JULIA_LOAD_PATH (to get rid of the global Julia preferences set by the module)
  • and module unload cudatoolkit (not necessary but better to avoid potential conflicts),

your test above works without any issues in a clean Julia environment that just has CUDA and CUDSS in it.
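In shell terms, the steps above can be sketched as follows (the `module` and `julia` commands are shown as comments since they are site-specific):

```shell
# Drop the global Julia preferences injected by the julia module:
unset JULIA_LOAD_PATH
# Site-specific steps, to be run on the cluster:
# module unload cudatoolkit
# julia --project=cudss_test -e 'using Pkg; Pkg.add(["CUDA", "CUDSS"])'
echo "JULIA_LOAD_PATH: ${JULIA_LOAD_PATH:-unset}"   # prints "JULIA_LOAD_PATH: unset"
```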

@JBlaschke

The environment in the global JULIA_LOAD_PATH is used to specify the CUDA version (to stop Julia from installing a version of the CUDA runtime that is incompatible with the system) and the MPI configuration. I suspect the latter has no effect here.
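To see what that mechanism looks like, you can split the variable on `:` (the value below is a hypothetical example, not Perlmutter's actual setting):

```shell
# Hypothetical value mimicking what a cluster's julia module might set:
JULIA_LOAD_PATH="@:/opt/julia/site/environments/default:@stdlib"
# Each colon-separated entry is an environment Julia will search; a shared
# site environment can carry a LocalPreferences.toml that pins
# CUDA_Runtime_jll to the local toolkit.
echo "$JULIA_LOAD_PATH" | tr ':' '\n'
```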

@i3s93 did unsetting JULIA_LOAD_PATH cause pkg> add CUDSS to install a newer version of CUDA?

@carstenbauer

carstenbauer commented Jul 29, 2024

@i3s93 did unsetting JULIA_LOAD_PATH cause pkg> add CUDSS to install a newer version of CUDA?

@JBlaschke I assume the question was for me, since I was the one who did the (successful) test with JULIA_LOAD_PATH unset. And to answer it: yes, afterwards I get 12.5 (instead of 12.2):

julia> CUDA.versioninfo()
CUDA runtime 12.5, artifact installation
CUDA driver 12.0
NVIDIA driver 525.105.17

CUDA libraries:
- CUBLAS: 12.5.3
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.3
- CUSPARSE: 12.5.1
- CUPTI: 2024.2.1 (API 23.0.0)
- NVML: 12.0.0+525.105.17

Julia packages:
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

1 device:
  0: NVIDIA A100-PCIE-40GB (sm_80, 38.984 GiB / 40.000 GiB available)

For comparison, this is if I don't unset and don't unload the cudatoolkit module:

julia> CUDA.versioninfo()
CUDA runtime 12.2, local installation
CUDA driver 12.2
NVIDIA driver 525.105.17

CUDA libraries:
- CUBLAS: 12.2.1
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.0
- CUSPARSE: 12.1.1
- CUPTI: 2023.2.0 (API 20.0.0)
- NVML: 12.0.0+525.105.17

Julia packages:
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0
- CUDA_Runtime_Discovery: 0.3.4

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

Preferences:
- CUDA_Runtime_jll.version: 12.2
- CUDA_Runtime_jll.local: true

1 device:
  0: NVIDIA A100-PCIE-40GB (sm_80, 38.984 GiB / 40.000 GiB available)

@JBlaschke

Thanks @carstenbauer for checking. So libcudss doesn't appear to be in the cudatoolkit module. I'll see if it's installed anywhere.

One more thing: does the artifact even work on a compute node? For previous versions we would get segfaults.

@JBlaschke

It looks like we don't have a version on Perlmutter yet. I might go and check the artifact install of CUDA. If that doesn't work I'd need to develop a module.

@amontoison
Member

@JBlaschke Do you mean the artifact of cuDSS?
The recent version 0.3.0 works fine without segmentation faults.

@JBlaschke

@amontoison No, I meant running CUDA.jl using the artifact CUDA (instead of the one provided by the OS).

@JBlaschke

On Perlmutter

@i3s93
Author

i3s93 commented Jul 30, 2024

@carstenbauer Thank you for taking the time to help resolve this issue! I can also confirm that unsetting JULIA_LOAD_PATH worked for me.

@JBlaschke Thank you for your help as well! My tests with cuDSS are small-scale, so I am fine with unsetting the environment variable until a better solution becomes available.

@amontoison I greatly appreciate the timely feedback and for having a look at this problem. Since this does not appear to be an issue with CUDSS.jl, I'm fine with closing this issue, unless the others would like to continue the discussion!

@amontoison
Member

amontoison commented Jul 30, 2024

I am wondering how relevant it will be to detect a local installation of cuDSS:
#55

cuDSS is still in preview, so every minor release breaks the API; supporting a local installation would require it to always be the most recent version, which is probably hard to maintain.

@JBlaschke

@amontoison in the past CUDA would not work at all unless you used the local install on Perlmutter. It might be the case that this is no longer necessary.

I haven't had a chance to test this. Will do so soon. If it is the case that running CUDA_jll is unstable on Perlmutter, then we have no choice but to also use a local CUDSS install...

@amontoison
Member

amontoison commented Aug 15, 2024

@carstenbauer @JBlaschke @i3s93
May I ask one of you to test my PR #57?
It should help to detect a local install on Perlmutter.

Do you know why Tim checks whether we are precompiling in this __init__ function, which I based my PR on?
Is it to avoid an error when precompiling on a cluster node without GPUs?
