
Check cuda - torch.cuda.get_device_capability #102

Merged
merged 6 commits into from
Sep 13, 2024

Conversation

mgt16-LANL
Contributor

Only run the triton check if CUDA is available. This change fixes a minor bug from recent changes that broke our CPU-only running of hippynn.
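A minimal sketch of the guarded check this PR describes (the helper name and the capability threshold are illustrative assumptions, not hippynn's actual code):

```python
def triton_compatible(cuda_available, get_device_capability):
    """Decide whether triton kernels can be used.

    cuda_available: result of torch.cuda.is_available()
    get_device_capability: callable returning (major, minor), e.g.
        torch.cuda.get_device_capability; only invoked when CUDA is
        available, which is the fix this PR makes.
    """
    if not cuda_available:
        # Querying the device capability on a CPU-only build would raise,
        # so short-circuit before touching CUDA at all.
        return False
    major, minor = get_device_capability()
    # Assumption: triton requires compute capability 7.0 or newer.
    return (major, minor) >= (7, 0)
```

On a CPU-only machine the capability callable is never invoked, so `triton_compatible(False, None)` returns `False` without touching CUDA.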

tautomer and others added 3 commits September 12, 2024 11:49
1. Replace all occurrences of `restore_db` with `restart_db`. A note
   is added to the documentation to reflect this change.
2. Update changelog for breaking changes of `restart_db` and
   `make_trainvalidtest_split`.
Only run triton check if cuda is available.
@tautomer
Collaborator

tautomer commented Sep 13, 2024

I had to use HIPPYNN_USE_CUSTOM_KERNEL=false to compile the docs.

I initially added this in my PR as well, but deleted it at the end. Actually, the cupy part also breaks the code on the Chicoma head node (I was just compiling the docs, but CPU training on the head node will hit the same problem).

I removed my changes mainly because of the logic of the custom-kernel part: "auto" basically means "true", which in turn implies a GPU must be available. If we follow this logic, the right thing to do is to set HIPPYNN_USE_CUSTOM_KERNEL=false.

IMO, "auto" should be equivalent to "false" in the CPU case, so that these problems do not show up at all. Not sure if Nick agrees with this.

@lubbersnick
Collaborator

@tautomer I don't understand what you mean about cupy breaking too.

USE_CUSTOM_KERNELS=True is important for CPU too because there is a numba implementation on the CPU which strongly outperforms pure pytorch.

It is true that 'cupy' and 'triton' options should not be available unless there is a GPU.
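The rule stated above can be sketched as a pure helper (the names are illustrative, not hippynn's API): the numba-based custom kernels remain available on CPU, while 'cupy' and 'triton' require a GPU.

```python
def available_kernel_impls(gpu_present):
    """Return the custom-kernel implementations valid for this hardware.

    'numba' has a CPU implementation that strongly outperforms pure
    pytorch, so it is always offered; 'cupy' and 'triton' are GPU-only.
    (Illustrative sketch, not hippynn's actual selection logic.)
    """
    impls = ["pytorch", "numba"]
    if gpu_present:
        impls += ["cupy", "triton"]
    return impls
```

Under this rule, disabling the GPU-only backends on a CPU node happens automatically rather than requiring HIPPYNN_USE_CUSTOM_KERNEL=false.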

@lubbersnick
Collaborator

I probably did too much to fix this, but it is fixed now. Also merged changes from #101, plus documentation updates and settings functionality updates! I tested this on a machine without cuda in pytorch; on a machine with cuda in pytorch but no GPU; on a machine with an old, non-triton-compatible GPU; and on a machine with a newer, triton-compatible GPU.

@lubbersnick lubbersnick merged commit 144c160 into lanl:development Sep 13, 2024
1 check passed
@tautomer
Collaborator

> @tautomer I don't understand what you mean about cupy breaking too.
>
> USE_CUSTOM_KERNELS=True is important for CPU too because there is a numba implementation on the CPU which strongly outperforms pure pytorch.
>
> It is true that 'cupy' and 'triton' options should not be available unless there is a GPU.

I cannot reproduce this error on Chicoma anymore, but I found the error message in my clipboard:

cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA runtime version

I do not know what setup can lead to this error, but it is very likely an edge case.

I added an extra try/except for this error:

    try:
        # cupy.cuda.is_available() itself can raise CUDARuntimeError
        # (a RuntimeError subclass), e.g. cudaErrorInsufficientDriver.
        if not cupy.cuda.is_available():
            if torch.cuda.is_available():
                # torch sees a GPU but cupy cannot use it.
                warnings.warn("cupy.cuda.is_available() returned False: Custom kernels will fail on GPU tensors.")
    except RuntimeError as e:
        warnings.warn(f"Cupy encountered a RuntimeError: {e}")

Anyway, this error has gone away. Not sure if they changed something during the current DST that fixed it.

@lubbersnick
Collaborator

> cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA runtime version

This happens when trying to use CUDA 12 for the cudatoolkit while the nvidia driver is not up to date for it. The DST did indeed update the cuda driver.
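The mismatch boils down to a simple version comparison, using CUDA's integer version encoding (1000*major + 10*minor, e.g. 12020 for 12.2). In real code the two numbers would come from `cupy.cuda.runtime.driverGetVersion()` and `cupy.cuda.runtime.runtimeGetVersion()`; the helper below is an illustrative sketch.

```python
def driver_sufficient(driver_version, runtime_version):
    """CUDA encodes versions as 1000*major + 10*minor (12.2 -> 12020).

    cudaErrorInsufficientDriver is raised when the installed driver
    supports an older CUDA version than the runtime was built for.
    """
    return driver_version >= runtime_version

# Example: a driver supporting CUDA 11.8 (11080) cannot serve
# the CUDA 12.0 runtime (12000).
```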

I think everything is OK in hippynn. If someone is trying to use cupy and gets cupy errors, there is probably a limit to the kinds of edge cases we can cover.

@tautomer
Collaborator

> cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA runtime version
>
> This is when trying to use cuda 12 for cudatoolkit when the nvidia driver is not up to date for it. The DST did indeed update the cuda driver.
>
> I think everything is OK in hippynn. If someone is trying to use cupy and they get cupy errors, there is probably a limit to the kinds of edge cases we can cover.

I see. This makes sense.

I guess we could build a Q&A or wiki section to cover this kind of issue. People may encounter similar problems and not be able to find a clue.
