
Check cuda - torch.cuda.get_device_capability #102

Merged
merged 6 commits into from
Sep 13, 2024

Conversation

mgt16-LANL
Contributor

Only run the triton check if CUDA is available. This change fixes a minor bug from recent changes that broke our CPU-only running of hippynn.
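A minimal sketch of the guarded check this PR describes (the helper name and the capability threshold are illustrative assumptions, not hippynn's actual code):

```python
def triton_compatible(cuda_available, get_device_capability):
    """Decide whether triton kernels can be used.

    cuda_available: result of torch.cuda.is_available()
    get_device_capability: callable returning (major, minor), e.g.
        torch.cuda.get_device_capability; only invoked when CUDA is
        available, which is the fix this PR makes.
    """
    if not cuda_available:
        # Querying the device capability on a CPU-only build would raise,
        # so short-circuit before touching CUDA at all.
        return False
    major, minor = get_device_capability()
    # Assumption: triton requires compute capability 7.0 or newer.
    return (major, minor) >= (7, 0)
```

On a CPU-only machine the capability callable is never invoked, so `triton_compatible(False, None)` returns `False` without touching CUDA.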

tautomer and others added 3 commits September 12, 2024 11:49
1. Replace all occurrences of `restore_db` with `restart_db`. A note
   is added to the documentation to reflect this change.
2. Update changelog for breaking changes of `restart_db` and
   `make_trainvalidtest_split`.
Only run triton check if cuda is available.
@tautomer
Collaborator

tautomer commented Sep 13, 2024

I had to use HIPPYNN_USE_CUSTOM_KERNEL=false to compile the docs.

I initially added this in my PR as well, but deleted it at the end. Actually, the cupy part also breaks the code on the Chicoma head node (I was just compiling the docs, but CPU training on the head node will hit the same problem).

I removed my changes mainly because of the logic of the custom-kernel part: "auto" basically means "true", which in turn implies a GPU must be available. If we follow this logic, the right thing to do is to set HIPPYNN_USE_CUSTOM_KERNEL=false.

IMO, "auto" should be equivalent to "false" in the CPU case, so that these problems do not show up at all. Not sure if Nick agrees with this.

@lubbersnick
Collaborator

@tautomer I don't understand what you mean about cupy breaking too.

USE_CUSTOM_KERNELS=True is important for CPU too because there is a numba implementation on the CPU which strongly outperforms pure pytorch.

It is true that 'cupy' and 'triton' options should not be available unless there is a GPU.
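The rule stated above can be sketched as a pure helper (the names are illustrative, not hippynn's API): the numba-based custom kernels remain available on CPU, while 'cupy' and 'triton' require a GPU.

```python
def available_kernel_impls(gpu_present):
    """Return the custom-kernel implementations valid for this hardware.

    'numba' has a CPU implementation that strongly outperforms pure
    pytorch, so it is always offered; 'cupy' and 'triton' are GPU-only.
    (Illustrative sketch, not hippynn's actual selection logic.)
    """
    impls = ["pytorch", "numba"]
    if gpu_present:
        impls += ["cupy", "triton"]
    return impls
```

Under this rule, disabling the GPU-only backends on a CPU node happens automatically rather than requiring HIPPYNN_USE_CUSTOM_KERNEL=false.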

@lubbersnick
Collaborator

I probably did too much to fix this, but it is fixed now. Also merged changes from #101, plus documentation updates and settings functionality updates! I tested this on a machine without cuda in pytorch; on a machine with cuda in pytorch but no GPU; on a machine with an old, non-triton-compatible GPU; and on a machine with a newer, triton-compatible GPU.

@lubbersnick lubbersnick merged commit 144c160 into lanl:development Sep 13, 2024
1 check passed
@tautomer
Collaborator

> @tautomer I don't understand what you mean about cupy breaking too.
>
> USE_CUSTOM_KERNELS=True is important for CPU too because there is a numba implementation on the CPU which strongly outperforms pure pytorch.
>
> It is true that 'cupy' and 'triton' options should not be available unless there is a GPU.

I cannot reproduce this error on Chicoma anymore, but I found the error message in my clipboard:

cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA runtime version

I do not know what setup can lead to this error, but it is very likely an edge case.

I added an extra try/except for this error:

    try:
        # cupy.cuda.is_available() itself can raise CUDARuntimeError
        # (a RuntimeError subclass), e.g. cudaErrorInsufficientDriver.
        if not cupy.cuda.is_available():
            if torch.cuda.is_available():
                # torch sees a GPU but cupy cannot use it.
                warnings.warn("cupy.cuda.is_available() returned False: Custom kernels will fail on GPU tensors.")
    except RuntimeError as e:
        warnings.warn(f"Cupy encountered a RuntimeError: {e}")

Anyway, this error has gone away. Not sure if they changed something during the current DST that fixed it.

@lubbersnick
Collaborator

> cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA runtime version

This happens when trying to use CUDA 12 for the cudatoolkit while the nvidia driver is not up to date for it. The DST did indeed update the cuda driver.
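The mismatch boils down to a simple version comparison, using CUDA's integer version encoding (1000*major + 10*minor, e.g. 12020 for 12.2). In real code the two numbers would come from `cupy.cuda.runtime.driverGetVersion()` and `cupy.cuda.runtime.runtimeGetVersion()`; the helper below is an illustrative sketch.

```python
def driver_sufficient(driver_version, runtime_version):
    """CUDA encodes versions as 1000*major + 10*minor (12.2 -> 12020).

    cudaErrorInsufficientDriver is raised when the installed driver
    supports an older CUDA version than the runtime was built for.
    """
    return driver_version >= runtime_version

# Example: a driver supporting CUDA 11.8 (11080) cannot serve
# the CUDA 12.0 runtime (12000).
```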

I think everything is OK in hippynn. If someone is trying to use cupy and gets cupy errors, there is probably a limit to the kinds of edge cases we can cover.

@tautomer
Collaborator

> cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA runtime version
>
> This is when trying to use cuda 12 for cudatoolkit when the nvidia driver is not up to date for it. The DST did indeed update the cuda driver.
>
> I think everything is OK in hippynn. If someone is trying to use cupy and they get cupy errors, there is probably a limit to the kinds of edge cases we can cover.

I see. This makes sense.

I guess we could build a Q&A or wiki section to cover this kind of issue. People may encounter similar problems and not be able to find a clue.
