
Update CUDA versions for CI #6539

Draft · wants to merge 1 commit into master
Conversation

StrikerRUS
Collaborator

Fixed #6520.

@StrikerRUS
Collaborator Author

@shiyu1994 Hi! May I kindly ask you to update the NVIDIA drivers on the host machine where the CUDA CI jobs are executed? That would allow us to run tests against the most recent CUDA version, 12.5. The currently installed driver is 525.147.05, which is insufficient to run CUDA 12.5 containers:

```
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown
```
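The constraint behind this error can be sketched as a simple version check. This is a hypothetical helper, not part of this PR; the 535 minimum is based on the testing reported later in this thread (R525/R530 fail, R535 works):

```python
# Sketch: check whether the host NVIDIA driver branch is new enough to run
# a CUDA 12.5 container. The 535 threshold is an assumption taken from the
# driver testing described in this thread, not an official NVIDIA constant.
MIN_DRIVER_MAJOR_FOR_CUDA_12_5 = 535

def driver_supports_cuda_12_5(driver_version: str) -> bool:
    """driver_version is the string reported by nvidia-smi, e.g. '525.147.05'."""
    major = int(driver_version.split(".")[0])
    return major >= MIN_DRIVER_MAJOR_FOR_CUDA_12_5

print(driver_supports_cuda_12_5("525.147.05"))  # False (the currently installed driver)
print(driver_supports_cuda_12_5("535.183.01"))  # True
```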

Refer to #6520 for the context of this PR.


@jameslamb
Collaborator

Based on https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-drivers, I think we want R535 (the latest long-term support release).

@StrikerRUS
Collaborator Author

> I think we want R535 (the latest long-term support release).

Agree.

* Based on my personal experience, the R530 driver doesn't support CUDA 12.5.

@StrikerRUS
Collaborator Author

Gently pinging @shiyu1994 about the fresh NVIDIA driver installation.

@StrikerRUS
Collaborator Author

Can confirm that R535 is enough to run containers with CUDA 12.5.
Host:

```
Tue Aug  6 22:16:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:8D:00.0 Off |                  Off |
| 30%   27C    P8              27W / 300W |  24893MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

Container:

```
$ docker run --rm --gpus all nvcr.io/nvidia/cuda:12.5.1-cudnn-devel-ubuntu20.04 nvidia-smi
==========
== CUDA ==
==========

CUDA Version 12.5.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Tue Aug  6 22:14:16 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:8D:00.0 Off |                  Off |
| 30%   28C    P8              28W / 300W |  24893MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
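The manual check above could be turned into a small pre-flight script for the CI host. This is only a sketch, not part of this PR; it assumes Docker and the NVIDIA Container Toolkit are installed on the host, and the `runner` parameter exists solely to make the helper testable without a GPU:

```python
import subprocess

def container_smi_ok(image: str, runner=subprocess.run) -> bool:
    """Return True if `nvidia-smi` exits successfully inside the given CUDA container.

    A non-zero exit (e.g. the nvidia-container-cli 'requirement error' above)
    means the host driver cannot run this CUDA image.
    """
    result = runner(
        ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

# Example (requires Docker + NVIDIA Container Toolkit on the host):
# container_smi_ok("nvcr.io/nvidia/cuda:12.5.1-cudnn-devel-ubuntu20.04")
```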

@jameslamb
Collaborator

I'll try to contact @shiyu1994 in the maintainer Slack.

@StrikerRUS
Collaborator Author

@jameslamb Did you succeed? 👼

@jameslamb
Collaborator

> @jameslamb Did you succeed? 👼

No, I haven't been able to reach @shiyu1994 in the last 2 months.

@shiyu1994 since I do see you're active here (#6623), could you please help us with this? I sent another message in the maintainer private chat as well on a separate topic.

@StrikerRUS
Collaborator Author

Just learned that the CUDA Forward Compatibility feature is available only for server cards (e.g. Tesla A100) and not for consumer ones (e.g. RTX 4090).

> Forward Compatibility is applicable only for systems with NVIDIA Data Center GPUs or select NGC Server Ready SKUs of RTX cards.

For example, on a consumer RTX 4090 card with the R535 driver you'll get `cuda runtime error (804) : forward compatibility was attempted on non supported HW` when trying to run a Docker image with CUDA 12.4.

@shiyu1994
Collaborator

> > @jameslamb Did you succeed? 👼
>
> No, I haven't been able to reach @shiyu1994 in the last 2 months.
>
> @shiyu1994 since I do see you're active here (#6623), could you please help us with this? I sent another message in the maintainer private chat as well on a separate topic.

Sorry, I cannot log in to my Slack account, since it is registered with a @qq.com email. I will update the CUDA version of the CI agent.

@jameslamb
Collaborator

Thank you!!


Successfully merging this pull request may close these issues.

[RFC] Sync supported CUDA versions with a new support policy for CUDA Container Images