
Update CUDA versions for CI #6539

Draft · wants to merge 1 commit into master
Conversation

StrikerRUS
Collaborator

Fixed #6520.

@StrikerRUS
Collaborator Author

@shiyu1994 Hi! May I kindly ask you to update the NVIDIA drivers on the host machine where the CUDA CI jobs are executed? That would allow us to run tests against the most recent CUDA version, 12.5. The currently installed driver is 525.147.05, which is insufficient to run CUDA 12.5 containers:

```
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.5, please update your driver to a newer version, or use an earlier cuda container: unknown
```
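The constraint behind this error can be sketched as a simple version check. This is a hypothetical helper, not part of this PR; the 535 minimum is based on the testing reported later in this thread (R525/R530 fail, R535 works):

```python
# Sketch: check whether the host NVIDIA driver branch is new enough to run
# a CUDA 12.5 container. The 535 threshold is an assumption taken from the
# driver testing described in this thread, not an official NVIDIA constant.
MIN_DRIVER_MAJOR_FOR_CUDA_12_5 = 535

def driver_supports_cuda_12_5(driver_version: str) -> bool:
    """driver_version is the string reported by nvidia-smi, e.g. '525.147.05'."""
    major = int(driver_version.split(".")[0])
    return major >= MIN_DRIVER_MAJOR_FOR_CUDA_12_5

print(driver_supports_cuda_12_5("525.147.05"))  # False (the currently installed driver)
print(driver_supports_cuda_12_5("535.183.01"))  # True
```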

Refer to #6520 for the context of this PR.


@jameslamb
Collaborator

Based on https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-drivers, I think we want R535 (the latest long-term support release).

@StrikerRUS
Collaborator Author

> I think we want R535 (the latest long-term support release).

Agree.

* Based on my personal experience, the R530 driver doesn't support CUDA 12.5.

@StrikerRUS
Collaborator Author

Gently pinging @shiyu1994 about the fresh NVIDIA driver installation.

@StrikerRUS
Collaborator Author

Can confirm that R535 is enough to run containers with CUDA 12.5.
Host:

```
Tue Aug  6 22:16:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:8D:00.0 Off |                  Off |
| 30%   27C    P8              27W / 300W |  24893MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

Container:

```
$ docker run --rm --gpus all nvcr.io/nvidia/cuda:12.5.1-cudnn-devel-ubuntu20.04 nvidia-smi
==========
== CUDA ==
==========

CUDA Version 12.5.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Tue Aug  6 22:14:16 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:8D:00.0 Off |                  Off |
| 30%   28C    P8              28W / 300W |  24893MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
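The manual check above could be turned into a small pre-flight script for the CI host. This is only a sketch, not part of this PR; it assumes Docker and the NVIDIA Container Toolkit are installed on the host, and the `runner` parameter exists solely to make the helper testable without a GPU:

```python
import subprocess

def container_smi_ok(image: str, runner=subprocess.run) -> bool:
    """Return True if `nvidia-smi` exits successfully inside the given CUDA container.

    A non-zero exit (e.g. the nvidia-container-cli 'requirement error' above)
    means the host driver cannot run this CUDA image.
    """
    result = runner(
        ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

# Example (requires Docker + NVIDIA Container Toolkit on the host):
# container_smi_ok("nvcr.io/nvidia/cuda:12.5.1-cudnn-devel-ubuntu20.04")
```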

@jameslamb
Collaborator

I'll try to contact @shiyu1994 in the maintainer Slack.

@StrikerRUS
Collaborator Author

@jameslamb Did you succeed? 👼

@jameslamb
Collaborator

> @jameslamb Did you succeed? 👼

No, I haven't been able to reach @shiyu1994 in the last 2 months.

@shiyu1994 since I do see you're active here (#6623), could you please help us with this? I sent another message in the maintainer private chat as well on a separate topic.

@StrikerRUS
Collaborator Author

Just learned that the CUDA Forward Compatibility feature is available only for server cards (e.g. Tesla A100) and not for consumer ones (e.g. RTX 4090).

> Forward Compatibility is applicable only for systems with NVIDIA Data Center GPUs or select NGC Server Ready SKUs of RTX cards.

For example, on a consumer RTX 4090 card with the R535 driver you'll get `cuda runtime error (804) : forward compatibility was attempted on non supported HW` when trying to run a Docker image with CUDA 12.4.

@shiyu1994
Collaborator

> > @jameslamb Did you succeed? 👼
>
> No, I haven't been able to reach @shiyu1994 in the last 2 months.
>
> @shiyu1994 since I do see you're active here (#6623), could you please help us with this? I sent another message in the maintainer private chat as well on a separate topic.

Sorry, I cannot log in to my Slack account, since it is registered with a @qq.com email. I will update the CUDA version of the CI agent.

@jameslamb
Collaborator

Thank you!!


Successfully merging this pull request may close these issues.

[RFC] Sync supported CUDA versions with a new support policy for CUDA Container Images