
[CI] Use cuda 12.1 docker image again. #14179

Merged
merged 1 commit into intel:sycl on Jun 14, 2024

Conversation

JackAKirk (Contributor)

Updating the Docker image to CUDA 12.5 led to the problems described below. Since the test output with 12.1 matches that with 12.5, and we don't actually use any CUDA features newer than 12.1 in the compiler (the later releases are minor updates), this PR reverts to the 12.1 image.
We can update the Docker image later, when we really need to (probably when CUDA 13 is released). For the purposes of intel/llvm CI, 12.1 is sufficient.
This fixes the "latest" Docker image, allowing other updates to the image to be made in the future.

CUDA Docker issues:

Depending on the host setup of the runners, recent NVIDIA Docker images hit various issues in their interaction with the host, whereby NVIDIA devices are not visible inside the container.
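A quick way to reproduce that symptom (a minimal sketch, not part of this PR) is a small CUDA program that asks the runtime how many devices are visible from inside the container:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal visibility check: succeeds only if at least one NVIDIA device is
// visible to the CUDA runtime inside the container.
int main() {
    int device_count = 0;
    cudaError_t err = cudaGetDeviceCount(&device_count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Visible CUDA devices: %d\n", device_count);
    return device_count > 0 ? 0 : 1;
}
```

On an affected image/host combination this check is expected to fail or report zero devices, even when `nvidia-smi` works on the host.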

@JackAKirk requested a review from a team as a code owner on June 14, 2024, 10:53
@uditagarwal97 (Contributor)

Do we also need to downgrade the CUDA driver/NVML version installed on the CUDA CI machine?

@JackAKirk (Contributor, Author)

If it currently works as it is, then no: the Docker image currently in use already matches the 12.1 version I'm setting here, because "latest" is not currently being used.

To have long-term fixes for these issues, I think we may need to sync up on driver versions in the way you suggest.
But I don't expect there will be a real reason for us to do this until CUDA 13, which is expected later this year. We may as well deal with the situation as it stands then, since it is likely to have changed compared to now.

@uditagarwal97 (Contributor)

Currently, the CI CUDA machine has the 550.54.14 driver installed, which, IIUC, corresponds to CUDA 12.4. I think we can use CUDA 12.1 with the current driver version, right?

@JackAKirk (Contributor, Author)

Yes, on a normal machine newer CUDA drivers work with older toolkits. When it comes to Docker, things can be more complicated. Does it work currently? If there are no problems with the CI CUDA machine at the moment, then it must be OK, because the CI Docker is currently pointing to an old image that already had CUDA 12.1.
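As an aside, the driver/toolkit relationship relied on here can be confirmed with a small program (a sketch, not something from this CI) that compares the CUDA version supported by the installed driver against the runtime version the binary was built with:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Prints the CUDA version the installed driver supports next to the runtime
// (toolkit) version this binary was built with. A 12.1 toolkit on a driver
// shipped with 12.4 (e.g. 550.54.14) is the supported combination discussed
// above: the driver version only needs to be >= the runtime version.
int main() {
    int driver_version = 0, runtime_version = 0;
    cudaDriverGetVersion(&driver_version);   // e.g. 12040 for a CUDA 12.4 driver
    cudaRuntimeGetVersion(&runtime_version); // e.g. 12010 for the CUDA 12.1 toolkit
    std::printf("Driver supports CUDA %d.%d, runtime built for CUDA %d.%d\n",
                driver_version / 1000, (driver_version % 1000) / 10,
                runtime_version / 1000, (runtime_version % 1000) / 10);
    return (driver_version >= runtime_version) ? 0 : 1;
}
```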

@uditagarwal97 (Contributor)

IIRC, there are pending infrastructure issues with our self-hosted CUDA runner. That's why we had to disable CUDA testing in Nightly in #14041 (comment).

@JackAKirk (Contributor, Author)

In that case, I think the thing to do would be to revert any recent changes made to the CUDA runners.

I hope it is clear that this PR will have no effect on CI when it is merged. It is purely to allow any future changes people might want to make to the Docker setup without the 12.5 image that breaks things.

@uditagarwal97 (Contributor)

Yes, the PR LGTM overall.
My only concern is whether someone from Codeplay is looking into the infrastructure issue with our self-hosted CUDA runner. It's been several days since we disabled CUDA tests in Nightly...

@JackAKirk (Contributor, Author)

AFAIK none of us have access to these runners. Did the problems begin when you updated the driver to 12.4? If that is the case, then I suggest undoing that change.

@uditagarwal97 (Contributor)

@stdale-intel Can we share access to our self-hosted CUDA runner?

Yes, the problems began when the driver was updated. Unfortunately, I can't roll back the driver because of the incompatibility of the older driver with the Linux headers on this machine (see #14041 (comment)).

@JackAKirk (Contributor, Author)

I see. Until we have access to these machines, I'm not sure how much we can help with this.

@aelovikov-intel merged commit f43e8c4 into intel:sycl on Jun 14, 2024
7 checks passed