
[CI] Use cuda 12.1 docker image again. #14179

Merged
merged 1 commit into intel:sycl on Jun 14, 2024

Conversation

JackAKirk (Contributor)

Updating the Docker image to CUDA 12.5 led to the problems described below. Since the test output with 12.1 matches that with 12.5, and we don't actually use any CUDA features newer than 12.1 in the compiler (the later releases are minor updates), this PR reverts to the 12.1 image.
We can update the Docker image later, when we really need to (probably when CUDA 13 is released). For the purposes of intel/llvm CI, 12.1 is sufficient.
This fixes the "latest" Docker image, allowing other updates to the image to be made in the future.

CUDA Docker issues:

Depending on the host setup of the runners, recent NVIDIA Docker images hit various issues in their interaction with the host, whereby NVIDIA devices are not visible inside the container.
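A quick way to reproduce that symptom (a minimal sketch, not part of this PR) is a small CUDA program that asks the runtime how many devices are visible from inside the container:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal visibility check: succeeds only if at least one NVIDIA device is
// visible to the CUDA runtime inside the container.
int main() {
    int device_count = 0;
    cudaError_t err = cudaGetDeviceCount(&device_count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Visible CUDA devices: %d\n", device_count);
    return device_count > 0 ? 0 : 1;
}
```

On an affected image/host combination this check is expected to fail or report zero devices, even when `nvidia-smi` works on the host.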

@JackAKirk requested a review from a team as a code owner on June 14, 2024, 10:53
@uditagarwal97 (Contributor)

Do we also need to downgrade the CUDA driver/NVML version installed on the CUDA CI machine?

@JackAKirk (Contributor, Author)

If it currently works as it is, then no: the Docker image currently in use already matches the 12.1 version I'm setting here, because "latest" is not currently being used.

To have long-term fixes for these issues, I think we may need to sync up on driver versions in the way you suggest.
But I don't expect there will be a real reason for us to do this until CUDA 13, which is expected later this year. We may as well deal with the situation as it stands then, since it is likely to have changed compared to now.

@uditagarwal97 (Contributor)

Currently, the CI CUDA machine has the 550.54.14 driver installed, which, IIUC, corresponds to CUDA 12.4. I think we can use CUDA 12.1 with the current driver version, right?

@JackAKirk (Contributor, Author)

Yes, on a normal machine newer CUDA drivers work with older toolkits. When it comes to Docker, things can be more complicated. Does it work currently? If there are no problems with the CI CUDA machine at the moment, then it must be OK, because the CI Docker is currently pointing to an old image that already had CUDA 12.1.
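As an aside, the driver/toolkit relationship relied on here can be confirmed with a small program (a sketch, not something from this CI) that compares the CUDA version supported by the installed driver against the runtime version the binary was built with:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Prints the CUDA version the installed driver supports next to the runtime
// (toolkit) version this binary was built with. A 12.1 toolkit on a driver
// shipped with 12.4 (e.g. 550.54.14) is the supported combination discussed
// above: the driver version only needs to be >= the runtime version.
int main() {
    int driver_version = 0, runtime_version = 0;
    cudaDriverGetVersion(&driver_version);   // e.g. 12040 for a CUDA 12.4 driver
    cudaRuntimeGetVersion(&runtime_version); // e.g. 12010 for the CUDA 12.1 toolkit
    std::printf("Driver supports CUDA %d.%d, runtime built for CUDA %d.%d\n",
                driver_version / 1000, (driver_version % 1000) / 10,
                runtime_version / 1000, (runtime_version % 1000) / 10);
    return (driver_version >= runtime_version) ? 0 : 1;
}
```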

@uditagarwal97 (Contributor)

IIRC, there are pending infrastructure issues with our self-hosted CUDA runner. That's why we had to disable CUDA testing in Nightly in #14041 (comment).

@JackAKirk (Contributor, Author)

In that case, I think the thing to do would be to revert any recent changes made to the CUDA runners.

I hope it is clear that this PR will have no effect on CI when it is merged. It is purely to allow any future changes people might want to make to the Docker setup without the 12.5 image that breaks things.

@uditagarwal97 (Contributor)

Yes, the PR LGTM overall.
My only concern is whether someone from Codeplay is looking into the infrastructure issue with our self-hosted CUDA runner. It's been several days since we disabled CUDA tests in Nightly...

@JackAKirk (Contributor, Author)

AFAIK none of us have access to these runners. Did the problems begin when you updated the driver to 12.4? If that is the case, then I suggest undoing that change.

@uditagarwal97 (Contributor)

@stdale-intel Can we share access to our self-hosted CUDA runner?

Yes, the problems began when the driver was updated. Unfortunately, I can't roll back the driver because of the incompatibility of the older driver with the Linux headers on this machine (see #14041 (comment)).

@JackAKirk (Contributor, Author)

I see. Until we have access to these machines, I'm not sure how much we can help with this.

@aelovikov-intel merged commit f43e8c4 into intel:sycl on Jun 14, 2024
7 checks passed