build-and-test workflow is failing due to CUDA runtime version #7668

Closed
xmfcx opened this issue Jun 24, 2024 · 11 comments
Labels
type:bug Software flaws or errors.

Comments

@xmfcx
Contributor

xmfcx commented Jun 24, 2024

@knzo25 Starting from this PR, the CI for build-and-test started failing:

8: C++ exception with description "cudaErrorInsufficientDriver (35)@/__w/autoware.universe/autoware.universe/perception/lidar_centerpoint/include/lidar_centerpoint/cuda_utils.hpp#L80: CUDA driver version is insufficient for CUDA runtime version" thrown in the test body.
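For context, this exception comes from the CUDA error check in cuda_utils.hpp. A minimal sketch of that pattern (illustrative only, not the actual file contents):

  #include <cuda_runtime_api.h>
  #include <sstream>
  #include <stdexcept>

  // Sketch of a CUDA error check that throws, producing messages of the form
  // "<error name> (<code>)@<file>#L<line>: <error string>".
  inline void check_cuda_error(const cudaError_t e, const char * file, const int line)
  {
    if (e != cudaSuccess) {
      std::ostringstream message;
      message << cudaGetErrorName(e) << " (" << e << ")@" << file << "#L" << line << ": "
              << cudaGetErrorString(e);
      throw std::runtime_error{message.str()};
    }
  }

  #define CHECK_CUDA_ERROR(e) (check_cuda_error(e, __FILE__, __LINE__))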

For some reason, the CI passed for the build-and-test-differential cuda check, yet it fails the build-and-test checks.

Originally posted by @xmfcx in #6989 (comment)

@xmfcx xmfcx added the type:bug Software flaws or errors. label Jun 24, 2024
@knzo25
Contributor

knzo25 commented Jun 25, 2024

@xmfcx
Kind of a nasty one, as I ran the tests a while ago with no errors. I will look into it ASAP, but this week I am full of deadlines...

@knzo25
Contributor

knzo25 commented Jun 26, 2024

@youtalk

I tried to reproduce the error using Docker, following the CI/CD commands whenever possible.
However, the tests pass:

9: -- run_test.py: invoking following command in '/home/kenzolobos/workspace/autoware/build/lidar_centerpoint':
9:  - /home/kenzolobos/workspace/autoware/build/lidar_centerpoint/test_preprocess_kernel
9: [==========] Running 4 tests from 1 test suite.
9: [----------] Global test environment set-up.
9: [----------] 4 tests from PreprocessKernelTest
9: [ RUN      ] PreprocessKernelTest.EmptyVoxelTest
9: [       OK ] PreprocessKernelTest.EmptyVoxelTest (45 ms)
9: [ RUN      ] PreprocessKernelTest.BasicTest
9: [       OK ] PreprocessKernelTest.BasicTest (7 ms)
9: [ RUN      ] PreprocessKernelTest.OutOfRangeTest
9: [       OK ] PreprocessKernelTest.OutOfRangeTest (7 ms)
9: [ RUN      ] PreprocessKernelTest.VoxelOverflowTest
9: [       OK ] PreprocessKernelTest.VoxelOverflowTest (7 ms)
9: [----------] 4 tests from PreprocessKernelTest (66 ms total)
9: 
9: [----------] Global test environment tear-down
9: [==========] 4 tests from 1 test suite ran. (66 ms total)
9: [  PASSED  ] 4 tests.

It seems to be an issue on the Docker side:
https://forums.developer.nvidia.com/t/cudaerrorinsufficientdriver-cuda-driver-version-is-insufficient-for-cuda-runtime-version-in-docker-container/294569/10

cudaErrorInsufficientDriver = 35
This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration. Users should install an updated NVIDIA display driver to allow the application to run.
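A quick way to confirm the mismatch from inside the container is to compare the two versions directly. A minimal diagnostic sketch (not part of the CI):

  #include <cuda_runtime_api.h>

  #include <cstdio>

  // Prints the installed driver version and the runtime version the binary was
  // built against; cudaErrorInsufficientDriver means driver < runtime.
  int main()
  {
    int driver_version = 0;
    int runtime_version = 0;
    cudaDriverGetVersion(&driver_version);    // stays 0 if no driver is loaded
    cudaRuntimeGetVersion(&runtime_version);  // e.g. 12030 for CUDA 12.3
    std::printf("driver: %d, runtime: %d\n", driver_version, runtime_version);
    return 0;
  }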

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

Thanks @knzo25, this explains why it specifically failed on the self-hosted machines. I will try installing the driver on the host machines and see.

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

I've installed CUDA 12.3 on both machines.

Running again:

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

They both failed with the same error: https://github.com/autowarefoundation/autoware.universe/actions/runs/9675999034/job/26698258077#step:15:22044

😕

The host machines have the necessary packages, and I've updated the rest of the machines with sudo apt update && sudo apt dist-upgrade, then restarted them for good measure.

Here are the results from the host machines for both:

leo-copper

mfc@copper:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
mfc@copper:~$ nvidia-smi
Wed Jun 26 13:44:42 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1070        Off |   00000000:01:00.0  On |                  N/A |
| N/A   52C    P8             11W /  125W |     356MiB /   8192MiB |      8%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1086      G   /usr/lib/xorg/Xorg                            237MiB |
|    0   N/A  N/A      1333      G   /usr/bin/gnome-shell                          115MiB |
+-----------------------------------------------------------------------------------------+

common-runner-x64-01

This one doesn't have a graphics card; it is a c6a.xlarge instance.

ubuntu@ip-172-31-45-223:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
ubuntu@ip-172-31-45-223:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

@knzo25 I don't know what is missing then :(

@knzo25
Contributor

knzo25 commented Jun 26, 2024

I have about the same setup as you 😢

kenzolobos@desktop:~/workspace/autoware$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
kenzolobos@desktop:~/workspace/autoware$ nvidia-smi
Wed Jun 26 19:59:29 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |

I was going to recommend a reboot, but you already did that. In the past, when changing versions, unloading and reloading the kernel modules worked when nvidia-smi did not, but that is not the case here. Do you have the CUDA samples on that machine to check whether those run?

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

Do you have the CUDA samples on that machine to check whether those run?

I followed the regular installation steps, as always, from here: https://github.com/autowarefoundation/autoware/tree/main/ansible/roles/cuda#manual-installation

I will install nvidia-driver-550 and try again (this is what I have on my daily work PC as well).

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

I think I've understood everything now.

The VoxelGeneratorTest.TwoFramesNoTf test is failing.

What changes caused it?

In the PR here:

  auto points_d = cuda::make_unique<float[]>(capacity_ * config.point_feature_size_);
  cudaMemcpy(
    points_d.get(), points.data(), capacity_ * config.point_feature_size_ * sizeof(float),
    cudaMemcpyHostToDevice);

CUDA calls are being made.

I think that before this, no serious CUDA calls were being made in the tests; most of them could probably also run on the CPU. These new calls (a device allocation and a host-to-device cudaMemcpy) need a working CUDA driver, which is why the test now throws on machines without one.

What are the runner specs?

GitHub hosted runners

These are CPU-only; here are their specs.

Right now every job except:

  • ARM64 workflows
  • build-and-test
  • build-and-test-daily

is running on them.

Self-hosted runners

We have 2 machines here:

  • leo-copper: has a GTX 1070
  • common-runner-x64-01: c6a.xlarge, CPU-only

These run:

  • build-and-test
  • build-and-test-daily

⚠️ nvidia_container_toolkit was not installed on these machines.

Then how did it pass b&t-diff in the first place?

This is the first fishy part from the lidar_centerpoint PR b&t-diff CI run:

Finished <<< lidar_centerpoint [1min 37s]

On my high-end machine:

Finished <<< lidar_centerpoint [4min 45s]

This is too fast for this package.

And looking at its tests:

Almost no tests were performed, including VoxelGeneratorTest.TwoFramesNoTf.

I didn't investigate further into why these didn't run.

Verdict

I think that, until this PR, there was no serious CUDA code in the colcon tests; mostly simple things that could also run on the CPU were tested.

For CUDA-only tests to run, we need CUDA-capable machines with GPUs.

These tests cannot be run on either the GitHub-hosted machines or the CPU-only AWS runner that we have.
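One possible way to guard such tests on CPU-only runners would be a runtime skip; the sketch below is illustrative only (made-up test name), not what was done for this issue:

  #include <cuda_runtime_api.h>
  #include <gtest/gtest.h>

  // Illustrative guard: skip GPU-dependent tests when no usable CUDA device or
  // driver is present, so CPU-only runners report "skipped" instead of failing.
  TEST(CudaRequiredTest, SkipsWithoutDevice)
  {
    int device_count = 0;
    const cudaError_t err = cudaGetDeviceCount(&device_count);
    if (err != cudaSuccess || device_count == 0) {
      GTEST_SKIP() << "No CUDA-capable device: " << cudaGetErrorString(err);
    }
    // ... the actual GPU test body would run here ...
  }

GTEST_SKIP() marks the test as skipped rather than failed, so the job stays green without hiding the missing coverage.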

I have now installed nvidia_container_toolkit on the leo-copper machine with the GTX 1070 and restarted the machine.

I started the test again, but it will probably fail, because I don't think GitHub passes the --gpus all flag to the containers when it starts them.

I will look into how I can do that for that machine.

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

Then how did it pass b&t-diff in the first place?

I found the bug in:

Now it fails on build-and-test-differential too.

@xmfcx xmfcx mentioned this issue Jun 27, 2024
@xmfcx
Contributor Author

xmfcx commented Jun 27, 2024

The tests themselves are fine: once I configured the leo-copper machine, it passed the entire build-and-test successfully.
See:

But we will have to disable the tests that fail on non-CUDA-capable machines, because we don't have the infrastructure ready to handle GPU-based testing for every PR.

I will open an issue to track the disabled tests so they can be re-enabled once CUDA-capable machines are back.

@xmfcx
Contributor Author

xmfcx commented Jun 27, 2024


With all the CI cache related issues fixed, the CI filter issues solved, the GPU-requiring tests disabled and tracked, and, most importantly, build-and-test now passing, I think we can close this issue.

If you have any questions left, please feel free to ask.
