build-and-test workflow is failing due to CUDA runtime version #7668

Closed
xmfcx opened this issue Jun 24, 2024 · 11 comments
Labels
type:bug Software flaws or errors.

Comments

@xmfcx
Contributor

xmfcx commented Jun 24, 2024

@knzo25 Starting from this PR, the CI for build-and-test started failing:

8: C++ exception with description "cudaErrorInsufficientDriver (35)@/__w/autoware.universe/autoware.universe/perception/lidar_centerpoint/include/lidar_centerpoint/cuda_utils.hpp#L80: CUDA driver version is insufficient for CUDA runtime version" thrown in the test body.
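For context, this exception comes from the CUDA error check in cuda_utils.hpp. A minimal sketch of that pattern (illustrative only, not the actual file contents):

  #include <cuda_runtime_api.h>
  #include <sstream>
  #include <stdexcept>

  // Sketch of a CUDA error check that throws, producing messages of the form
  // "<error name> (<code>)@<file>#L<line>: <error string>".
  inline void check_cuda_error(const cudaError_t e, const char * file, const int line)
  {
    if (e != cudaSuccess) {
      std::ostringstream message;
      message << cudaGetErrorName(e) << " (" << e << ")@" << file << "#L" << line << ": "
              << cudaGetErrorString(e);
      throw std::runtime_error{message.str()};
    }
  }

  #define CHECK_CUDA_ERROR(e) (check_cuda_error(e, __FILE__, __LINE__))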

For some reason, the CI passed for the build-and-test-differential cuda check, yet it fails the build-and-test checks.

Originally posted by @xmfcx in #6989 (comment)

@xmfcx xmfcx added the type:bug Software flaws or errors. label Jun 24, 2024
@knzo25
Contributor

knzo25 commented Jun 25, 2024

@xmfcx
Kind of a nasty one, as I ran the tests a while ago with no errors. I will look into it ASAP, but this week I am full of deadlines...

@knzo25
Contributor

knzo25 commented Jun 26, 2024

@youtalk

I tried to reproduce the error using Docker, following the CI/CD commands whenever possible.
However, the tests pass:

9: -- run_test.py: invoking following command in '/home/kenzolobos/workspace/autoware/build/lidar_centerpoint':
9:  - /home/kenzolobos/workspace/autoware/build/lidar_centerpoint/test_preprocess_kernel
9: [==========] Running 4 tests from 1 test suite.
9: [----------] Global test environment set-up.
9: [----------] 4 tests from PreprocessKernelTest
9: [ RUN      ] PreprocessKernelTest.EmptyVoxelTest
9: [       OK ] PreprocessKernelTest.EmptyVoxelTest (45 ms)
9: [ RUN      ] PreprocessKernelTest.BasicTest
9: [       OK ] PreprocessKernelTest.BasicTest (7 ms)
9: [ RUN      ] PreprocessKernelTest.OutOfRangeTest
9: [       OK ] PreprocessKernelTest.OutOfRangeTest (7 ms)
9: [ RUN      ] PreprocessKernelTest.VoxelOverflowTest
9: [       OK ] PreprocessKernelTest.VoxelOverflowTest (7 ms)
9: [----------] 4 tests from PreprocessKernelTest (66 ms total)
9: 
9: [----------] Global test environment tear-down
9: [==========] 4 tests from 1 test suite ran. (66 ms total)
9: [  PASSED  ] 4 tests.

It seems to be an issue on the Docker side:
https://forums.developer.nvidia.com/t/cudaerrorinsufficientdriver-cuda-driver-version-is-insufficient-for-cuda-runtime-version-in-docker-container/294569/10

cudaErrorInsufficientDriver = 35
This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration. Users should install an updated NVIDIA display driver to allow the application to run.
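A quick way to confirm the mismatch from inside the container is to compare the two versions directly. A minimal diagnostic sketch (not part of the CI):

  #include <cuda_runtime_api.h>

  #include <cstdio>

  // Prints the installed driver version and the runtime version the binary was
  // built against; cudaErrorInsufficientDriver means driver < runtime.
  int main()
  {
    int driver_version = 0;
    int runtime_version = 0;
    cudaDriverGetVersion(&driver_version);    // stays 0 if no driver is loaded
    cudaRuntimeGetVersion(&runtime_version);  // e.g. 12030 for CUDA 12.3
    std::printf("driver: %d, runtime: %d\n", driver_version, runtime_version);
    return 0;
  }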

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

Thanks @knzo25, this explains why it specifically failed on the self-hosted machines. I will try installing the driver on the host machines and see.

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

I've installed CUDA 12.3 on both machines.

Running again:

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

They both failed with the same error: https://github.com/autowarefoundation/autoware.universe/actions/runs/9675999034/job/26698258077#step:15:22044

😕

The host machines have the necessary packages, and I've updated the rest of the machines with sudo apt update && sudo apt dist-upgrade, then restarted them for good measure.

Here are the results from the host machines for both:

leo-copper

mfc@copper:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
mfc@copper:~$ nvidia-smi
Wed Jun 26 13:44:42 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1070        Off |   00000000:01:00.0  On |                  N/A |
| N/A   52C    P8             11W /  125W |     356MiB /   8192MiB |      8%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1086      G   /usr/lib/xorg/Xorg                            237MiB |
|    0   N/A  N/A      1333      G   /usr/bin/gnome-shell                          115MiB |
+-----------------------------------------------------------------------------------------+

common-runner-x64-01

This one doesn't have a graphics card; it is a c6a.xlarge instance.

ubuntu@ip-172-31-45-223:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
ubuntu@ip-172-31-45-223:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

@knzo25 I don't know what is missing then :(

@knzo25
Contributor

knzo25 commented Jun 26, 2024

I have about the same setup as you 😢

kenzolobos@desktop:~/workspace/autoware$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
kenzolobos@desktop:~/workspace/autoware$ nvidia-smi
Wed Jun 26 19:59:29 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |

I was going to recommend a reboot, but you already did that. In the past, when changing versions, unloading and reloading the kernel modules worked when nvidia-smi did not, but that is not the case here. Do you have the CUDA samples on that machine to check whether those run?

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

Do you have the CUDA samples on that machine to check whether those run?

I followed the regular installation steps, as always, from here: https://github.com/autowarefoundation/autoware/tree/main/ansible/roles/cuda#manual-installation

I will install nvidia-driver-550 and try again (this is what I have on my daily work PC as well).

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

I think I've understood everything now.

The VoxelGeneratorTest.TwoFramesNoTf test is failing.

What changes caused it?

In the PR here:

  auto points_d = cuda::make_unique<float[]>(capacity_ * config.point_feature_size_);
  cudaMemcpy(
    points_d.get(), points.data(), capacity_ * config.point_feature_size_ * sizeof(float),
    cudaMemcpyHostToDevice);

CUDA calls are being made.

I think that before this, no serious CUDA calls were being made in the tests; most of them could probably also run on the CPU. These new calls (a device allocation and a host-to-device cudaMemcpy) need a working CUDA driver, which is why the test now throws on machines without one.

What are the runner specs?

GitHub hosted runners

These are CPU-only; here are their specs.

Right now every job except:

  • ARM64 workflows
  • build-and-test
  • build-and-test-daily

is running on them.

Self-hosted runners

We have 2 machines here:

  • leo-copper: has a GTX 1070
  • common-runner-x64-01: c6a.xlarge, CPU-only

These run:

  • build-and-test
  • build-and-test-daily

⚠️ nvidia_container_toolkit was not installed on these machines.

Then how did it pass b&t-diff in the first place?

This is the first fishy part from the lidar_centerpoint PR b&t-diff CI run:

Finished <<< lidar_centerpoint [1min 37s]

On my high-end machine:

Finished <<< lidar_centerpoint [4min 45s]

This is too fast for this package.

And looking at its tests:

Almost no tests were performed, including VoxelGeneratorTest.TwoFramesNoTf.

I didn't investigate further into why these didn't run.

Verdict

I think that, until this PR, there was no serious CUDA code in the colcon tests; mostly simple things that could also run on the CPU were tested.

For CUDA-only tests to run, we need CUDA-capable machines with GPUs.

These tests cannot be run on either the GitHub-hosted machines or the CPU-only AWS runner that we have.
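One possible way to guard such tests on CPU-only runners would be a runtime skip; the sketch below is illustrative only (made-up test name), not what was done for this issue:

  #include <cuda_runtime_api.h>
  #include <gtest/gtest.h>

  // Illustrative guard: skip GPU-dependent tests when no usable CUDA device or
  // driver is present, so CPU-only runners report "skipped" instead of failing.
  TEST(CudaRequiredTest, SkipsWithoutDevice)
  {
    int device_count = 0;
    const cudaError_t err = cudaGetDeviceCount(&device_count);
    if (err != cudaSuccess || device_count == 0) {
      GTEST_SKIP() << "No CUDA-capable device: " << cudaGetErrorString(err);
    }
    // ... the actual GPU test body would run here ...
  }

GTEST_SKIP() marks the test as skipped rather than failed, so the job stays green without hiding the missing coverage.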

I have now installed nvidia_container_toolkit on the leo-copper machine with the GTX 1070 and restarted the machine.

I started the test again, but it will probably fail, because I don't think GitHub passes the --gpus all flag to the containers when it starts them.

I will look into how I can do that for that machine.

@xmfcx
Contributor Author

xmfcx commented Jun 26, 2024

Then how did it pass b&t-diff in the first place?

I found the bug in:

Now it fails on build-and-test-differential too.

@xmfcx xmfcx mentioned this issue Jun 27, 2024
@xmfcx
Contributor Author

xmfcx commented Jun 27, 2024

The tests themselves are fine: once I configured the leo-copper machine, it passed the entire build-and-test successfully.
See:

But we will have to disable the tests that fail on non-CUDA-capable machines, because we don't have the infrastructure ready to handle GPU-based testing for every PR.

I will open an issue to track the disabled tests so they can be re-enabled once CUDA-capable machines are back.

@xmfcx
Contributor Author

xmfcx commented Jun 27, 2024


With all the CI cache related issues fixed, the CI filter issues solved, the GPU-requiring tests disabled and tracked, and, most importantly, build-and-test now passing, I think we can close this issue.

If you have any questions left, please feel free to ask.
