Move GPU CI pipelines from old daint to new daint #1239

msimberg · 2024-09-10T12:18:09Z

don't allow failures on gh200 anymore

codacy-production · 2024-09-10T12:22:16Z

Coverage summary from Codacy

| Coverage variation | Diff coverage |
| --- | --- |
| ✅ +0.01% (target: -1.00%) | ✅ ∅ (target: 90.00%) |

Coverage variation details

| | Coverable lines | Covered lines | Coverage |
| --- | --- | --- | --- |
| Common ancestor commit (`17f3c6f`) | 18346 | 13774 | 75.08% |
| Head commit (`3ba13b4`) | 18346 (+0) | 13776 (+2) | 75.09% (+0.01%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
| --- | --- | --- | --- |
| Pull request (#1239) | 0 | 0 | ∅ (not applicable) |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

Codacy stopped sending the deprecated coverage status on June 5th, 2024.

aurianer

Thanks a lot!

.gitlab/includes/clang14_cuda11_pipeline.yml

msimberg · 2024-09-11T14:04:24Z

.gitlab/pipelines_on_push.yml

@@ -9,3 +9,6 @@ include:
  - local: '.gitlab/includes/clang14_cuda11_pipeline.yml'
  - local: '.gitlab/includes/gcc12_hip6_pipeline.yml'
  - local: '.gitlab/includes/sloc.yml'
+  # TODO: move to on_merge before merging


msimberg · 2024-09-11T14:07:18Z

Exporting NVIDIA_VISIBLE_DEVICES=all and NVIDIA_DRIVER_CAPABILITIES="compute,utility" seems to be what was required to get the container images to load the correct drivers etc. and avoid

cudaErrorInsufficientDriver (CUDA driver version is insufficient for CUDA runtime version)

These are from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html#constraints.

These work when testing manually, but don't seem to work in CI yet.
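
A minimal sketch of how these two variables could be set at the job level in a GitLab CI include, assuming the container runtime on the runner honors them. The job name `.example-gh200-test` and the `nvidia-smi` check are hypothetical and not taken from this PR; only the two variable names and their values come from the comment above.

```yaml
# Hypothetical job template; the name and script are illustrative only.
.example-gh200-test:
  variables:
    # Expose all GPUs on the node to the container.
    NVIDIA_VISIBLE_DEVICES: "all"
    # Request the compute (CUDA) and utility (nvidia-smi, NVML) driver capabilities.
    NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
  script:
    # Quick check that the driver is visible from inside the container.
    - nvidia-smi
```

If the driver libraries are not injected into the container, a check like this would be expected to fail early, before any test hits `cudaErrorInsufficientDriver`.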

msimberg · 2024-09-17T09:42:21Z

All right, we're making some progress:

- The GCC 12/CUDA 12 pipeline is working now.
- The clang/cuda one isn't working because it uses valgrind and valgrind doesn't seem to like some of the instructions used. I'll see if using a less specific arm instruction set may help valgrind here.
- The CUDA 11 pipeline isn't working. I would've expected it to be compatible despite the old CUDA version. I'll see if changing required driver versions etc. helps at all.

I may end up disabling the test steps for the latter two in this PR to reenable them in separate PRs.

msimberg · 2024-09-18T07:56:10Z

.gitlab/includes/common_pipeline.yml

  extends:
-    - .container-runner-daint-gpu
+    - .container-runner-todi-gh200 # TODO: daint? rename template?


msimberg · 2024-09-18T07:56:57Z

.gitlab/includes/gcc13_gh200_pipeline.yml

@@ -8,7 +8,7 @@ include:
  - local: '.gitlab/includes/common_pipeline.yml'


Remove pipeline?

msimberg · 2024-09-18T08:20:04Z

The clang/cuda configuration with valgrind no longer complains about illegal instructions: good. It now reports many issues, and I don't know yet whether they're real.

I'll aim to get the GCC 12/CUDA 12 pipeline running properly (still some tweaks needed on the CSCS CI side apparently) and then I'll attempt to revive the two other CUDA configurations separately, possibly introducing another valgrind configuration on x86.

msimberg added this to the 0.29.0 milestone Sep 10, 2024

msimberg self-assigned this Sep 10, 2024

msimberg force-pushed the cuda-pipelines-gh200 branch 4 times, most recently from d11f8c2 to acdfcb0 on September 10, 2024 16:05

aurianer mentioned this pull request Sep 10, 2024

Rename santis pipeline with gh200 + add a test step on daint-alps #1244

Closed

aurianer approved these changes Sep 10, 2024

msimberg force-pushed the cuda-pipelines-gh200 branch 2 times, most recently from a72b2f8 to 2798c26 on September 11, 2024 14:01

msimberg force-pushed the cuda-pipelines-gh200 branch from b6105a5 to 4716cc4 on September 12, 2024 07:52

aurianer mentioned this pull request Sep 12, 2024

Enable testing for gh200 #1130

Closed

msimberg added 6 commits September 17, 2024 11:52

Move GPU CI pipelines from old daint to new daint

bdb4647

Rename CI templates with _rosa suffix to use _zen2 suffix

ca50adf

TEMP: Try todi runner for gh200 CI jobs

a92fd71

TEMP: Print environment in test stage in CI

6fde6a2

Set environment variables to expose CUDA devices in container runtime

c72521c

Use generic aarch64 architecture for valgrind CI pipeline

bd072ec

msimberg force-pushed the cuda-pipelines-gh200 branch from 0dea610 to bd072ec on September 17, 2024 09:52

Only set NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES in job variables, not in Dockerfile

3d5dc6a

msimberg added 2 commits September 26, 2024 16:04

Use default values for NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES

53f3733

Enable valgrind testing on GCC 12 CI configuration

3ba13b4

aurianer mentioned this pull request Sep 27, 2024

[Very mich WIP] Move perftests reporting on PRs to alps #1255

Draft
