v4.1.x: opal/cuda: avoid direct access to cumem host numa memory #12751

Akshay-Venkatesh · 2024-08-13T17:50:09Z

Memory allocated with the cuMemCreate API with a location type of CU_MEM_LOCATION_TYPE_HOST, CU_MEM_LOCATION_TYPE_HOST_NUMA, or CU_MEM_LOCATION_TYPE_HOST_NUMA_CURRENT is reported as host memory by the pointer query API. However, the CPU cannot access such memory with memcpy or other load/store mechanisms unless access is explicitly requested with cuMemSetAccess. Without the changes in this PR, HOST_NUMA-backed cuMemCreate memory is detected as host memory by the Open MPI layers (opal/datatype, ompi/coll), and subsequent accesses by a CPU thread lead to illegal access errors.

bot:notacherrypick

github-actions · 2024-08-13T17:50:46Z

Hello! The Git Commit Checker CI bot found a few problems with this PR:

d2921b0: opal/cuda: avoid direct access to cumem host numa ...

check_signed_off: does not contain a valid Signed-off-by line
check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

github-actions · 2024-08-13T17:56:31Z

Hello! The Git Commit Checker CI bot found a few problems with this PR:

9cd2372: opal/cuda: avoid direct access to cumem host numa ...

check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

github-actions · 2024-08-13T18:06:56Z

Hello! The Git Commit Checker CI bot found a few problems with this PR:

b72a410: opal/cuda: avoid direct access to cumem host numa ...

check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

jsquyres · 2024-08-13T18:41:08Z

@Akshay-Venkatesh @janjust So this isn't needed / doesn't exist in main/v5.0.x?

Akshay-Venkatesh · 2024-08-13T18:59:01Z

> @Akshay-Venkatesh @janjust So this isn't needed / doesn't exist in main/v5.0.x?

Hi @jsquyres. It is needed, but the changes will go into accelerator code paths that are quite different from those in the 4.1.x series. I'll post a PR soon.

jsquyres · 2024-08-13T19:02:08Z

Ok, good enough.

Signed-off-by: Akshay Venkatesh <[email protected]> bot:notacherrypick

PR was changed after review

jsquyres · 2024-08-14T12:07:40Z

@Akshay-Venkatesh You just changed this PR significantly. Is it complete and fully tested?

Akshay-Venkatesh · 2024-08-14T18:05:49Z

@jsquyres After making changes to the main branch, I noticed that similar code would fit 4.1.x, and that I had missed an additional check needed before marking memory as device vs. host. I made those changes to both of my branches and have tested this extensively to make sure everything passes. I would appreciate another round of reviews to make sure I didn't miss anything.

bosilca · 2024-08-14T15:30:51Z

config/opal_check_cuda.m4

@@ -113,6 +114,12 @@ AS_IF([test "$opal_check_cuda_happy"="yes"],
        [#include <$opal_cuda_incdir/cuda.h>]),
    [])

+# If we have CUDA support, check to see if we have support for cuMemCreate memory on host NUMA.
+AS_IF([test "$opal_check_cuda_happy"="yes"],


Here you can simply check that CUDA has already been found and then use AC_CHECK_DECL, without adding the path to the header file in the #include.

If you need to manipulate the location of the header, save CPPFLAGS, do your detection, and then restore it. But here CUDA has already been detected, which means you should not need to change CPPFLAGS.
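A minimal sketch of what this suggestion could look like, assuming (as the comment implies) that cuda.h is already reachable through CPPFLAGS at this point in the macro. The shell variable opal_cuda_host_numa and the define OPAL_CUDA_HAVE_HOST_NUMA are hypothetical names chosen for illustration; they are not taken from the PR.

```m4
# Sketch only: opal_cuda_host_numa and OPAL_CUDA_HAVE_HOST_NUMA are
# hypothetical names, not the ones used in the PR.
# Note the spaces around "=" so that test compares two strings.
AS_IF([test "$opal_check_cuda_happy" = "yes"],
      [AC_CHECK_DECL([CU_MEM_LOCATION_TYPE_HOST_NUMA],
                     [opal_cuda_host_numa=1],
                     [opal_cuda_host_numa=0],
                     [[#include <cuda.h>]])],
      [opal_cuda_host_numa=0])

# Expose the result to C code as a 0/1 preprocessor symbol.
AC_DEFINE_UNQUOTED([OPAL_CUDA_HAVE_HOST_NUMA], [$opal_cuda_host_numa],
                   [Whether cuda.h declares CU_MEM_LOCATION_TYPE_HOST_NUMA])
```

AC_CHECK_DECL works for enumerators as well as functions and variables, so no CPPFLAGS save/restore is needed as long as the earlier CUDA detection left the include path in place.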

bosilca · 2024-08-15T05:13:31Z

opal/mca/common/cuda/common_cuda.c

+    CUmemGenericAllocationHandle alloc_handle;
+    /* Check if memory is allocated using VMM API and see if host memory needs
+     * to be treated as pinned device memory */
+    result = cuFunc.cuMemRetainAllocationHandle(&alloc_handle, (void*)dbuf);


This looks not only overly complicated but also incorrect.

Regarding correctness: according to the CUDA documentation, each call to cuMemRetainAllocationHandle must be matched with a call to cuMemRelease, which I don't see in this PR. Without it, the allocation referenced here can never be released.

What exactly do you get from the combination of cuMemRetainAllocationHandle + cuMemGetAllocationPropertiesFromHandle that you could not have obtained from cuMemGetAccess?

jsquyres · 2024-09-10T15:17:42Z

Put this back in Draft mode, because @bosilca's last comments on here were voicing objections (and I don't want to accidentally merge it). So let's get those objections addressed, and then this can get merged.

github-actions bot added this to the v4.1.7 milestone Aug 13, 2024

Akshay-Venkatesh requested a review from janjust August 13, 2024 17:50

github-actions bot added the Target: v4.1.x label Aug 13, 2024

Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from d2921b0 to 9cd2372 Compare August 13, 2024 17:55

Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from 9cd2372 to b72a410 Compare August 13, 2024 18:06

Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from b72a410 to dc7932b Compare August 13, 2024 18:19

Akshay-Venkatesh assigned janjust Aug 13, 2024

janjust previously approved these changes Aug 13, 2024

jsquyres added the RM approved label Aug 13, 2024

Akshay-Venkatesh mentioned this pull request Aug 14, 2024

opal/cuda: Handle CUDA VMM pointers in accelerator check_addr function #12757

Merged

opal/cuda: avoid direct access to cumem host numa memory

384d8bd

Signed-off-by: Akshay Venkatesh <[email protected]> bot:notacherrypick

Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from dc7932b to 384d8bd Compare August 14, 2024 06:50

bosilca reviewed Aug 15, 2024

jsquyres removed the RM approved label Aug 15, 2024

janjust changed the title ~~opal/cuda: avoid direct access to cumem host numa memory~~ v4.1.x: opal/cuda: avoid direct access to cumem host numa memory Aug 23, 2024

jsquyres marked this pull request as draft September 10, 2024 15:14
