Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v4.1.x: opal/cuda: avoid direct access to cumem host numa memory #12751

Draft
wants to merge 1 commit into
base: v4.1.x
Choose a base branch
from

Conversation

Akshay-Venkatesh
Copy link
Contributor

@Akshay-Venkatesh Akshay-Venkatesh commented Aug 13, 2024

Memory allocated using cumemcreate API with location as {CU_MEM_LOCATION_TYPE_HOST/CU_MEM_LOCATION_TYPE_HOST_NUMA/CU_MEM_LOCATION_TYPE_HOST _NUMA_CURRENT} can be detected as host memory type by pointer query API but this doesn't allow the CPU to access such memory using memcpy or other CPU load/store mechanisms unless explicitly requested with cuMemSetAccess. Without the changes in this PR, HOST_NUMA backed cumemcreate memory is detected as host by openmpi layers (opal/datatype, ompi/coll) and subsequent accesses by CPU thread leads to illegal access errors.

bot:notacherrypick

@github-actions github-actions bot added this to the v4.1.7 milestone Aug 13, 2024
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

d2921b0: opal/cuda: avoid direct access to cumem host numa ...

  • check_signed_off: does not contain a valid Signed-off-by line
  • check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@Akshay-Venkatesh Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from d2921b0 to 9cd2372 Compare August 13, 2024 17:55
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

9cd2372: opal/cuda: avoid direct access to cumem host numa ...

  • check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@Akshay-Venkatesh Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from 9cd2372 to b72a410 Compare August 13, 2024 18:06
Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

b72a410: opal/cuda: avoid direct access to cumem host numa ...

  • check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

janjust
janjust previously approved these changes Aug 13, 2024
@jsquyres
Copy link
Member

@Akshay-Venkatesh @janjust So this isn't needed / doesn't exist in main/v5.0.x?

@Akshay-Venkatesh
Copy link
Contributor Author

@Akshay-Venkatesh @janjust So this isn't needed / doesn't exist in main/v5.0.x?

Hi @jsquyres . It is needed but the changes will go into accelerator code paths that are quite different from those that exist in 4.1.x series. I'll post a PR soon.

@jsquyres
Copy link
Member

Ok, good enough.

Signed-off-by: Akshay Venkatesh <[email protected]>

bot:notacherrypick
@Akshay-Venkatesh Akshay-Venkatesh force-pushed the topic/detect-host-numa-as-device-mem branch from dc7932b to 384d8bd Compare August 14, 2024 06:50
@jsquyres jsquyres dismissed janjust’s stale review August 14, 2024 12:06

PR was changed after review

@jsquyres
Copy link
Member

@Akshay-Venkatesh You just changed this PR significantly. Is it complete and fully tested?

@Akshay-Venkatesh
Copy link
Contributor Author

@jsquyres After making changes to main branch I noticed that similar code would fit for 4.1.x and I had missed an additional check that was needed before marking memory as device vs host. I made those changes to both my branches and I've tested this extensively to make sure everything passes. Would appreciate another round of reviews to make sure I didn't miss anything.

@@ -113,6 +114,12 @@ AS_IF([test "$opal_check_cuda_happy"="yes"],
[#include <$opal_cuda_incdir/cuda.h>]),
[])

# If we have CUDA support, check to see if we have support for cuMemCreate memory on host NUMA.
AS_IF([test "$opal_check_cuda_happy"="yes"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you can simply check that CUDA has been already found and then (without adding the path to the header file in the #include) you can use AC_CHECK_DECL.

If you need to manipulate the location of the header, save the CPPFLAGS do your detection then restore it. Bu here, CUDA has already been detected, which means you should not need to change CPPFLAGS.

CUmemGenericAllocationHandle alloc_handle;
/* Check if memory is allocated using VMM API and see if host memory needs
* to be treated as pinned device memory */
result = cuFunc.cuMemRetainAllocationHandle(&alloc_handle, (void*)dbuf);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks not only overly complicated but also incorrect.

Regarding correctness: according to the CUDA documentation each call the cuMemRetainAllocationHandle must be matched with a call to cuMemRelease, which i don't see in this PR. This will result in the memory region referenced here not being able to be released.

What exactly do you get from the combination cuMemRetainAllocationHandle + cuMemGetAllocationPropertiesFromHandle that you could not have obtained from cuMemGetAccess ?

@janjust janjust changed the title opal/cuda: avoid direct access to cumem host numa memory v4.1.x: opal/cuda: avoid direct access to cumem host numa memory Aug 23, 2024
@jsquyres jsquyres marked this pull request as draft September 10, 2024 15:14
@jsquyres
Copy link
Member

Put this back in Draft mode, because @bosilca's last comments on here were voicing objections (and I don't want to accidentally merge it). So let's get those objections addressed, and then this can get merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants