-
Notifications
You must be signed in to change notification settings - Fork 857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v4.1.x: opal/cuda: avoid direct access to cumem host numa memory #12751
base: v4.1.x
Are you sure you want to change the base?
v4.1.x: opal/cuda: avoid direct access to cumem host numa memory #12751
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR: d2921b0: opal/cuda: avoid direct access to cumem host numa ...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
d2921b0
to
9cd2372
Compare
Hello! The Git Commit Checker CI bot found a few problems with this PR: 9cd2372: opal/cuda: avoid direct access to cumem host numa ...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
9cd2372
to
b72a410
Compare
Hello! The Git Commit Checker CI bot found a few problems with this PR: b72a410: opal/cuda: avoid direct access to cumem host numa ...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
b72a410
to
dc7932b
Compare
@Akshay-Venkatesh @janjust So this isn't needed / doesn't exist in main/v5.0.x? |
Hi @jsquyres . It is needed but the changes will go into accelerator code paths that are quite different from those that exist in 4.1.x series. I'll post a PR soon. |
Ok, good enough. |
Signed-off-by: Akshay Venkatesh <[email protected]> bot:notacherrypick
dc7932b
to
384d8bd
Compare
@Akshay-Venkatesh You just changed this PR significantly. Is it complete and fully tested? |
@jsquyres After making changes to main branch I noticed that similar code would fit for 4.1.x and I had missed an additional check that was needed before marking memory as device vs host. I made those changes to both my branches and I've tested this extensively to make sure everything passes. Would appreciate another round of reviews to make sure I didn't miss anything. |
@@ -113,6 +114,12 @@ AS_IF([test "$opal_check_cuda_happy"="yes"], | |||
[#include <$opal_cuda_incdir/cuda.h>]), | |||
[]) | |||
|
|||
# If we have CUDA support, check to see if we have support for cuMemCreate memory on host NUMA. | |||
AS_IF([test "$opal_check_cuda_happy"="yes"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you can simply check that CUDA has been already found and then (without adding the path to the header file in the #include
) you can use AC_CHECK_DECL
.
If you need to manipulate the location of the header, save the CPPFLAGS do your detection then restore it. Bu here, CUDA has already been detected, which means you should not need to change CPPFLAGS.
CUmemGenericAllocationHandle alloc_handle; | ||
/* Check if memory is allocated using VMM API and see if host memory needs | ||
* to be treated as pinned device memory */ | ||
result = cuFunc.cuMemRetainAllocationHandle(&alloc_handle, (void*)dbuf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks not only overly complicated but also incorrect.
Regarding correctness: according to the CUDA documentation each call the cuMemRetainAllocationHandle
must be matched with a call to cuMemRelease
, which i don't see in this PR. This will result in the memory region referenced here not being able to be released.
What exactly do you get from the combination cuMemRetainAllocationHandle
+ cuMemGetAllocationPropertiesFromHandle
that you could not have obtained from cuMemGetAccess
?
Put this back in Draft mode, because @bosilca's last comments on here were voicing objections (and I don't want to accidentally merge it). So let's get those objections addressed, and then this can get merged. |
Memory allocated using cumemcreate API with location as {CU_MEM_LOCATION_TYPE_HOST/CU_MEM_LOCATION_TYPE_HOST_NUMA/CU_MEM_LOCATION_TYPE_HOST _NUMA_CURRENT} can be detected as host memory type by pointer query API but this doesn't allow the CPU to access such memory using memcpy or other CPU load/store mechanisms unless explicitly requested with cuMemSetAccess. Without the changes in this PR, HOST_NUMA backed cumemcreate memory is detected as host by openmpi layers (opal/datatype, ompi/coll) and subsequent accesses by CPU thread leads to illegal access errors.
bot:notacherrypick