-
Notifications
You must be signed in to change notification settings - Fork 857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coll/han/alltoallv Bugfixes #12806
base: main
Are you sure you want to change the base?
coll/han/alltoallv Bugfixes #12806
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR: e6c34d3: coll/han/allreduce: squelch a particularly noisy l...
cb82581: coll/han/alltoallv: handle errors better
e3e9e1f: coll/han/alltoallv: correct loop condition on empt...
577968e: coll/han: alltoallv should fall-back if transform ...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
A status update: These commits are good as they are, but I think I have found evidence of possible data corruption. I am working to identify and fix the issue, and then I'll add them to this PR. |
e6c34d3
to
c1ddbd5
Compare
c1ddbd5
to
4fafbfe
Compare
Hello! The Git Commit Checker CI bot found a few problems with this PR: a50519c: coll/han/alltoallv: Correct swapped send/recv data...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
Signed-off-by: Luke Robison <[email protected]>
mca_coll_han_alltoallv_using_smsc is not compatible with an in-place transform. Update comments and fall-back logic to catch this case. Signed-off-by: Luke Robison <[email protected]>
This change makes the loop condition more straightforward, and corrects a logic error that was previously seen when a rank had 0 bytes to receive. With the new logic the first pass through the receive-positing section will walk through all datatypes, and never post a recv buffer. Sender logic remains unchanged. Signed-off-by: Luke Robison <[email protected]>
Previously if one rank returned an error, it would immediately exit the collective with an error which likely causes that pid to immediately exit with an error. Unfortunately this causes immediate segfaults as other ranks may still be accessing shared memory, often causing cascading walls of segfault text. I have moved the existing MPI_barrier so that all processes must pass it before exit regardless of error condition. This means that either errors will hang, or all ranks will encounter an error. Signed-off-by: Luke Robison <[email protected]>
Signed-off-by: Luke Robison <[email protected]>
Previously this function considered only displ, count, and extent. Since the function uses XPMEM to explicitly expose memory regions, we must also be aware of types that have negative lower bounds and might access data _before_ the user-provided pointer. This change more accurately compute the true upper and lower bounds of all memory accesses, both to insure we don't try to map regions of memory that may not be in our VM page table, and to ensure we expose all memory that will be accessed. Signed-off-by: Luke Robison <[email protected]>
4fafbfe
to
161fd69
Compare
The logic for waitany was flawed. While we could wait for either a send or a receive, we cannot consume receives in any order, and likewise for sends. Fix this by simply ping-ponging between waiting for sends or receives and cycling when there is nothing to wait on. Signed-off-by: Luke Robison <[email protected]>
18a5ada
to
bf267b3
Compare
@bosilca At long last this is ready for review again. Thank you for your patience. I have made a rather exhaustive test suite. I am trying to figure out the best place to add it. Perhaps to MTT I suppose. Any other suggestions? |
Change to a greedy recv-posting algorithm: always pre-post and cancel later. This is motivated by the observation that the convertor does not always use every byte in the buffer, which makes the previous method of counting how many bytes will arrive next innacurate. Signed-off-by: Luke Robison <[email protected]>
@@ -132,7 +132,7 @@ mca_coll_han_allreduce_intra(const void *sbuf, | |||
seg_count); | |||
|
|||
/* Determine number of elements sent per task. */ | |||
OPAL_OUTPUT_VERBOSE((10, mca_coll_han_component.han_output, | |||
OPAL_OUTPUT_VERBOSE((30, mca_coll_han_component.han_output, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seemed rather noisy when I was trying to view my own logging. It has its own commit, maybe a separate PR would be better.
void *sbuf; | ||
void *rbuf; | ||
void *serialization_buffer; | ||
}; | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this sucks and all collectives have exactly the same issue. That's why we already have a function for it opal_datatype_span
.
continue; | ||
} | ||
rc = ompi_request_wait(&requests[jreq], MPI_STATUS_IGNORE); | ||
if (rc) break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw you use this everywhere. Safe programming is good, but can you please add some info about what went wrong instead of just breaking.
@@ -194,31 +270,43 @@ static inline int alltoallv_sendrecv_w_direct_for_debugging( | |||
jrecvs_completed++; | |||
} | |||
|
|||
requests[jreq] = &ompi_request_null.request; | |||
if (ii_send_req && jsends_posted < ntypes_send) { | |||
rc = ompi_datatype_create_contiguous( 1, (ompi_datatype_t*)send_types[jsends_posted], &yuck_ompi_dtype_from_opal ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you really called this datatype yuck !? It must be an acronym right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to synthesize it from an opal datatype on-demand so that I could pass it to sendrecv. I can rename it. Note this function is not called unless the code is modified.
if (rc) { break; }; | ||
} | ||
if (rc) { | ||
opal_output_verbose(1, mca_coll_han_component.han_output, | ||
"Failed in alltoallv_sendrecv_w_direct_for_debugging: jloop=%d, rc=%d\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what debugging ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That function is inefficient and not used. However if you doubt that something is correct in all the business logic of the main sendrecv_w, then it is a easy code change to use this function to double-check.
Is there a better way to keep such a function but also keep it out of the way?
@@ -250,16 +338,20 @@ static int alltoallv_sendrecv_w( | |||
buf_items[jbuf] = opal_free_list_get(&mca_coll_han_component.pack_buffers); | |||
if (buf_items[jbuf] == NULL) { | |||
nbufs = jbuf - 1; | |||
printf("Uh-oh, not enough buffers: %d\n",nbufs); | |||
opal_output_verbose(20, mca_coll_han_component.han_output, | |||
"Uh-oh, not enough buffers: %d\n",nbufs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, but maybe we can make this output a more user-friendly ?
We take path 2 here, to avoid possible race conditions between the | ||
cancel and the possibility of a matching recv. | ||
|
||
After realizing that the convertor may decide to not fully pack the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly confused about the meaning of this ? The convertor might decide to avoid stopping in the middle of a predefined type. Is this what this comment is about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's what I intended to say. I observed a strange type consisting of 1 char and 1 int, and the convertor refused to use the last 3 bytes of the buffer, which threw off my original counting scheme.
Thank you George. I will be on a vacation for the next week and then I will address your comments! |
This series of commits fixes several errors, and changes behavior on error conditions:
While debugging an application which exposed (2) above, I found that the rank which first experienced the error tried to cleanly exit with an error return code. However doing so while other ranks are accessing the shared memory region exposed by SMSC caused all the other local ranks to segfault. This hampered my progress during debug. To address it, I have moved the barrier that we previously only passed during successful exits, to be passed during all exit conditions. This will hang the execution instead of segfaulting other local ranks. I think this is better but I welcome other perspectives here.