Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coll/han/alltoallv Bugfixes #12806

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

lrbison
Copy link
Contributor

@lrbison lrbison commented Sep 9, 2024

This series of commits fixes several errors, and changes behavior on error conditions:

  1. This alltoallv was not written to support in-place transforms, but I forgot to check for it. Add a simple catch for that case.
  2. The looping logic was bad in cases where ranks had no data to receive. Fix that logic.
  3. There was an incorrect swap between send/recv datatypes.
  4. The support for datatypes with negative lower bounds was broken.
  5. Fix a logic error which resulted in consuming data out-of-order
  6. Fix cases where the convertor would refuse to use the last few bytes of a buffer.

While debugging an application which exposed (2) above, I found that the rank which first experienced the error tried to cleanly exit with an error return code. However doing so while other ranks are accessing the shared memory region exposed by SMSC caused all the other local ranks to segfault. This hampered my progress during debug. To address it, I have moved the barrier that we previously only passed during successful exits, to be passed during all exit conditions. This will hang the execution instead of segfaulting other local ranks. I think this is better but I welcome other perspectives here.

Copy link

github-actions bot commented Sep 9, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

e6c34d3: coll/han/allreduce: squelch a particularly noisy l...

  • check_signed_off: does not contain a valid Signed-off-by line

cb82581: coll/han/alltoallv: handle errors better

  • check_signed_off: does not contain a valid Signed-off-by line

e3e9e1f: coll/han/alltoallv: correct loop condition on empt...

  • check_signed_off: does not contain a valid Signed-off-by line

577968e: coll/han: alltoallv should fall-back if transform ...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@lrbison
Copy link
Contributor Author

lrbison commented Sep 11, 2024

A status update: These commits are good as they are, but I think I have found evidence of possible data corruption. I am working to identify and fix the issue, and then I'll add them to this PR.

Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

a50519c: coll/han/alltoallv: Correct swapped send/recv data...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

mca_coll_han_alltoallv_using_smsc is not compatible with an in-place
transform.  Update comments and fall-back logic to catch this case.

Signed-off-by: Luke Robison <[email protected]>
This change makes the loop condition more straightforward, and
corrects a logic error that was previously seen when a rank had 0
bytes to receive.

With the new logic the first pass through the receive-positing section
will walk through all datatypes, and never post a recv buffer.  Sender
logic remains unchanged.

Signed-off-by: Luke Robison <[email protected]>
Previously if one rank returned an error, it would immediately exit
the collective with an error which likely causes that pid to
immediately exit with an error.  Unfortunately this causes immediate
segfaults as other ranks may still be accessing shared memory, often
causing cascading walls of segfault text.

I have moved the existing MPI_barrier so that all processes must pass
it before exit regardless of error condition.  This means that either
errors will hang, or all ranks will encounter an error.

Signed-off-by: Luke Robison <[email protected]>
Previously this function considered only displ, count, and extent.
Since the function uses XPMEM to explicitly expose memory regions, we
must also be aware of types that have negative lower bounds and might
access data _before_ the user-provided pointer.

This change more accurately compute the true upper and lower bounds of
all memory accesses, both to insure we don't try to map regions of
memory that may not be in our VM page table, and to ensure we expose
all memory that will be accessed.

Signed-off-by: Luke Robison <[email protected]>
The logic for waitany was flawed.  While we could wait for either a
send or a receive, we cannot consume receives in any order, and
likewise for sends.  Fix this by simply ping-ponging between waiting
for sends or receives and cycling when there is nothing to wait on.

Signed-off-by: Luke Robison <[email protected]>
@lrbison
Copy link
Contributor Author

lrbison commented Sep 20, 2024

@bosilca At long last this is ready for review again. Thank you for your patience. I have made a rather exhaustive test suite. I am trying to figure out the best place to add it. Perhaps to MTT I suppose. Any other suggestions?

Change to a greedy recv-posting algorithm: always pre-post and cancel
later.  This is motivated by the observation that the convertor does
not always use every byte in the buffer, which makes the previous
method of counting how many bytes will arrive next innacurate.

Signed-off-by: Luke Robison <[email protected]>
@lrbison lrbison changed the title coll/han/alltoallv Bugfix: empty messages coll/han/alltoallv Bugfixes Sep 20, 2024
@@ -132,7 +132,7 @@ mca_coll_han_allreduce_intra(const void *sbuf,
seg_count);

/* Determine number of elements sent per task. */
OPAL_OUTPUT_VERBOSE((10, mca_coll_han_component.han_output,
OPAL_OUTPUT_VERBOSE((30, mca_coll_han_component.han_output,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed rather noisy when I was trying to view my own logging. It has its own commit, maybe a separate PR would be better.

void *sbuf;
void *rbuf;
void *serialization_buffer;
};

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this sucks and all collectives have exactly the same issue. That's why we already have a function for it opal_datatype_span.

continue;
}
rc = ompi_request_wait(&requests[jreq], MPI_STATUS_IGNORE);
if (rc) break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw you use this everywhere. Safe programming is good, but can you please add some info about what went wrong instead of just breaking.

@@ -194,31 +270,43 @@ static inline int alltoallv_sendrecv_w_direct_for_debugging(
jrecvs_completed++;
}

requests[jreq] = &ompi_request_null.request;
if (ii_send_req && jsends_posted < ntypes_send) {
rc = ompi_datatype_create_contiguous( 1, (ompi_datatype_t*)send_types[jsends_posted], &yuck_ompi_dtype_from_opal );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you really called this datatype yuck !? It must be an acronym right ?

Copy link
Contributor Author

@lrbison lrbison Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to synthesize it from an opal datatype on-demand so that I could pass it to sendrecv. I can rename it. Note this function is not called unless the code is modified.

if (rc) { break; };
}
if (rc) {
opal_output_verbose(1, mca_coll_han_component.han_output,
"Failed in alltoallv_sendrecv_w_direct_for_debugging: jloop=%d, rc=%d\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what debugging ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That function is inefficient and not used. However if you doubt that something is correct in all the business logic of the main sendrecv_w, then it is a easy code change to use this function to double-check.

Is there a better way to keep such a function but also keep it out of the way?

@@ -250,16 +338,20 @@ static int alltoallv_sendrecv_w(
buf_items[jbuf] = opal_free_list_get(&mca_coll_han_component.pack_buffers);
if (buf_items[jbuf] == NULL) {
nbufs = jbuf - 1;
printf("Uh-oh, not enough buffers: %d\n",nbufs);
opal_output_verbose(20, mca_coll_han_component.han_output,
"Uh-oh, not enough buffers: %d\n",nbufs);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but maybe we can make this output a more user-friendly ?

We take path 2 here, to avoid possible race conditions between the
cancel and the possibility of a matching recv.

After realizing that the convertor may decide to not fully pack the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly confused about the meaning of this ? The convertor might decide to avoid stopping in the middle of a predefined type. Is this what this comment is about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's what I intended to say. I observed a strange type consisting of 1 char and 1 int, and the convertor refused to use the last 3 bytes of the buffer, which threw off my original counting scheme.

@lrbison
Copy link
Contributor Author

lrbison commented Sep 20, 2024

Thank you George. I will be on a vacation for the next week and then I will address your comments!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants