PMIx Fence: single-job wildcard barrier WITH timeout #71

Open
artpol84 opened this issue Feb 10, 2021 · 3 comments


artpol84 commented Feb 10, 2021

Test description

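Verifies that PMIx_Fence actually synchronizes: when one rank is delayed, a Fence called with a timeout must return PMIX_ERR_TIMEOUT on the waiting ranks.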

Test sketch

#include "pmix.h"

/* Measure the typical fence execution time (worst case over 100 runs) */
double max_fence_time()
{
    double fence_time = 0, ts1, ts2;
    int i;

    for(i = 0; i < 100; i++) {
        ts1 = timestamp();
        PMIx_Fence(without_data_collection);
        ts2 = timestamp();
        fence_time = max(fence_time, ts2 - ts1);
    }
    return fence_time;
}

int main() {
    double T, fence_time, ts1, ts2;
    int rc;

    PMIx_Init();

    fence_time = max_fence_time();
    T = Ratio * fence_time; // Ratio might be 100, should be selected for the particular system

    /* Align all ranks before the delay is injected */
    PMIx_Fence(without_data_collection);
    if( rank == 0){
        sleep(T); /* rank 0 arrives late at the next fence */
    }
    ts1 = timestamp();
    rc = PMIx_Fence(without_data_collection, timeout = T/2);
    ts2 = timestamp();
    assert(rc == PMIX_ERR_TIMEOUT);
    /* The timeout must not fire immediately: elapsed time is about T/2 */
    assert((ts2 - ts1) ~ (T/2));
    PMIx_Finalize();
}
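The sketch leaves timestamp() and the "~" comparison undefined. One possible way to fill them in is shown below, as a minimal sketch only: it assumes a POSIX system with clock_gettime, and the helper name timeout_observed and its slack parameter are hypothetical, not part of PMIx or the test suite. The bounds follow the suggestion made in the comments on this issue (at least T/2 but no more than T), with a small tolerance on the lower bound.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Monotonic time in seconds; not affected by wall-clock adjustments. */
static double timestamp(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for the sketch's "~" check.
 * elapsed: measured duration of the timed fence (ts2 - ts1)
 * timeout: timeout passed to the fence (T/2 in the sketch)
 * delay:   delay injected on rank 0 (T in the sketch)
 * slack:   relative tolerance on the lower bound, e.g. 0.05 for 5%
 * Returns true if the fence waited for roughly the timeout and did not
 * last longer than the injected delay. */
static bool timeout_observed(double elapsed, double timeout,
                             double delay, double slack)
{
    assert(slack >= 0.0 && slack < 1.0);
    return elapsed >= timeout * (1.0 - slack) && elapsed <= delay;
}
```

With these helpers the last assertion of the sketch could read assert(timeout_observed(ts2 - ts1, T/2, T, slack)), where slack is chosen per system in the same way as Ratio.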

Execution details

  • 4 servers
  • 16 clients
  • Predefined (passed through cmdline) namespace
  • Predefined process placement: "0:0,1,2,3; 1:4,5,6,7; 2:8,9,10,11; 3:12,13,14,15;"
  • Ratio and "~" are selected to match the system
    • The time-dependent checks can be turned off
  • Execute M times to capture race conditions
  • The first rank simulates the delay; the test verifies that the Fence is really synchronizing
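The placement string above maps each server (node) index to the client ranks it hosts. As an illustration of how a test harness might decode it, here is a minimal sketch; the function name parse_placement and the node_of_rank output array are hypothetical, and the grammar ("node:rank,rank,...;" entries separated by semicolons and optional spaces) is inferred from the single example in this issue.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical parser for "node:rank,rank,...; node:rank,...;".
 * Stores node_of_rank[rank] = node for every listed rank.
 * Returns the number of ranks placed, or -1 on malformed input. */
static int parse_placement(const char *s, int *node_of_rank, int max_ranks)
{
    int placed = 0;
    char *end;

    assert(s != NULL && node_of_rank != NULL);
    while (*s != '\0') {
        /* Skip separators between node entries */
        while (isspace((unsigned char)*s) || *s == ';') s++;
        if (*s == '\0') break;

        /* Expect "node:" */
        long node = strtol(s, &end, 10);
        if (end == s || *end != ':' || node < 0) return -1;
        s = end + 1;

        /* Comma-separated ranks hosted by this node */
        for (;;) {
            long rank = strtol(s, &end, 10);
            if (end == s || rank < 0 || rank >= max_ranks) return -1;
            node_of_rank[rank] = (int)node;
            placed++;
            s = end;
            if (*s != ',') break;
            s++;
        }
    }
    return placed;
}
```

For the placement used in this test, the call is expected to place 16 ranks, with ranks 0 to 3 on node 0 and ranks 12 to 15 on node 3.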

Client-side expectations:

  1. All PMIx calls return PMIX_SUCCESS, except the Fence with the timeout, which returns PMIX_ERR_TIMEOUT.
  2. All ranks (except rank=0) experience a Fence timeout.

Server-side expectations:

  1. N invocations of:
    • client_connected
    • client_finalized
  2. Verify that the proc structure was set to the individual ranks.
  3. Two Fence callback invocations with WILDCARD.
  4. The time between the two Fence invocations on node0 is > T.
  5. Starting from "modex: avoid exchange unnecessary buffer when collect flag is not set openpmix#1135", the size of the Fence data should be 0 B.
  6. No other callbacks are called (no direct modex requests).
  7. The timeout should be observed, and the RTE server has to act accordingly by informing the PMIx server about it.
    (? Any event-related activity?)
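The Fence-related server-side expectations (two WILDCARD invocations, separated by more than T on node0, carrying no data) could be checked against a log of fence upcalls. The following is only a sketch under assumptions: the fence_event_t record and the fence_log_ok function are hypothetical and would have to be filled in by the test server's own fence callback; nothing here is an existing PMIx or test-suite API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record a test server could store for each fence upcall. */
typedef struct {
    double when;      /* arrival time of the upcall, in seconds */
    size_t ndata;     /* number of data bytes passed with the fence */
    bool   wildcard;  /* true if the fence was requested for the wildcard rank */
} fence_event_t;

/* Returns true if the log recorded on node0 matches the expectations:
 * exactly two fences, both WILDCARD, both with 0 bytes of data, and
 * more than T seconds between them. */
static bool fence_log_ok(const fence_event_t *ev, size_t n, double T)
{
    assert(ev != NULL || n == 0);
    if (n != 2) return false;
    for (size_t i = 0; i < n; i++) {
        if (!ev[i].wildcard || ev[i].ndata != 0) return false;
    }
    return (ev[1].when - ev[0].when) > T;
}
```

The gap check is meaningful only on node0, where the delayed rank lives; on the other nodes the same function could be called with T set to 0 to check only the count, the WILDCARD flag, and the data size.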

Reference implementation:

TBD

Notes

The test suite's RTE component should implement support for the PMIX_TIMEOUT info key in the pmix_server_fencenb_fn_t callback.
Currently, this support is missing.

artpol84 added the Unit Test Spec label Feb 10, 2021
jjhursey (Member) commented

This looks good to me. The only improvement I would suggest: after the assert in the snippet below, add a check that the time between ts1 and ts2 is at least T/2 but no more than T, to catch the case where the timeout occurred immediately.

    ts1 = timestamp();
    rc = PMIx_Fence(without_data_collection, timeout = T/2);
    ts2 = timestamp();
    assert(rc == PMIX_ERR_TIMEOUT);

artpol84 (Author) commented

Thank you, Josh. Done that!

cpshereda (Contributor) commented

@artpol84 and @jjhursey : See openpmix/openpmix#2098.

Note that, for now, running this test with more than two clients that time out will fail; see openpmix/openpmix#2096.
