
MPI_Comm_split_type not creating disjoint subgroups for certain cases #12812

Open

mshanthagit opened this issue Sep 16, 2024 · 27 comments

@mshanthagit (Contributor) commented Sep 16, 2024

Thank you for taking the time to submit an issue!

Background information

MPI_Comm_split_type is creating groups with overlapping ranks in certain cases where ranks are bound to cores across resource domains (say, L3). For example, consider the following, where ranks 2 and 5 each span two L3 domains (the program calls MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0, info, &newcomm) on a Genoa machine with 8 cores per L3):

mpirun -np 8 --map-by ppr:8:numa:pe=3 --report-bindings ./a.out
[electra016:1374129] Rank 0 bound to package[0][core:0-2]
[electra016:1374129] Rank 1 bound to package[0][core:3-5]
[electra016:1374129] Rank 2 bound to package[0][core:6-8]
[electra016:1374129] Rank 3 bound to package[0][core:9-11]
[electra016:1374129] Rank 4 bound to package[0][core:12-14]
[electra016:1374129] Rank 5 bound to package[0][core:15-17]
[electra016:1374129] Rank 6 bound to package[0][core:18-20]
[electra016:1374129] Rank 7 bound to package[0][core:21-23]
Hello --- my rank: 0, my comm_size: 8
Hello --- my rank: 1, my comm_size: 8
Hello --- my rank: 7, my comm_size: 8
Hello --- my rank: 6, my comm_size: 8
Hello --- my rank: 5, my comm_size: 8
Hello --- my rank: 4, my comm_size: 8
Hello --- my rank: 3, my comm_size: 8
Hello --- my rank: 2, my comm_size: 8
From split comm: my rank: 0, my split_comm_size: 3
From split comm: my rank: 2, my split_comm_size: 6
From split comm: my rank: 4, my split_comm_size: 4
From split comm: my rank: 6, my split_comm_size: 3
From split comm: my rank: 1, my split_comm_size: 3
From split comm: my rank: 3, my split_comm_size: 4
From split comm: my rank: 5, my split_comm_size: 6
From split comm: my rank: 7, my split_comm_size: 3

As we can see from the above, only two ranks report a sub-communicator size of 6, so the resulting groups cannot be disjoint! Although the program doesn't print out the ranks within each communicator, here is what they would be:

comm(0): 0, 1, 2
comm(1): 0, 1, 2
comm(2): 0, 1, 2, 3, 4, 5
comm(3): 2, 3, 4, 5
comm(4): 2, 3, 4, 5
comm(5): 2, 3, 4, 5, 6, 7
comm(6): 5, 6, 7
comm(7): 5, 6, 7
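The membership lists above can be reproduced with a small addition to the example program further below (a sketch using only standard MPI calls), which gathers and prints the MPI_COMM_WORLD ranks contained in each rank's split communicator:

/* Sketch: print the MPI_COMM_WORLD ranks seen inside each split communicator.
 * 'newcomm' is the communicator from MPI_Comm_split_type and 'rank' the world rank,
 * as in the reproducer below. */
int split_size;
MPI_Comm_size(newcomm, &split_size);
int *members = malloc(split_size * sizeof(int));
MPI_Allgather(&rank, 1, MPI_INT, members, 1, MPI_INT, newcomm);  /* every member contributes its world rank */
printf("comm(%d):", rank);
for (int i = 0; i < split_size; i++)
    printf(" %d", members[i]);
printf("\n");
free(members);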

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

I tested with 5.0.x and 4.1.6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source (5.0.x)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version:
  • Computer hardware:
  • Network type:

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Details in the background section. Here is an example program:

#include <stdlib.h>
#include <stdio.h>
#include "mpi.h"


int main (int argc, char **argv) {

   MPI_Init(&argc, &argv);

   int rank, size, newcomm_size;
   int status = 0;

   MPI_Comm newcomm;
   MPI_Info info;

   // Get the number of MPI processes and this process' rank:
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   printf("Hello --- my rank: %d, my comm_size: %d\n", rank, size);

   MPI_Info_create(&info);

   // Split MPI_COMM_WORLD by L3-cache locality (Open MPI-specific split type):
   status = MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0, info, &newcomm);

   if (status != MPI_SUCCESS) {
      printf("Error in comm split %d\n", status);
   }

   MPI_Comm_size(newcomm, &newcomm_size);
   printf("From split comm: my rank: %d, my split_comm_size: %d\n", rank, newcomm_size);

   MPI_Info_free(&info);
   MPI_Comm_free(&newcomm);
   MPI_Finalize();

   return status;
}

====================

Reproduce command (on a Genoa machine with 3 CCDs per NUMA domain and 8 cores per CCD):

shell$ mpirun -np 8 --map-by ppr:8:numa:pe=3 --report-bindings ./a.out
@bosilca (Member) commented Sep 16, 2024

MPI_Comm_split_type creates disjoint communicators, which means the sum of the sizes of the sub-communicators should be 8 (the size of your MPI_COMM_WORLD).

Please provide the output of hwloc-ls so we can see how the cores are numbered on the node.
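One way to verify the disjointness property mentioned above from the application side (a sketch using only standard MPI calls, meant to be dropped into the reproducer): have every rank compute the smallest MPI_COMM_WORLD rank in its split communicator as a group "leader"; in a disjoint split, the number of ranks reporting a given leader must equal the split size each of those ranks sees, and the distinct group sizes must sum to the world size.

/* Sketch: check from the application that the split is a true partition.
 * 'newcomm' is the communicator returned by MPI_Comm_split_type. */
int world_rank, world_size, split_size, leader;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_size(newcomm, &split_size);

/* the smallest world rank in my split communicator identifies my group */
MPI_Allreduce(&world_rank, &leader, 1, MPI_INT, MPI_MIN, newcomm);

int *leaders = malloc(world_size * sizeof(int));
int *sizes   = malloc(world_size * sizeof(int));
MPI_Allgather(&leader, 1, MPI_INT, leaders, 1, MPI_INT, MPI_COMM_WORLD);
MPI_Allgather(&split_size, 1, MPI_INT, sizes, 1, MPI_INT, MPI_COMM_WORLD);

if (world_rank == 0) {
    for (int i = 0; i < world_size; i++) {
        int count = 0;
        for (int j = 0; j < world_size; j++)
            if (leaders[j] == leaders[i]) count++;
        /* disjoint split: all ranks agreeing on a leader form exactly that group */
        if (count != sizes[i])
            printf("rank %d: group led by %d looks inconsistent (%d members vs size %d)\n",
                   i, leaders[i], count, sizes[i]);
    }
}
free(leaders);
free(sizes);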

@edgargabriel (Member)

Based on the output, it looks like rank 2 (which is bound to cores spanning two L3 domains, since cores 0-7 are on the first L3 domain, cores 8-15 on the second, and so on) created a communicator that includes all ranks that are bound to either the first or the second L3 domain (which is why it ends up with 6 ranks). Rank 0, on the other hand, is bound only to cores of the first L3 domain, so its communicator includes only the ranks bound to that domain.
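To illustrate the spanning (a minimal sketch, not the Open MPI code path, assuming hwloc 2.x where HWLOC_OBJ_L3CACHE exists), one can count how many L3 cpusets a process's binding overlaps; a count greater than one means the rank straddles domains, as ranks 2 and 5 do here:

/* Sketch: count how many L3 caches the current process binding overlaps. */
#include <stdio.h>
#include <hwloc.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_bitmap_t binding = hwloc_bitmap_alloc();
    int n_l3, i, overlapped = 0;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    hwloc_get_cpubind(topo, binding, HWLOC_CPUBIND_PROCESS);   /* error handling omitted */

    n_l3 = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
    for (i = 0; i < n_l3; i++) {
        hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);
        if (hwloc_bitmap_intersects(binding, l3->cpuset))
            overlapped++;                                      /* binding touches this L3 */
    }
    printf("binding overlaps %d L3 domain(s)\n", overlapped);

    hwloc_bitmap_free(binding);
    hwloc_topology_destroy(topo);
    return 0;
}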

@edgargabriel (Member)

$ hwloc-ls
  Package L#0
    NUMANode L#0 (P#0 125GB)
    L3 L#0 (32MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#48)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#49)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#50)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#51)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#52)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#53)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#54)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#55)
    L3 L#1 (32MB)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#56)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#57)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#58)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#59)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#60)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#61)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#62)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#63)
    L3 L#2 (32MB)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#64)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#65)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#66)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#67)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#68)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#69)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#70)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#71)
...
  Package L#1
    NUMANode L#1 (P#1 126GB)
    L3 L#3 (32MB)
      L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
        PU L#48 (P#24)
        PU L#49 (P#72)
      L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
        PU L#50 (P#25)
        PU L#51 (P#73)
      L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
        PU L#52 (P#26)
        PU L#53 (P#74)
      L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
        PU L#54 (P#27)
        PU L#55 (P#75)
      L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
        PU L#56 (P#28)
        PU L#57 (P#76)
      L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
        PU L#58 (P#29)
        PU L#59 (P#77)
      L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
        PU L#60 (P#30)
        PU L#61 (P#78)
      L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
        PU L#62 (P#31)
        PU L#63 (P#79)
    L3 L#4 (32MB)
      L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
        PU L#64 (P#32)
        PU L#65 (P#80)
      L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
        PU L#66 (P#33)
        PU L#67 (P#81)
      L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
        PU L#68 (P#34)
        PU L#69 (P#82)
      L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
        PU L#70 (P#35)
        PU L#71 (P#83)
      L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36
        PU L#72 (P#36)
        PU L#73 (P#84)
      L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37
        PU L#74 (P#37)
        PU L#75 (P#85)
      L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
        PU L#76 (P#38)
        PU L#77 (P#86)
      L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
        PU L#78 (P#39)
        PU L#79 (P#87)
    L3 L#5 (32MB)
      L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40
        PU L#80 (P#40)
        PU L#81 (P#88)
      L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41
        PU L#82 (P#41)
        PU L#83 (P#89)
      L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42
        PU L#84 (P#42)
        PU L#85 (P#90)
      L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43
        PU L#86 (P#43)
        PU L#87 (P#91)
      L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44
        PU L#88 (P#44)
        PU L#89 (P#92)
      L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45
        PU L#90 (P#45)
        PU L#91 (P#93)
      L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46
        PU L#92 (P#46)
        PU L#93 (P#94)
      L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47
        PU L#94 (P#47)
        PU L#95 (P#95)

@bosilca (Member) commented Sep 16, 2024

Funny, that's kind of what I suspected based on the published documentation of the Genoa architecture, but with the topology things are much clearer.

OMPI's current split-by-type code uses topology masks (combined with an OR operation), which explains why the ranks mapped across multiple levels think they belong to a larger group. This also means OMPI only supports symmetric cases, a sensible approach from my perspective.

Indeed, taking into account that the split creates disjoint communicators, what should we expect from a non-symmetric case like the one here? One potential outcome could be to eliminate processes spanning multiple domains from the split operation and return something like this:

comm(0): 0, 1
comm(1): 0, 1
comm(2): MPI_COMM_SELF
comm(3): 3, 4
comm(4): 3, 4
comm(5): MPI_COMM_SELF
comm(6): 6, 7
comm(7): 6, 7

In this case it is not obvious to users why some processes (here 2 and 5) are not part of a larger set.

@mshanthagit (Contributor, Author)

Is it still possible to create disjoint communicators for the non-symmetric cases, by assigning each rank to a specific communicator based on some rule, say "I only belong to the first domain (in some ordering)"?

@edgargabriel (Member)

@mshanthagit that is also what I was about to suggest :-)

@bosilca I agree that there is no clear and good solution in this scenario, so the primary goal has to be that it is consistent across processes. That being said, having processes end up just by themselves in a communicator is probably not desirable.

Would it make sense to have a rule along the lines of: if a process is part of multiple L3 domains, Comm_split_type is applied as if the process were only part of the first domain it belongs to?

@bosilca (Member) commented Sep 16, 2024

What is the first domain? One could argue that a process should belong to the domain where most of its resources are bound, or, on the contrary, to the domain where the fewest of its resources are bound.

I quickly looked into the code base, and we use this API extensively in the collective modules. We need to find a solution that would make sense for our internal collectives as well.

@edgargabriel (Member)

What is the first domain? One could argue that a process should belong to the domain where most of its resources are bound, or, on the contrary, to the domain where the fewest of its resources are bound.

If we can easily figure out where most of its resources are bound, sure. Otherwise, I would define the 'first domain' as the domain of the first core the process is bound to.

I quickly looked into the code base, and we use this API extensively in the collective modules. We need to find a solution that would make sense for our internal collectives as well.

yes, that's how we found the issue, from the usage in a collective component :-)
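A minimal sketch of that "first domain" rule, assuming hwloc 2.x with an already-loaded topology ('topo') and the process binding cpuset ('binding'): take the L3 ancestor of the first PU in the binding and treat it as the single domain used for the split.

/* Sketch: choose the L3 domain of the first PU the process is bound to. */
int first_os_index = hwloc_bitmap_first(binding);                 /* lowest OS index in the binding */
hwloc_obj_t pu = hwloc_get_pu_obj_by_os_index(topo, first_os_index);
hwloc_obj_t l3 = hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_L3CACHE, pu);
/* l3->logical_index could then serve as the "color" used to split */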

@bosilca (Member) commented Sep 17, 2024

The simple fix, aka strict matching, is relatively easy to do by changing hwloc_bitmap_intersects to hwloc_bitmap_isequal in opal_hwloc_compute_relative_locality (opal/mca/hwloc/base/hwloc_base_util.c:638). The more complex matching that @edgargabriel proposed would require a complete rewrite of opal_hwloc_compute_relative_locality and a different way of storing the locality in PMIx and opal_proc_t.
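For reference, the semantic difference between those two hwloc calls (an illustrative fragment with made-up bitmaps, not the actual opal_hwloc_compute_relative_locality code):

/* Sketch: hwloc_bitmap_intersects() is true if two cpusets share any bit,
 * hwloc_bitmap_isequal() only if they are identical. With "OR"-style locality
 * a rank whose cpuset merely overlaps another is considered co-located;
 * with strict matching it is not. */
hwloc_bitmap_t a = hwloc_bitmap_alloc();
hwloc_bitmap_t b = hwloc_bitmap_alloc();
hwloc_bitmap_set_range(a, 6, 8);   /* e.g. rank 2's cores 6-8 */
hwloc_bitmap_set_range(b, 0, 7);   /* e.g. the first L3's cores 0-7 */

int overlap = hwloc_bitmap_intersects(a, b);   /* 1: cores 6-7 are shared */
int same    = hwloc_bitmap_isequal(a, b);      /* 0: the sets differ      */
printf("intersects=%d isequal=%d\n", overlap, same);

hwloc_bitmap_free(a);
hwloc_bitmap_free(b);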

@mshanthagit (Contributor, Author) commented Sep 17, 2024

@bosilca what does strict matching do? Say in the example above, how does the split happen?

@bosilca (Member) commented Sep 17, 2024

Strictly the same binding at a specific level. I gave the outcome for the example provided here a few comments above. On Genoa, with the bindings provided by the user (-np 8 --map-by ppr:8:numa:pe=3), the split by type for L3 locality would return:

comm(0): 0, 1
comm(1): 0, 1
comm(2): MPI_COMM_SELF
comm(3): 3, 4
comm(4): 3, 4
comm(5): MPI_COMM_SELF
comm(6): 6, 7
comm(7): 6, 7

@edgargabriel (Member)

@bosilca this is not consistent: rank 2 cannot think that it is alone in its communicator while ranks 0 and 1 think that it is part of their subgroup.

@mshanthagit (Contributor, Author) commented Sep 17, 2024

@bosilca I think it could lead to incorrect behavior (pardon me if I am wrong). Say I do an allreduce with the new comm after split, what's the behavior? What will rank 2 have?

Ignore, I thought I saw 2 in other comms

@bosilca (Member) commented Sep 17, 2024

@mshanthagit as @edgargabriel noticed, my example was incorrect. It should now be fixed: ranks 2 and 5 will each be alone in their own communicator.

@edgargabriel (Member)

Maybe we should do a 2-step process: first step is to add the simple solution that @bosilca suggested, and backport it to 5.0.x and 4.1.x. There is clearly value in this solution, since it fixes an inconsistency.

If somebody has the cycles to work on my suggested approach, we can still do that at a later stage, maybe for 6.0. (There is a good chance that it might not happen, though, in which case we at least have the first fix.)

@mshanthagit (Contributor, Author)

Just thinking out loud, the solution @bosilca suggested creates 5 communicators whereas one would expect 3 (as there are 3 L3 domains). Will there be any side effects?

@bosilca (Member) commented Sep 17, 2024

Honestly, I think that users binding processes as in the example here (overlapping several domains) deserve what they get, and any split is fine as long as it is consistent across the board. The strict mode has the advantage of being a one-liner.

@edgargabriel (Member) commented Sep 17, 2024

I don't necessarily disagree with you, @bosilca, I would just caution that sometimes these decisions are not driven by MPI requirements. In this particular instance, the decision was driven by compute performance, which was significantly better than with --map-by ppr:8:numa:pe=2 or --map-by ppr:8:numa:pe=4, and we just have to find a way to make the best of that on the MPI side.

@rhc54 (Contributor) commented Sep 17, 2024

Or maybe a different mapping pattern? Not sure exactly what you are trying to achieve, but it seems like mapping to L3 is something we have already enabled, so I'm a tad confused.

@edgargabriel (Member)

I am happy to take any help, @rhc54. I couldn't find a way to map 8 processes onto a group of 3 CCDs, that is, the first three ranks to the first L3 domain, the next three ranks to the second L3 domain, and the last two ranks to the third L3 domain, given that the node has a second package with another 3 L3 domains on which we would like to repeat the same pattern. I tried mapping by L3 domain, and it didn't do what we wanted, not because the mapping didn't work correctly, but because I couldn't find a syntax to express this slight imbalance of 3/3/2 ranks per L3 domain.

@bosilca (Member) commented Sep 17, 2024

There is only so much we can do automatically in the MPI library. For everything else, the users can fall back to either a manual MPI_Comm_split or to a guided split. However, for the more precise MPI_COMM_TYPE_HW_GUIDED split the MPI standard clarifies:

The MPI processes in the group associated with the output communicator newcomm utilize that specific hardware resource type instance, and no other instance of the same hardware resource type.
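For completeness, a guided split looks roughly like this (a sketch: MPI_COMM_TYPE_HW_GUIDED and the "mpi_hw_resource_type" info key are MPI-4.0; the set of accepted resource-type strings is implementation-defined, so the "L3Cache" value is an assumption here):

/* Sketch: MPI-4.0 guided split by hardware resource type. */
MPI_Info info;
MPI_Comm l3comm;
MPI_Info_create(&info);
MPI_Info_set(info, "mpi_hw_resource_type", "L3Cache");   /* value string assumed */
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_HW_GUIDED, 0, info, &l3comm);
if (MPI_COMM_NULL == l3comm) {
    /* this process does not use exactly one instance of the requested resource */
}
MPI_Info_free(&info);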

@edgargabriel (Member)

@bosilca does that mean that a process that does not fulfill this criterion (i.e., utilize that specific hardware resource type instance, and no other instance of the same hardware resource type) should actually not be part of any resulting communicator, i.e., receive MPI_COMM_NULL?

@rhc54 (Contributor) commented Sep 17, 2024

I am happy to take any help, @rhc54. I couldn't find a way to map 8 processes onto a group of 3 CCDs, that is, the first three ranks to the first L3 domain, the next three ranks to the second L3 domain, and the last two ranks to the third L3 domain, given that the node has a second package with another 3 L3 domains on which we would like to repeat the same pattern. I tried mapping by L3 domain, and it didn't do what we wanted, not because the mapping didn't work correctly, but because I couldn't find a syntax to express this slight imbalance of 3/3/2 ranks per L3 domain.

Yeah, we do that frequently - can you post your topology so I can give you the correct cmd line?

@edgargabriel (Member)

@rhc54 the output of hwloc-ls is listed above in the ticket, is that what you are looking for, or do you need additional information?

@rhc54 (Contributor) commented Sep 18, 2024

I need the actual topology file output - the XML output - so I can use it as input to PRRTE.

@bosilca (Member) commented Sep 18, 2024

@bosilca does that mean that a process that does not fulfill this criterion (i.e., utilize that specific hardware resource type instance, and no other instance of the same hardware resource type) should actually not be part of any resulting communicator, i.e., receive MPI_COMM_NULL?

Yes, we can return MPI_COMM_NULL instead of MPI_COMM_SELF for the processes that are not strictly bound to a single instance of the specified resource.
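If that behavior is adopted, callers would have to guard for it; e.g., the reproducer above would change roughly as follows (sketch):

/* Sketch: tolerate MPI_COMM_NULL from the split. */
MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0, info, &newcomm);
if (MPI_COMM_NULL == newcomm) {
    printf("rank %d is not strictly contained in a single L3 domain\n", rank);
} else {
    MPI_Comm_size(newcomm, &newcomm_size);
    printf("From split comm: my rank: %d, my split_comm_size: %d\n", rank, newcomm_size);
    MPI_Comm_free(&newcomm);
}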

@edgargabriel (Member)

I need the actual topology file output - the XML output - so I can use it as input to PRRTE.

@rhc54 I pinged you on Slack for that, thank you!
