
5.0.4 and newer -- LSF Affinity hostfile bug #12794

Open · zerothi opened this issue Sep 5, 2024 · 20 comments

zerothi (Contributor) commented Sep 5, 2024

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

I am testing this on 5.0.3 vs. 5.0.5 (only 5.0.5 has the problem).
I don't have 5.0.4 installed, so I cannot confirm whether it is affected, but I am
quite confident that it is, since the prrte submodule is the same for 5.0.4 and 5.0.5.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From the sources. Some relevant ompi_info -c output:

 Configure command line: 'CC=gcc' 'CXX=g++' 'FC=gfortran'
                          '--prefix=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-lsf=/lsf/10.1'
                          '--with-lsf-libdir=/lsf/10.1/linux3.10-glibc2.17-x86_64/lib'
                          '--without-tm' '--enable-mpi-fortran=all'
                          '--with-hwloc=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--enable-orterun-prefix-by-default'
                          '--with-ucx=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-ucc=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-knem=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--without-verbs' 'FCFLAGS=-O3 -march=haswell
                          -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32' 'CFLAGS=-O3
                          -march=haswell -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32'
                          'CXXFLAGS=-O3 -march=haswell -mtune=haswell -mavx2
                          -m64  -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32'
                          '--with-ofi=no'
                          '--with-libevent=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          'LDFLAGS=-L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -Wl,-rpath,/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -Wl,-rpath,/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -lucp  -levent -lhwloc -latomic -llsf -lm -lpthread
                          -lnsl -lrt'
                          '--with-xpmem=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'

And env-vars:

            Build CFLAGS: -DNDEBUG -O3 -march=haswell -mtune=haswell -mavx2
                          -m64  -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32
                          -finline-functions
           Build FCFLAGS: -O3 -march=haswell -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32
           Build LDFLAGS: -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -Wl,-rpath,/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -Wl,-rpath,/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -lucp  -levent -lhwloc -latomic -llsf -lm -lpthread
                          -lnsl -lrt
                          -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
              Build LIBS: -levent_core -levent_pthreads -lhwloc
                          /tmp/sebo3-gcc-13.3.0-binutils-2.42/openmpi-5.0.3/3rd-party/openpmix/src/libpmix.la

Version numbers are of course different for the 5.0.5 build; otherwise the configuration is the same.

Please describe the system on which you are running

  • Operating system/version:

    Alma Linux 9.4

    $> cat /proc/version
    Linux version 6.1.106-1.el9.elrepo.x86_64 (mockbuild@83178ea248724ccf8c107949ffbafbc2) (gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3), GNU ld version 2.35.2-43.el9) #1 SMP PREEMPT_DYNAMIC Mon Aug 19 02:01:39 EDT 2024
  • Computer hardware:

    Tested on various hardware, both with and without hardware threads (see below).

  • Network type:
    Not relevant, I think.


Details of the problem

The problem relates to the interaction between LSF and Open MPI.

There are a couple of issues shown here.

Bug introduced between 5.0.3 and 5.0.5

I encounter problems running simple programs (hello-world) in a multinode configuration:

$> bsub -n 8 -R "span[ptile=2]" ... < run.bsub

$> cat run.bsub
...
mpirun --report-bindings a.out

This will run on 4 nodes, each using 2 cores.

Output from:

  • 5.0.3:

    [n-62-28-31:793074] Rank 0 bound to package[1][hwt:14]
    [n-62-28-31:793074] Rank 1 bound to package[1][hwt:15]
    [n-62-28-28:3418906] Rank 2 bound to package[1][hwt:12]
    [n-62-28-28:3418906] Rank 3 bound to package[1][hwt:13]
    [n-62-28-29:1577632] Rank 4 bound to package[1][hwt:12]
    [n-62-28-29:1577632] Rank 5 bound to package[1][hwt:13]
    [n-62-28-30:53375] Rank 6 bound to package[1][hwt:12]
    [n-62-28-30:53375] Rank 7 bound to package[1][hwt:13]

    This looks reasonable, and the LSF affinity file corresponds to this binding.

    Note that these nodes do not have hyper-threading enabled, so our guess is
    that LSF always expresses affinity in terms of HWTs, which is OK. It still
    obeys the default core binding, which is what our end-users would expect.

  • 5.0.5

    [n-62-28-31:793073:0:793073] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x34)
    ==== backtrace (tid: 793073) ====
     0 0x000000000003e6f0 __GI___sigaction()  :0
     1 0x00000000000e94f4 prte_rmaps_rf_lsf_convert_affinity_to_rankfile()  rmaps_rank_file.c:0
     2 0x00000000000e8fc1 prte_rmaps_rf_process_lsf_affinity_hostfile()  rmaps_rank_file.c:0
     3 0x00000000000e684e prte_rmaps_rf_map()  rmaps_rank_file.c:0
     4 0x00000000000da965 prte_rmaps_base_map_job()  ???:0
     5 0x0000000000027cf9 event_process_active_single_queue()  event.c:0
     6 0x000000000002856f event_base_loop()  ???:0
     7 0x000000000040761a main()  ???:0
     8 0x0000000000029590 __libc_start_call_main()  ???:0
     9 0x0000000000029640 __libc_start_main_alias_2()  :0
    10 0x0000000000407b05 _start()  ???:0
    =================================

    Clearly something went wrong when parsing the affinity hostfile.

    The hostfile looks like this (for both 5.0.3 and 5.0.5):

    $> cat $LSB_AFFINITY_HOSTFILE
    n-62-28-31 16
    n-62-28-31 17
    n-62-28-28 14
    n-62-28-28 15
    n-62-28-29 14
    n-62-28-29 15
    n-62-28-30 14
    n-62-28-30 15

    (different job, hence different nodes/ranks)

So the above indicates some regression in this handling. I tried to track it
down in prrte, but I am not familiar enough with the logic there.

I tracked the submodule hashes of Open MPI between 5.0.3 and 5.0.4 to these:

So my suspicion is that 5.0.4 also has this.
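For reference, the affinity hostfile format that the converter has to handle is
simply "hostname cpu[,cpu...]" per rank, as shown above. A minimal standalone
parsing sketch (not the PRRTE implementation, just an illustration of the input
format) could look like:

/* sketch: parse LSB_AFFINITY_HOSTFILE lines of the form "hostname cpu[,cpu...]" */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *path = getenv("LSB_AFFINITY_HOSTFILE");
    FILE *fp = path ? fopen(path, "r") : NULL;
    if (!fp) return 1;

    char line[1024];
    int rank = 0;
    while (fgets(line, sizeof(line), fp)) {
        char *host = strtok(line, " \t\n");
        char *cpus = strtok(NULL, " \t\n");
        if (!host || !cpus) continue;   /* skip empty or malformed lines */
        printf("rank %d -> host %s, cpus", rank++, host);
        for (char *c = strtok(cpus, ","); c; c = strtok(NULL, ","))
            printf(" %s", c);
        printf("\n");
    }
    fclose(fp);
    return 0;
}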

Now, these things are relatively easily fixed.

I just do:

unset LSB_AFFINITY_HOSTFILE

and rely on cgroups. Then I get the correct behaviour: correct bindings, etc.
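In practice this just means unsetting the variable in the job script before
launching, e.g. (a sketch of the relevant lines in run.bsub; the script itself
is abbreviated above):

unset LSB_AFFINITY_HOSTFILE
mpirun --report-bindings ./a.out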

By unsetting it, I also fall back to the default Open MPI binding:

  • 5.0.3

    $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
    n-62-28-31 18
    n-62-28-31 19
    n-62-28-28 16
    n-62-28-28 17
    n-62-28-29 16
    n-62-28-29 17
    n-62-28-30 16
    n-62-28-30 17
    (ompi binding)
    [n-62-28-31:793075] Rank 0 bound to package[1][core:18]
    [n-62-28-31:793075] Rank 1 bound to package[1][core:19]
    [n-62-28-28:3418905] Rank 2 bound to package[1][core:16]
    [n-62-28-28:3418905] Rank 3 bound to package[1][core:17]
    [n-62-28-29:1577633] Rank 4 bound to package[1][core:16]
    [n-62-28-29:1577633] Rank 5 bound to package[1][core:17]
    [n-62-28-30:53374] Rank 6 bound to package[1][core:16]
    [n-62-28-30:53374] Rank 7 bound to package[1][core:17]

    Note here that it says core instead of hwt.

  • 5.0.5

    $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
    n-62-28-28 18
    n-62-28-28 19
    n-62-28-29 18
    n-62-28-29 19
    n-62-28-30 18
    n-62-28-30 19
    n-62-28-33 12
    n-62-28-33 13
    (ompi binding)
    [n-62-28-28:3418897] Rank 0 bound to package[1][core:18]
    [n-62-28-28:3418897] Rank 1 bound to package[1][core:19]
    [n-62-28-29:1577625] Rank 2 bound to package[1][core:18]
    [n-62-28-29:1577625] Rank 3 bound to package[1][core:19]
    [n-62-28-33:2083367] Rank 7 bound to package[1][core:13]
    [n-62-28-33:2083367] Rank 6 bound to package[1][core:12]
    [n-62-28-30:53366] Rank 4 bound to package[1][core:18]
    [n-62-28-30:53366] Rank 5 bound to package[1][core:19]

    So the same thing happens, good!

Nodes with HW threads

This is likely related to the above; I just put it here for completeness.

As mentioned above, I can do unset LSB_AFFINITY_HOSTFILE and get correct bindings.

However, that only works when there are no HWTs.

Here is the same thing for nodes with 2 HWTs/core (EPYC Milan, 32 cores/socket,
2 sockets), only requesting 4 cores here.

  • 5.0.3

    $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
    n-62-12-14 4,68
    n-62-12-14 5,69
    n-62-12-15 4,68
    n-62-12-15 5,69
    (ompi binding)
    [n-62-12-14:202682] Rank 0 bound to package[0][core:4]
    [n-62-12-14:202682] Rank 1 bound to package[0][core:5]
    [n-62-12-15:1179019] Rank 2 bound to package[0][core:4]
    [n-62-12-15:1179019] Rank 3 bound to package[0][core:5]

    This looks OK. Still binding to the cgroup cores.

  • 5.0.5

    $> cat $LSB_AFFINITY_HOSTFILE ; unset LSB_AFFINITY_HOSTFILE
    n-62-12-14 6,70
    n-62-12-14 7,71
    n-62-12-15 6,70
    n-62-12-15 7,71
    (ompi binding)
    [n-62-12-14:202680] Rank 0 bound to package[0][core:0]
    [n-62-12-14:202680] Rank 1 bound to package[0][core:1]
    [n-62-12-15:1179020] Rank 2 bound to package[0][core:0]
    [n-62-12-15:1179020] Rank 3 bound to package[0][core:1]

    This looks bad: the core binding is wrong; it should have been cores 6 and 7 on both nodes.

If you need more information, let me know!

rhc54 (Contributor) commented Sep 5, 2024

Not terribly surprising - LSF support was transferred to IBM, which subsequently left the OMPI/PRRTE projects. So nobody has been supporting LSF stuff, and being honest, nobody has access to an LSF system. So I'm not sure I see a clear path forward here - might be that LSF support is coming to an end, or is at least somewhat modified/limited (perhaps need to rely solely on rankfile and use of revised cmd line options - may not be able to utilize LSF "integration").

Rankfile mapping support is a separate issue that has nothing to do with LSF, so that can perhaps be investigated if you can provide your topology file in an HWLOC XML format. Will likely take a while to get a fix as support time is quite limited.
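For what it's worth, a hedged sketch of the rankfile route mentioned above
(host names taken from the job output; slot numbers purely illustrative, and
the exact --map-by rankfile syntax should be checked against your mpirun(1)
version):

$> cat myrankfile
rank 0=n-62-28-31 slot=16
rank 1=n-62-28-31 slot=17
rank 2=n-62-28-28 slot=14
rank 3=n-62-28-28 slot=15

$> mpirun --map-by rankfile:file=myrankfile --report-bindings ./a.out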

zerothi (Contributor, Author) commented Sep 5, 2024

As for LSF support... damn...

As for rankfile mapping: I got the topology through:

hwloc-gather-topology test

I have never done that before, so let me know whether that is correct.

(I couldn't upload XML files, so it had to be compressed.)
test.xml.gz
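For reference, hwloc-gather-topology <prefix> writes <prefix>.xml plus a
tarball of the relevant /sys and /proc files, so the steps were roughly:

$> hwloc-gather-topology test
$> gzip test.xml     # the raw .xml could not be attached, hence the .gz

The compressed XML is what is attached above.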

rhc54 (Contributor) commented Sep 5, 2024

> As for LSF support... damn...

Best I can suggest is that you contact IBM through your LSF contract support and point out that if they want OMPI to function on LSF going forward, they probably need to put a little effort into supporting it. 🤷‍♂️

XML looks fine - thanks! Will update as things progress.

sb22bs commented Sep 6, 2024

prrte @ 42169d1cebf75318ced0306172d3a452ece13352 is the last good one,
prrte @ f297a9e2eb96c2db9d7756853f56315ea5a127cd seems to break it (at least in our setup).

sb22bs commented Sep 7, 2024

workaround:
export HWLOC_ALLOW=all
:-)

rhc54 (Contributor) commented Sep 7, 2024

Ouch - I would definitely advise against doing so. It might work for a particular application, but almost certainly will cause breakage in general.

fabiosanger commented:

> Not terribly surprising - LSF support was transferred to IBM, which subsequently left the OMPI/PRRTE projects. So nobody has been supporting LSF stuff, and being honest, nobody has access to an LSF system. So I'm not sure I see a clear path forward here - might be that LSF support is coming to an end, or is at least somewhat modified/limited (perhaps need to rely solely on rankfile and use of revised cmd line options - may not be able to utilize LSF "integration").
>
> Rankfile mapping support is a separate issue that has nothing to do with LSF, so that can perhaps be investigated if you can provide your topology file in an HWLOC XML format. Will likely take a while to get a fix as support time is quite limited.

But the documentation still suggests that Open MPI can be built with LSF support.

rhc54 (Contributor) commented Sep 9, 2024

> But the documentation still suggests that Open MPI can be built with LSF support.

Should probably be updated to indicate that it is no longer being tested, and so may or may not work. However, these other folks apparently were able to build on LSF, so I suspect this is more likely to be a local problem.

fabiosanger commented:

> prrte @ 42169d1cebf75318ced0306172d3a452ece13352

Which Open MPI release is that?

fabiosanger commented:

> But the documentation still suggests that Open MPI can be built with LSF support.
>
> Should probably be updated to indicate that it is no longer being tested, and so may or may not work. However, these other folks apparently were able to build on LSF, so I suspect this is more likely to be a local problem.

Thank you.

rhc54 (Contributor) commented Sep 9, 2024

> prrte @ 42169d1cebf75318ced0306172d3a452ece13352
>
> Which Open MPI release is that?

They seem to indicate that v5.0.3 is working, but all the v5.0.x appear to at least build for them.

zerothi (Contributor, Author) commented Sep 9, 2024

Yes, everything builds. So that isn't the issue. It is rather that it isn't working as intended 😢

fabiosanger commented:

I tried all 5.0.x versions, but they won't pass configure. I managed to build 4.0.3 with:

./configure --with-lsf=/usr/local/lsf/10.1 --with-lsf-libdir="${LSF_LIBDIR}" --with-cuda=/usr/local/cuda --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda LIBS="-levent" LDFLAGS="-L/usr/lib/x86_64-linux-gnu" --prefix=/software/openmpi-4.0.3-cuda

fabiosanger commented:

> Yes, everything builds. So that isn't the issue. It is rather that it isn't working as intended 😢

I use the tarball to build; could that be the problem?

zerothi (Contributor, Author) commented Sep 9, 2024

> I tried all 5.0.x versions, but they won't pass configure. I managed to build 4.0.3 with:
>
> ./configure --with-lsf=/usr/local/lsf/10.1 --with-lsf-libdir="${LSF_LIBDIR}" --with-cuda=/usr/local/cuda --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda LIBS="-levent" LDFLAGS="-L/usr/lib/x86_64-linux-gnu" --prefix=/software/openmpi-4.0.3-cuda

You could check whether there is anything important in our configure step that you might be missing (see the initial message).

> Yes, everything builds. So that isn't the issue. It is rather that it isn't working as intended 😢

> I use the tarball to build; could that be the problem?

I don't know what the issue could be, but I think it shouldn't clutter this thread; rather open a new issue, IMHO. This issue is deeper (not a build issue).

fabiosanger commented:

I did open a ticket

rhc54 (Contributor) commented Sep 9, 2024

> workaround: export HWLOC_ALLOW=all :-)

@bgoglin This implies that the envar is somehow overriding the flags we pass into the topology discovery API - is that true? Just wondering how we ensure that hwloc is accurately behaving as we request.

bgoglin (Contributor) commented Sep 9, 2024

> workaround: export HWLOC_ALLOW=all :-)
>
> @bgoglin This implies that the envar is somehow overriding the flags we pass into the topology discovery API - is that true? Just wondering how we ensure that hwloc is accurately behaving as we request.

This envvar disables the gathering of things like cgroups (I'll check if I need to clarify that). It doesn't override something configured in the API (like changing topology flags to include/exclude cgroup-disabled resources), but rather considers that everything is enabled in cgroups (hence the INCLUDE_DISALLOWED flag becomes a noop).

rhc54 (Contributor) commented Sep 10, 2024

> This envvar disables the gathering of things like cgroups (I'll check if I need to clarify that). It doesn't override something configured in the API (like changing topology flags to include/exclude cgroup-disabled resources), but rather considers that everything is enabled in cgroups (hence the INCLUDE_DISALLOWED flag becomes a noop).

@bgoglin Hmmm...we removed this code from PRRTE:

        flags |= HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED;

because we only want the topology to contain the CPUs the user is allowed to use (note: all CPUs will still be in the complete_cpuset field if we need them - we use the return from hwloc_topology_get_allowed_cpuset). If the topology includes all CPUs (which is what happens when we include the above line of code), then we wind up thinking we can use them, which messes up the mapping/binding algorithm. So what I need is a way of not allowing the user to override that requirement by setting this envar. Might help a particular user in a specific situation, but more generally causes problems.
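To illustrate the intent with a minimal standalone sketch (not the PRRTE code): load the topology without that flag and take the allowed cpuset from hwloc, which should already exclude anything the cgroup disallows:

/* minimal sketch (not PRRTE): discover the topology without
   HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED so cgroup-restricted CPUs
   stay out of the allowed cpuset */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    /* deliberately NOT setting HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED */
    hwloc_topology_load(topo);

    hwloc_const_cpuset_t allowed = hwloc_topology_get_allowed_cpuset(topo);
    char *str;
    hwloc_bitmap_asprintf(&str, allowed);
    printf("allowed cpuset: %s\n", str);   /* should reflect the cgroup limits */
    free(str);

    hwloc_topology_destroy(topo);
    return 0;
}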

I'll work out the issue for LSF as a separate problem - we don't see problems elsewhere, so it has something to do with what LSF is doing. My question for you is: how do I ensure the cpuset returned by get_allowed_cpuset only contains allowed CPUs, which is what PRRTE needs?

bgoglin (Contributor) commented Sep 10, 2024

Just ignore this corner-case. @sb22bs said using this envvar is a workaround. It was designed for strange buggy cases, e.g. when cgroups are misconfigured. I can try to better document that this envvar is a bad idea unless you really know what you are doing. Just consider that get_allowed_cpuset() is always correct.
