{bio}[foss/2023b] GROMACS v2024.3 #21430

Open
wants to merge 2 commits into base: develop


bedroge
Contributor

@bedroge bedroge commented Sep 17, 2024

(created using eb --new-pr)

Compared to previous easyconfigs, this one installs the PyPI version of gmxapi. The versioning of the bundled gmxapi is a bit confusing: https://gitlab.com/gromacs/gromacs/-/blob/v2024.3/python_packaging/gmxapi/pyproject.toml?ref_type=tags says 0.4.1, https://gitlab.com/gromacs/gromacs/-/blob/v2024.3/python_packaging/gmxapi/src/gmxapi/version.py?ref_type=tags shows 0.5.0a1, and the docs simply recommend using the PyPI version (where the latest version is 0.4.2).
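
To see which gmxapi version actually ends up in an installation, one option is to compare the installed package metadata with the version the module reports about itself. This is only a minimal sketch: the helper name `installed_versions` is hypothetical, and it assumes the module exposes a `__version__` attribute (it falls back to `None` if not).

```python
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package):
    """Return (metadata_version, module_version); either may be None."""
    try:
        # Version recorded by pip in the installed distribution metadata.
        metadata_version = version(package)
    except PackageNotFoundError:
        metadata_version = None
    try:
        # Version the module reports itself (assumed attribute: __version__).
        module_version = getattr(import_module(package), "__version__", None)
    except ImportError:
        module_version = None
    return metadata_version, module_version
```

Calling `installed_versions("gmxapi")` in the environment of the installed module would show whether the two sources agree.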

@bedroge bedroge added the update label Sep 17, 2024
@bedroge
Contributor Author

bedroge commented Sep 17, 2024

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@bedroge: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=21430 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_21430 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4887

Test results coming soon (I hope)...

- notification for comment with ID 2355567729 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@bedroge
Contributor Author

bedroge commented Sep 17, 2024

Test report by @bedroge
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bob-Latitude-5300 - Linux Ubuntu 24.04.1 LTS (Noble Numbat), x86_64, Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, Python 3.12.3
See https://gist.github.com/bedroge/00b77ed6bd3a5d428ec87908696c72e3 for a full test report.

edit: oops, forgot to include the fix from easybuilders/easybuild-easyblocks#3283, ran into that before...

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/50ee5c7ff043d4ebd45c44d1b99799af for a full test report.

@bedroge
Contributor Author

bedroge commented Sep 17, 2024

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@bedroge: Request for testing this PR well received on login1

PR test command 'EB_PR=21430 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_21430 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14284

Test results coming soon (I hope)...

- notification for comment with ID 2355672238 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

boegelbot commented Sep 17, 2024

Test report by @boegelbot
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/17e64042a902dfac5e5c3d9084e53233 for a full test report.

Test failure in GmxapiMpiTests:

File input/output error:
/tmp/boegelbot/GROMACS/2024.3/foss-2023b/easybuild_obj/api/gmxapi/cpp/tests/Testing/Temporary/GmxApiTest_RunnerChainedMD.trr

Let's try again...

@bedroge
Contributor Author

bedroge commented Sep 17, 2024

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@bedroge: Request for testing this PR well received on login1

PR test command 'EB_PR=21430 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_21430 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14285

Test results coming soon (I hope)...

- notification for comment with ID 2355854315 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@bedroge
Contributor Author

bedroge commented Sep 17, 2024

Test report by @bedroge
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3283
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bob-Latitude-5300 - Linux Ubuntu 24.04.1 LTS (Noble Numbat), x86_64, Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, Python 3.12.3
See https://gist.github.com/bedroge/3ef052bc3b70ea4b0c73ab4eb450ade9 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/d174a112ed863aa2ef88e1386ea67320 for a full test report.

@bedroge
Contributor Author

bedroge commented Sep 17, 2024

Also tested this with the EESSI bot for a bunch of CPUs: EESSI/software-layer#709. There it also failed on haswell with the same input/output error, so I've started another build.

@boegel
Member

boegel commented Sep 17, 2024

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3105.skitty.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/boegel/1a1c9f63138cf054dbb09e6d2b83ea0a for a full test report.

@mabraham

mabraham commented Sep 18, 2024

> Test report by @boegel
> FAILED
> Build succeeded for 0 out of 1 (1 easyconfigs in total)
> node3105.skitty.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
> See https://gist.github.com/boegel/1a1c9f63138cf054dbb09e6d2b83ea0a for a full test report.

GROMACS dev here. I see that the following test case

[  FAILED  ] PropagatorsWithCoupling/PeriodicActionsTest.PeriodicActionsAgreeWithReference/21, where GetParam() = ({ ("comm-mode", "linear"), ("integrator", "md-vv"), ("maxGromppWarningsTolerated", "0"), ("nstcomm", "5"), ("nstpcouple", "3"), ("nsttcouple", "2"), ("pcoupl", "C-rescale"), ("simulationName", "argon12"), ("tcoupl", "v-rescale") }, 0x55e03938e489)

fails, either timing out or somehow suspended or crashed. C-rescale is a relatively new implementation, and this test case is intended to exercise dark corners of the code, so a real problem is possible.

Yet I see the preceding test case (at https://gist.github.com/boegel/75ff6503735f73f2d9ec570366bd181f#file-gromacs-2024-3-foss-2023b_partial-log-L374) took 25 seconds. On my x86 laptop with a release debug build the whole test suite takes under 4 seconds. Why is this GROMACS configuration so slow?

@boegel
Member

boegel commented Sep 18, 2024

@mabraham It's probably not the GROMACS configuration itself, but the environment it's running in.

It's running in an interactive Slurm job, with 9 cores available (in a cgroup) out of 36 in total on that system.
It's also an Intel Skylake system (Intel Xeon Gold 6140), which isn't exactly new.

In addition, $OMP_PROC_BIND is set to TRUE by default on that system (via a profile script).
In general, that should improve performance for OpenMP workloads, but we've seen it cause trouble before: for some software (definitely for R, for example), all threads of a multi-threaded process get bound to a single core. The process still starts N threads, so those threads fight over one core, which leads to very slow runs.
That's a known quirk of the GCC OpenMP runtime, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113698

I've seen this before, but I never got to the bottom of it for GROMACS...

If any of this rings a bell, any insights you may have are welcome.
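
A quick way to check for this kind of mismatch from inside the job is to compare the CPUs the process is allowed to run on with the number of threads it will request. The following is a rough sketch, not a diagnosis tool: the helper names are hypothetical, `os.sched_getaffinity` is Linux-only, and it only inspects the affinity mask and environment, not what the OpenMP runtime actually does with them.

```python
import os


def current_allowed_cpus():
    """CPUs this process may run on, according to its affinity mask."""
    if hasattr(os, "sched_getaffinity"):  # Linux-only API
        return set(os.sched_getaffinity(0))
    return set(range(os.cpu_count() or 1))  # fallback: all visible CPUs


def requested_threads(env, default):
    """First level of OMP_NUM_THREADS, or `default` if unset or unparsable."""
    first = env.get("OMP_NUM_THREADS", "").split(",")[0].strip()
    return int(first) if first.isdigit() else default


def describe_binding(allowed_cpus, n_threads, omp_proc_bind):
    """Summarise whether the requested threads exceed the allowed CPUs."""
    return {
        "allowed_cpus": sorted(allowed_cpus),
        "requested_threads": n_threads,
        "omp_proc_bind": omp_proc_bind,
        "oversubscribed": n_threads > len(allowed_cpus),
    }
```

For example, `describe_binding(current_allowed_cpus(), requested_threads(os.environ, 1), os.environ.get("OMP_PROC_BIND"))` run inside the Slurm job would show whether the process has been confined to fewer CPUs than the threads it starts.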

@mabraham

The test cases are only using two pthreads, so if the system is working as you describe, there's no ready explanation of a problem. But if the core-to-cgroup mapping is not working right, such slowdowns are plausible. Do you have / can you get data to observe core occupancy across a loaded node?
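
One possible way to collect such data, sketched here under the assumption that the node is Linux and `/proc/stat` is readable, is to take two snapshots of the per-core counters and compute the busy fraction of each core in between. The function names are hypothetical, and tools such as `mpstat` or `htop` report the same information.

```python
def parse_proc_stat(text):
    """Map core index -> (busy_jiffies, total_jiffies) from /proc/stat text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only per-core lines such as "cpu0 ...", skip the aggregate "cpu".
        if not fields or not fields[0].startswith("cpu"):
            continue
        index = fields[0][3:]
        if not index.isdigit():
            continue
        # Fields: user nice system idle iowait irq softirq steal (guest ignored).
        values = [int(v) for v in fields[1:9]]
        idle = sum(values[3:5])  # idle + iowait
        total = sum(values)
        counters[int(index)] = (total - idle, total)
    return counters


def core_utilisation(before, after):
    """Busy fraction per core between two parse_proc_stat() snapshots."""
    usage = {}
    for core, (busy1, total1) in after.items():
        busy0, total0 = before.get(core, (0, 0))
        elapsed = total1 - total0
        usage[core] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return usage
```

Reading `/proc/stat` twice a few seconds apart while the tests run, and passing both snapshots to `core_utilisation`, would show whether the load is spread over the cores in the cgroup or piled onto one of them.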
