
T4 lysozyme example with implicit solvent runs out of memory when lots of memory appears to be available #1276

Open
therealchrisneale opened this issue May 20, 2022 · 0 comments


With 8 processes it works OK. Output of free while running mpiexec.hydra -np 8 yank script --yaml=p-xylene-implicit.yaml:
bash-4.2$ free
             total        used        free      shared  buff/cache   available
Mem:     131934588     5161320   114950568     1014712    11822700   124444344
Swap:            0           0           0

With 20 processes it fails. Output of free while running mpiexec.hydra -np 20 yank script --yaml=p-xylene-implicit.yaml, captured just before the failure:
bash-4.2$ free
             total        used        free      shared  buff/cache   available
Mem:     131934588     6578724   113531156     1019564    11824708   123022088
Swap:            0           0           0
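The two snapshots can be compared numerically as a sanity check. This sketch only uses the numbers reported above; parse_free and kib_to_gib are illustrative helpers, not part of YANK, and it assumes the values are in KiB (the default unit of free).

```python
# Illustrative helpers (not part of YANK). Numbers are copied from the two
# free snapshots above; free reports KiB by default.
FREE_NP8 = """\
total used free shared buff/cache available
Mem: 131934588 5161320 114950568 1014712 11822700 124444344
Swap: 0 0 0
"""

FREE_NP20 = """\
total used free shared buff/cache available
Mem: 131934588 6578724 113531156 1019564 11824708 123022088
Swap: 0 0 0
"""

KIB_PER_GIB = 1024 ** 2


def parse_free(text):
    """Return the 'Mem:' row of free output as a dict keyed by column header."""
    lines = text.strip().splitlines()
    headers = lines[0].split()
    for line in lines[1:]:
        fields = line.split()
        if fields[0] == "Mem:":
            return dict(zip(headers, (int(v) for v in fields[1:])))
    raise ValueError("no 'Mem:' row in free output")


def kib_to_gib(kib):
    """Convert KiB to GiB."""
    return kib / KIB_PER_GIB


np8 = parse_free(FREE_NP8)
np20 = parse_free(FREE_NP20)

# Extra memory in use when going from 8 to 20 MPI ranks.
extra_used = np20["used"] - np8["used"]
print("Extra 'used' with 20 ranks: {:.2f} GiB".format(kib_to_gib(extra_used)))
print("'available' with 20 ranks: {:.1f} of {:.1f} GiB".format(
    kib_to_gib(np20["available"]), kib_to_gib(np20["total"])))
```

By these numbers, going from 8 to 20 ranks adds roughly 1.35 GiB of used memory, while about 117 GiB is still reported as available just before the failure.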

The first error message and surrounding text were:
<…snip…>
2022-05-20 13:23:05,043: WARNING - openmmtools.multistate.multistatesampler - Warning: The openmmtools.multistate API is experimental and may change in future releases
Traceback (most recent call last):
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/schema/validator.py", line 411, in call_constructor
    obj = subcls(**constructor_kwargs)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/openmmtools/multistate/replicaexchange.py", line 217, in __init__
    super(ReplicaExchangeSampler, self).__init__(**kwargs)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/openmmtools/multistate/multistatesampler.py", line 203, in __init__
    self._display_cuda_devices()
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/openmmtools/multistate/multistatesampler.py", line 1772, in _display_cuda_devices
    cuda_query_output = os.popen("nvidia-smi --query-gpu=index,gpu_name --format=csv,noheader").read().strip()
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/os.py", line 980, in popen
    bufsize=buffering)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/subprocess.py", line 1295, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/bin/yank", line 10, in <module>
    sys.exit(main())
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/cli.py", line 73, in main
    dispatched = getattr(commands, command).dispatch(command_args)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/commands/script.py", line 155, in dispatch
    yaml_builder.run_experiments(write_status=write_status)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/experiment.py", line 747, in run_experiments
    group_size = self._get_experiment_mpi_group_size(all_experiments)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/experiment.py", line 2862, in _get_experiment_mpi_group_size
    sampler_names = {self._create_experiment_sampler(exp[1], []).__class__.__name__ for exp in experiments}
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/experiment.py", line 2862, in <setcomp>
    sampler_names = {self._create_experiment_sampler(exp[1], []).__class__.__name__ for exp in experiments}
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/experiment.py", line 2990, in _create_experiment_sampler
    return schema.call_sampler_constructor(constructor_description)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/schema/validator.py", line 470, in call_sampler_constructor
    special_conversions=special_conversions)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/schema/validator.py", line 413, in call_constructor
    raise RuntimeError('Attempt to initialize failed with: {}'.format(str(e)))
RuntimeError: Attempt to initialize failed with: [Errno 12] Cannot allocate memory
2022-05-20 13:23:05,054: CRITICAL - mpiplus.mpiplus - MPI node 1/20 raised an exception and called Abort()! The exception traceback follows
<…snip…>
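In the traceback above, the Errno 12 is raised inside openmmtools' _display_cuda_devices, which shells out to nvidia-smi through os.popen; os.popen goes through subprocess, and the error surfaces in _execute_child while spawning the child process, during sampler construction. The sketch below is a hedged illustration of a probe that treats a failed spawn as "no GPU information" instead of aborting. It assumes the GPU listing is informational only; query_gpus and parse_gpu_csv are hypothetical names, not the openmmtools API. It avoids text= so it also runs on Python 3.6, which this environment uses.

```python
import errno
import subprocess

# Hypothetical helper names; this is NOT openmmtools' implementation.
# Same query the traceback shows being passed to os.popen.
NVIDIA_SMI_CMD = ["nvidia-smi", "--query-gpu=index,gpu_name", "--format=csv,noheader"]


def parse_gpu_csv(text):
    """Parse 'index, name' lines as printed by the nvidia-smi query above."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, name = line.split(",", 1)
        gpus.append((int(index), name.strip()))
    return gpus


def query_gpus():
    """Return [(index, name), ...], or [] if nvidia-smi cannot be run.

    Spawning the child can fail with ENOMEM (as in the traceback above) or
    with ENOENT if nvidia-smi is not installed; both are OSError subclasses
    or instances, so both are treated as "no GPU info" here.
    """
    try:
        result = subprocess.run(
            NVIDIA_SMI_CMD,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,  # decode output as text (Python 3.6 compatible)
            check=True,               # raise CalledProcessError on non-zero exit
        )
    except OSError as exc:
        if exc.errno == errno.ENOMEM:
            reason = "out of memory while spawning nvidia-smi"
        else:
            reason = str(exc)
        print("Skipping CUDA device listing: {}".format(reason))
        return []
    except subprocess.CalledProcessError:
        # nvidia-smi ran but reported an error (e.g. no devices found).
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for index, name in query_gpus():
        print("GPU {}: {}".format(index, name))
```

Whether the device listing should be non-fatal is a design decision for the maintainers; the sketch only isolates the step that raised the error in this report.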

For what it's worth, I get an entirely different error with -np 25 (so perhaps I am just running things incorrectly, since I count 25 lambda values for the complex system):

<...snip...>
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 6939 RUNNING AT ba173
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
