-
It looks like you hit an out-of-memory error, so the GPU may not actually be in use.
By the way, it seems strange that your two cards report different amounts of memory.
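If another process is holding most of GPU 0 (the out-of-memory retries in the log suggest it is largely occupied, while GPU 1 reports far more free memory), one workaround is to pin the run to the freer card and let TensorFlow allocate memory on demand. A minimal sketch using standard CUDA/TensorFlow environment variables — the device index `1` is only an example; check `nvidia-smi` first:

```shell
# Pin the run to the card with more free memory (index is illustrative; verify with nvidia-smi)
export CUDA_VISIBLE_DEVICES=1
# Ask TensorFlow to grow GPU allocations on demand instead of grabbing a large block up front
export TF_FORCE_GPU_ALLOW_GROWTH=true
echo "${CUDA_VISIBLE_DEVICES} ${TF_FORCE_GPU_ALLOW_GROWTH}"
```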
-
Dear developers,
I compiled deepmd-kit and LAMMPS with the commands below, but the molecular dynamics (MD) speed is only about 25% of what I get from a direct conda installation. Because I use a modified PLUMED, I have to compile everything myself. I would therefore appreciate your help in identifying the underlying issue.
Installation commands:
conda create -n cuda11
conda activate cuda11
conda install python==3.11.5
conda install cuda-nvcc
pip install --upgrade pip
pip install nvidia-cudnn-cu11==8.6.0.163 protobuf==4.23.4 tensorflow==2.13.*
#open a new terminal
CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)"))
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/usergpu/soft/anaconda/install/envs/cuda11/lib/:$CUDNN_PATH/lib
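A note on the path lookup above: in a fresh shell the same cuDNN directory can be resolved with quoting that survives spaces in paths. This sketch assumes the pip package `nvidia-cudnn-cu11` is importable in the active environment:

```shell
# Resolve the directory of the pip-installed cuDNN package and prepend its lib/ to the loader path
CUDNN_PATH="$(python -c 'import os, nvidia.cudnn; print(os.path.dirname(nvidia.cudnn.__file__))')"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${CUDNN_PATH}/lib"
```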
## deepmd-kit
tar
cd source
mkdir build
cd build
export PATH=/home/usergpu/soft/cmake/cmake-3.30.0-rc2-linux-x86_64/bin:$PATH
cmake -DUSE_TF_PYTHON_LIBS=TRUE -DCMAKE_INSTALL_PREFIX=/home/usergpu/soft/deepmd-kit/install/ -DTENSORFLOW_ROOT=/home/usergpu/soft/anaconda/install/envs/cuda11/lib/python3.11/site-packages/tensorflow/ ..
make -j12
make install -j12
make lammps
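One thing worth double-checking: the cmake line above does not enable the CUDA toolkit, so the DeePMD-kit custom operators may have been built CPU-only, which could explain a large slowdown relative to the conda package. A hedged sketch of the configure step with GPU support enabled — the flag name is taken from the DeePMD-kit 2.x build documentation, and the paths are those from the original commands:

```shell
cmake -DUSE_TF_PYTHON_LIBS=TRUE \
      -DUSE_CUDA_TOOLKIT=TRUE \
      -DCMAKE_INSTALL_PREFIX=/home/usergpu/soft/deepmd-kit/install/ \
      -DTENSORFLOW_ROOT=/home/usergpu/soft/anaconda/install/envs/cuda11/lib/python3.11/site-packages/tensorflow/ \
      ..
```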
## lammps
cd lammps-stable_2Aug2023_update2/
cd src/
cp -r /home/usergpu/soft/deepmd-kit/deepmd-kit-2.2.7/source/build/USER-DEEPMD/ .
make yes-kspace
make yes-extra-fix
make yes-user-deepmd
source /home/usergpu/soft/plumed-2.8.1/sourceme.sh
make lib-plumed args='-p /home/usergpu/xyliu/soft/plumed-2.8.1/build/ -m shared'
make mpi -j 12
The on-screen output when I submit a task:
2024-06-18 20:02:33.307208: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-06-18 20:02:33.344105: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-06-18 20:02:34.312498: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-06-18 20:02:34.333658: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-18 20:02:35.269270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:35.276366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
2024-06-18 20:02:35.289328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:35.291170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
2024-06-18 20:02:35.325452: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:375] MLIR V1 optimization pass is not enabled
2024-06-18 20:02:35.358886: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:375] MLIR V1 optimization pass is not enabled
2024-06-18 20:02:35.508553: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 36.02GiB (38673055744 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.512437: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 32.42GiB (34805747712 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.516274: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 29.17GiB (31325171712 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.520040: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 26.26GiB (28192653312 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.524024: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 23.63GiB (25373386752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.528826: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 21.27GiB (22836047872 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.534870: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 19.14GiB (20552441856 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.540422: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 17.23GiB (18497198080 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.544968: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 15.50GiB (16647478272 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.548758: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 13.95GiB (14982729728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.552603: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 12.56GiB (13484455936 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.556746: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 11.30GiB (12136009728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.562263: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 10.17GiB (10922408960 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.567612: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 9.15GiB (9830167552 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.571641: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 8.24GiB (8847150080 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.576848: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 7.42GiB (7962435072 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.580606: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 6.67GiB (7166191616 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.584758: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 6.01GiB (6449572352 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.588573: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 5.41GiB (5804615168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.594112: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 4.87GiB (5224153600 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.599475: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 4.38GiB (4701737984 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.603644: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 3.94GiB (4231564032 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.607448: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 3.55GiB (3808407552 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.611319: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 3.19GiB (3427566592 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.615339: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 2.87GiB (3084809728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.620897: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 2.58GiB (2776328704 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.626355: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 2.33GiB (2498695680 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.631273: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 2.09GiB (2248826112 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.635073: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 1.88GiB (2023943424 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.638871: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 1.70GiB (1821549056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-06-18 20:02:35.643169: I tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:753] failed to allocate 1.53GiB (1639394048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-06-18 20:02:36.492733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:36.494282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
2024-06-18 20:02:36.497800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:36.499288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
2024-06-18 20:02:36.532222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:36.549077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
2024-06-18 20:02:36.558621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 36881 MB memory: -> device: 0, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:18:00.0, compute capability: 8.0
2024-06-18 20:02:36.560058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77825 MB memory: -> device: 1, name: NVIDIA A800 80GB PCIe, pci bus id: 0000:86:00.0, compute capability: 8.0
log file
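Regarding the repeated warnings in the log above: the thread-count variables can be set before launching LAMMPS. The values below are purely illustrative; tune them for your hardware per the DeePMD-kit parallelism guide linked in the warnings:

```shell
# Example thread settings for a single-process GPU run (values are illustrative)
export OMP_NUM_THREADS=4
export TF_INTRA_OP_PARALLELISM_THREADS=4
export TF_INTER_OP_PARALLELISM_THREADS=2
echo "${OMP_NUM_THREADS} ${TF_INTRA_OP_PARALLELISM_THREADS} ${TF_INTER_OP_PARALLELISM_THREADS}"
```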