accelerate launch stuck forever #3073

wd255 · 2024-09-03T21:20:48Z

System Info

Running `accelerate launch` on a Linux server with ten RTX 4090 cards. Environment details:
- `Accelerate` version: 0.33.0
- Platform: Linux-6.5.0-28-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /root/anaconda3/envs/new_magvit/bin/accelerate
- Python version: 3.12.4
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.4.0 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 629.62 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: True
        - num_processes: 10
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0,1,2,3,4,5,6,7,8,9
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

The official example scripts
My own modified scripts

Tasks

One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)

Reproduction

conda create -n my_env
conda activate my_env
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
conda install -c conda-forge accelerate
Run `accelerate launch test.py`. It does not matter whether test.py exists: the command hangs before it reaches the point where the file matters. That said, the test.py I used is:

import torch
import torch.nn as nn
from accelerate import Accelerator

if __name__ == "__main__":
    accelerator = Accelerator()
    model = nn.Conv2d(10, 20, 3, 1, 1)
    model = accelerator.prepare(model)

Expected behavior

If test.py exists, it should be executed.
Instead, the program hangs forever after running `accelerate launch`, without printing anything. Ctrl-C cannot kill it, so I have to suspend it with Ctrl-Z. After that, something is still running on port 29500, so I have to kill it manually before the next launch.
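
For anyone trying to reproduce this, here is a minimal sketch for checking whether something is still listening on port 29500 before relaunching. The helper name `port_in_use` is hypothetical, and the sketch assumes the leftover process listens on localhost over TCP. It only reports whether the port accepts connections; it does not identify or kill the process.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper: assumes the leftover process listens on `host`.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 29500 is the port reported as still occupied in this issue
    print(f"port 29500 in use: {port_in_use(29500)}")
```

If this reports that the port is in use after a suspended launch, the earlier process is probably still holding it and needs to be killed before trying again.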
