accelerate launch stuck forever #3073

Open
2 of 4 tasks
wd255 opened this issue Sep 3, 2024 · 0 comments

wd255 commented Sep 3, 2024

System Info

Running `accelerate launch` on a Linux server with 10 RTX 4090 GPUs. Environment details:
- `Accelerate` version: 0.33.0
- Platform: Linux-6.5.0-28-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /root/anaconda3/envs/new_magvit/bin/accelerate
- Python version: 3.12.4
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.4.0 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 629.62 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: True
        - num_processes: 10
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0,1,2,3,4,5,6,7,8,9
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

  1. conda create -n my_env
  2. conda activate my_env
  3. conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
  4. conda install -c conda-forge accelerate
  5. Run `accelerate launch test.py`. It does not matter whether `test.py` exists: the command hangs before it reaches the point where the file is needed. For reference, the `test.py` I used is:
import torch
import torch.nn as nn
from accelerate import Accelerator

if __name__ == "__main__":
    accelerator = Accelerator()
    model = nn.Conv2d(10, 20, 3, 1, 1)
    model = accelerator.prepare(model)
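
One way to check the claim that the hang happens before the script is reached is to print something as the very first action of each worker. The following is a minimal sketch, not part of the original report: the helper name `launcher_env` is hypothetical, and the variable names are the ones the torch distributed launcher is expected to export to workers, which is an assumption about how `accelerate launch` starts multi-GPU jobs.

```python
import os
import sys

# Variables the torch distributed launcher is expected to export to each
# worker process (an assumption; adjust if your setup differs).
LAUNCHER_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def launcher_env(environ=os.environ):
    """Return the launcher-related variables, with None for any that are unset."""
    return {name: environ.get(name) for name in LAUNCHER_VARS}


if __name__ == "__main__":
    # flush=True so the line shows up even if the process hangs right afterwards.
    print(f"[pid {os.getpid()}] worker started: {launcher_env()}",
          file=sys.stderr, flush=True)
```

If these lines never appear when placed at the top of `test.py`, the worker processes were most likely never started, which would be consistent with the hang sitting in the launcher rather than in the script.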

Expected behavior

If `test.py` exists, it should be executed.
Instead, `accelerate launch` hangs indefinitely without printing anything. Ctrl-C does not kill it, so I have to suspend it with Ctrl-Z. After that, something is still running on port 29500, so I have to kill it manually before the next run.
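
To reproduce the hang without losing the terminal, the launch can be wrapped in a timeout that kills the whole process group, followed by a check of whether port 29500 is still bound. This is a minimal sketch under stated assumptions: `run_with_timeout` and `port_in_use` are hypothetical helper names, the code is POSIX-only (it uses `os.killpg`), and the port check only detects a listener reachable on `127.0.0.1`.

```python
import os
import signal
import socket
import subprocess


def port_in_use(port, host="127.0.0.1"):
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def run_with_timeout(cmd, timeout_s):
    """Run cmd in its own process group and kill the whole group on timeout.

    Returns the exit code, or None if the command had to be killed.
    """
    # start_new_session=True makes the child a process group leader, so its
    # process group id equals its pid and children it spawns share that group.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed child
        return None
```

For example, `run_with_timeout(["accelerate", "launch", "test.py"], timeout_s=120)` returning `None` would indicate the hang, and `port_in_use(29500)` afterwards would show whether a leftover process is still listening. Killing the process group rather than the single launcher process is a deliberate choice here, since worker processes that outlive the launcher could be what keeps the port occupied.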
