RuntimeError: CUDA error: device-side assert triggered #188

Open
A91A981E opened this issue Jan 25, 2023 · 1 comment
A91A981E commented Jan 25, 2023

❓ Questions and Help

Happy Chinese New Year!
I tried to train this model on VG. I followed the README to get started and ran into some problems with mixed precision, so I switched to float32. At iteration 4812, with a batch size of 12, the error below occurred. The full traceback is as follows:

Traceback (most recent call last):
  File "/root/.vscode-server/extensions/ms-python.python-2021.2.633441544/pythonFiles/lib/python/debugpy/_vendored/pydevd/pydevd.py", line 3215, in <module>
  File "/root/.vscode-server/extensions/ms-python.python-2021.2.633441544/pythonFiles/lib/python/debugpy/_vendored/pydevd/pydevd.py", line 3208, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/root/.vscode-server/extensions/ms-python.python-2021.2.633441544/pythonFiles/lib/python/debugpy/_vendored/pydevd/pydevd.py", line 2282, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "/root/.vscode-server/extensions/ms-python.python-2021.2.633441544/pythonFiles/lib/python/debugpy/_vendored/pydevd/pydevd.py", line 2289, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/root/.vscode-server/extensions/ms-python.python-2021.2.633441544/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydev_imps/_pydev_execfile.py", line 25, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "tools/relation_train_net.py", line 383, in <module>
    main()
  File "tools/relation_train_net.py", line 376, in main
    model = train(cfg, args.local_rank, args.distributed, logger)
  File "tools/relation_train_net.py", line 164, in train
    scaled_losses.backward()
  File "/root/miniconda3/envs/scene_graph_benchmark/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/envs/scene_graph_benchmark/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
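
For anyone reproducing this: the message above suggests CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the traceback points at the op that actually failed rather than at a later API call. Below is a minimal sketch of one way to set it, assuming the training command is re-run from a small Python wrapper. The helper names `blocking_env` and `rerun_blocking` are hypothetical and not part of this repo; the variable has to be in the environment before CUDA is initialized, which is why the sketch starts a fresh child process.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 added.

    Hypothetical helper for illustration; not part of the repository.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Synchronous kernel launches: errors surface at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_blocking(script_args):
    """Run `python <script_args...>` in a child process with the debug env.

    script_args is whatever followed `python` in the original command,
    for example ["tools/relation_train_net.py", ...original arguments...].
    """
    return subprocess.run([sys.executable, *script_args], env=blocking_env())
```

The same effect can be had from a shell by prefixing the original command with `CUDA_LAUNCH_BLOCKING=1`. Training will be slower in this mode, so it is only meant for locating the failing op.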

I noticed the NOTE in relation_train_net.py line 161, so I commented out:

with amp.scale_loss(losses, optimizer) as scaled_losses:
    scaled_losses.backward()

and used this instead:

losses.backward()

That did not work either. The error changed to:

Traceback (most recent call last):
  File "tools/relation_train_net.py", line 384, in <module>
    main()
  File "tools/relation_train_net.py", line 377, in main
    model = train(cfg, args.local_rank, args.distributed, logger)
  File "tools/relation_train_net.py", line 165, in train
    losses.backward()
  File "/root/miniconda3/envs/scene_graph_benchmark/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/envs/scene_graph_benchmark/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Has anyone run into this issue before?

@dyang-TUM

@A91A981E Did you solve this problem?
