torch._C._LinAlgError: linalg.svd: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 15). #1599

Open
CharisWg opened this issue Jul 24, 2024 · 1 comment

@CharisWg

C:\Users\LocalAdmin\anaconda3\envs\lightlyyolo\python.exe D:\Charis\SSL-yolo8\lightly-master\examples\pytorch\mmcr_yolo.py
WARNING ⚠️ no model scale passed. Assuming scale='n'.
class_name is: MMCR
save_path is: D:\Charis\SSL-yolo8\lightly-master\runs\MMCR
Starting Training
epoch: 00, loss: -2415920191337764664519950336.00000
after training
tensor([ -0.7926, -2.2815, -0.7858, -14.8213, -16.7507], device='cuda:0')
tensor([ -0.7926, -2.2815, -0.7858, -14.8213, -16.7507], device='cuda:0')
tensor([ -0.7926, -2.2815, -0.7858, -14.8213, -16.7507], device='cuda:0')
tensor([-0.4687, -0.7416, -0.3247, -4.7035, -5.2732], device='cuda:0')
after saving training + has backbone.load_state_dict
tensor([-0.4687, -0.7416, -0.3247, -4.7035, -5.2732], device='cuda:0')
tensor([-0.4687, -0.7416, -0.3247, -4.7035, -5.2732], device='cuda:0')
tensor([-0.4687, -0.7416, -0.3247, -4.7035, -5.2732], device='cuda:0')
tensor([-0.4687, -0.7416, -0.3247, -4.7035, -5.2732], device='cuda:0')
save full_path is: D:\Charis\SSL-yolo8\lightly-master\runs\MMCR\MMCR_coca_alldcm_MMCRTransform.pth
Saving model for MMCR_coca_alldcm_MMCRTransform.pth at Epoch 1
Finding optimal model params. Loss is dropping from -2415920191337764664519950336.0000 to -2415920191337764664519950336.0000
D:\Charis\SSL-yolo8\lightly-master\lightly\loss\mmcr_loss.py:60: UserWarning: torch.linalg.svd: During SVD computation with the selected cusolver driver, batches 0, 1, 2, 3, 4, and other 123 batches failed to converge. A more accurate method will be used to compute the SVD as a fallback. Check doc at https://pytorch.org/docs/stable/generated/torch.linalg.svd.html (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\linalg\BatchLinearAlgebraLib.cpp:703.)
  _, S_z, _ = svd(z)
Traceback (most recent call last):
  File "D:\Charis\SSL-yolo8\lightly-master\examples\pytorch\mmcr_yolo.py", line 158, in <module>
    loss = criterion(z_o, z_m)
  File "C:\Users\LocalAdmin\anaconda3\envs\lightlyyolo\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\LocalAdmin\anaconda3\envs\lightlyyolo\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Charis\SSL-yolo8\lightly-master\lightly\loss\mmcr_loss.py", line 60, in forward
    _, S_z, _ = svd(z)
torch._C._LinAlgError: linalg.svd: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 15).

Process finished with exit code 1

@guarin
Contributor

guarin commented Aug 16, 2024

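Hi, sorry for the late reply. It looks like your loss has diverged: its magnitude is far too large (-2415920191337764664519950336.00000 after the first epoch). Values that extreme can leave the embeddings passed to `svd` ill-conditioned, which would explain the convergence failure. Try decreasing the learning rate, or check your gradient values and clip them if necessary.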
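As a rough illustration of that suggestion, here is a minimal sketch of a training step that stops on a non-finite loss and clips the gradient norm before the optimizer update. The `training_step` helper, the toy `nn.Linear` model, the `nn.MSELoss` stand-in criterion, and the `max_grad_norm=1.0` threshold are all assumptions made for the example; they are not taken from `mmcr_yolo.py` or from the MMCR loss, so adapt them to your own loop.

```python
import torch
from torch import nn


def training_step(model, criterion, optimizer, x0, x1, max_grad_norm=1.0):
    """Run one update with a finite-loss check and gradient-norm clipping.

    Hypothetical helper: x0 and x1 stand for two augmented views of a batch.
    Returns the loss value and the gradient norm measured before clipping.
    """
    optimizer.zero_grad()
    loss = criterion(model(x0), model(x1))

    # Fail early instead of feeding inf/NaN values into later computations.
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss: {loss.item()}")

    loss.backward()

    # Rescales gradients in place so their total norm is at most max_grad_norm;
    # the return value is the total norm before clipping, useful for logging.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

    optimizer.step()
    return loss.item(), float(grad_norm)


# Toy usage with stand-in components (assumptions, not the MMCR setup).
torch.manual_seed(0)
model = nn.Linear(8, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x0, x1 = torch.randn(16, 8), torch.randn(16, 8)
loss_value, grad_norm = training_step(model, nn.MSELoss(), optimizer, x0, x1)
print(f"loss={loss_value:.4f} grad_norm_before_clipping={grad_norm:.4f}")
```

Logging the pre-clipping norm each step makes it easier to see whether the gradients are exploding (which points to the learning rate) or whether the loss itself is unbounded for your inputs.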
