
Bug: Existence of system-wide version of a shared library causes undefined symbol error #1640

Open
Garbaz opened this issue Aug 2, 2024 · 10 comments



Garbaz commented Aug 2, 2024

To reproduce (assuming you have libnvjitlink12 installed system-wide, in a different version from the one bundled in the virtualenv):

library(reticulate)

venv_name <- "deleteme_5267"
virtualenv_create(venv_name)
use_virtualenv(venv_name)

py_install("torch", pip = true)

pytorch  <- import("torch")

The final line gives me this error:

Error in py_module_import(module, convert = convert) : 
  ImportError: /home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12

Checking nm -gDC ~/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12 | grep nvJitLinkAddData, I get:

0000000000262eb0 T nvJitLinkAddData@@libnvJitLink.so.12
0000000000263070 T __nvJitLinkAddData_12_0@@libnvJitLink.so.12
0000000000263080 T __nvJitLinkAddData_12_1@@libnvJitLink.so.12
0000000000263090 T __nvJitLinkAddData_12_2@@libnvJitLink.so.12
00000000002630a0 T __nvJitLinkAddData_12_3@@libnvJitLink.so.12
00000000002630b0 T __nvJitLinkAddData_12_4@@libnvJitLink.so.12
00000000002630c0 T __nvJitLinkAddData_12_5@@libnvJitLink.so.12
00000000002630d0 T __nvJitLinkAddData_12_6@@libnvJitLink.so.12

So the version of libnvJitLink.so.12 in the virtualenv has the symbol. And if I activate the virtualenv normally in a shell and import torch from a plain Python REPL, I don't get any errors. So the fault does not lie with libcusparse.so.12 itself.

The thing is though, the library libnvJitLink.so.12 is also installed system-wide, but in a different version. Checking there with nm -gDC /usr/lib/x86_64-linux-gnu/libnvJitLink.so.12 | grep nvJitLinkAddData, I get only:

0000000000226bd0 T __nvJitLinkAddData_12_0@@libnvJitLink.so.12

And when I remove the system-wide version of the library with

sudo apt remove libnvjitlink12:amd64

the error no longer occurs.

It appears that if there is a system-wide version of a shared library, it is preferred over the local version in the virtualenv. This is not how things should be!
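
A quick way to confirm this (a Linux-only sketch that just reads /proc/self/maps from the R session after the failed import) would be to list which copies of libnvJitLink the R process actually has mapped:

# Sketch: list the libnvJitLink files mapped into the current R process (Linux only)
maps <- readLines("/proc/self/maps")
hits <- grep("nvJitLink", maps, value = TRUE)
unique(sub("^\\S+( +\\S+){4} +", "", hits))  # keep only the pathname column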

R version is 4.4.1 (2024-06-14) and reticulate version is reticulate_1.38.0.


Garbaz commented Aug 2, 2024

To be clear, sudo apt remove libnvjitlink12:amd64 is not really a solution to this problem.


t-kalinowski commented Aug 2, 2024

Thanks for reporting!

Are you using the RStudio IDE? Does this happen only in the RStudio IDE, or outside the IDE too?


Garbaz commented Aug 5, 2024

Ah, I should have added that I'm using RStudio Server. And I should have tested running the repro code directly in R.

I don't have access to a machine at the moment where I can test the code in RStudio Desktop, so I can't check whether it's an RStudio Server-specific issue. But running source("repro.R"), where repro.R contains the repro code:

library(reticulate)

venv_name <- "deleteme_5267"
virtualenv_create(venv_name)
use_virtualenv(venv_name)

py_install("torch", pip = true)

pytorch  <- import("torch")

I do not get the error. And running e.g. pytorch$cuda$is_available() works as expected.

So it appears to be an interaction between RStudio (Server) and reticulate that is the issue.


Garbaz commented Aug 5, 2024

Wait, scratch that, I forgot I uninstalled libnvjitlink12 to temporarily fix the issue. Reinstalling it, I get the same error in plain R!

So it has nothing to do with RStudio (Server) in particular.

@t-kalinowski

I don't think reticulate is modifying the order of loaded libs.

If this occurs with reticulate::import("torch") in R, but not in a terminal with ~/.virtualenvs/r-torch/bin/python -c 'import torch', then it's likely that something in the R session is either

  1. Modifying LD_LIBRARY_PATH
  2. Pre-loading the "wrong" libnvjitlink12 for some reason.

Can you please double-check the value of Sys.getenv("LD_LIBRARY_PATH") in R, and also, inspect other R startup files for code that might be causing this (.Rprofile, .Renviron, etc.)?
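
For example (plain base R; the startup-file locations below are just the usual defaults):

Sys.getenv("LD_LIBRARY_PATH")

# Candidate startup files that could modify the environment
candidates <- c("~/.Rprofile", "~/.Renviron",
                file.path(R.home("etc"), "Rprofile.site"),
                file.path(R.home("etc"), "Renviron.site"))
candidates[file.exists(candidates)]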


Garbaz commented Aug 5, 2024

Both Sys.getenv("LD_LIBRARY_PATH") and os <- import("os"); os$environ["LD_LIBRARY_PATH"] give:

"/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server"

What I do find weird is that there is no mention of the virtualenv in os$environ["LD_LIBRARY_PATH"] either, even though the libraries from the virtualenv are evidently found.
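
One thing that might explain it (an assumption worth checking, not something I have verified): the shared objects shipped in the NVIDIA wheels may embed their own RPATH/RUNPATH entries pointing at the sibling wheel directories, so the loader could find the venv copies without any help from LD_LIBRARY_PATH. A quick check could look like:

# Sketch: inspect the dynamic section of the wheel's libcusparse for RPATH/RUNPATH entries
so <- "~/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/nvidia/cusparse/lib/libcusparse.so.12"
dyn <- system2("readelf", c("-d", path.expand(so)), stdout = TRUE)
grep("RPATH|RUNPATH", dyn, value = TRUE)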


Garbaz commented Aug 5, 2024

Okay, it appears Python does not simply use the LD_LIBRARY_PATH environment variable. At least when I run os.environ["LD_LIBRARY_PATH"] in the normal Python REPL (from the virtualenv), I get a KeyError.

However, Python does use an environment variable PYTHONPATH. Running os <- import("os"); os$environ["PYTHONPATH"] in R I get:

"/usr/local/lib/R/site-library/reticulate/config:/usr/lib/python312.zip:/usr/lib/python3.12:/usr/lib/python3.12/lib-dynload:/home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages:/usr/local/lib/R/site-library/reticulate/python"

I will investigate whether I can fix the issue by messing with PYTHONPATH.
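
For reference, the two variables feed different mechanisms, which a quick check makes visible (assuming reticulate has already initialised Python):

sys <- import("sys")
sys$path                      # module search path -- this is what PYTHONPATH extends
Sys.getenv("LD_LIBRARY_PATH") # one of the paths the dynamic linker searches for shared libraries

So PYTHONPATH only controls where Python modules are found, not how the dynamic linker resolves libnvJitLink.so.12 for libcusparse.so.12.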

Update: I have experimented with both LD_LIBRARY_PATH and PYTHONPATH and could not get the issue to go away. I will continue trying to figure this out later this week.


Garbaz commented Aug 5, 2024

By the way, py_last_error() gives:

--- Python Exception Message
Traceback (most recent call last):
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 122, in _find_and_load_hook
    return _run_hook(name, _hook)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 96, in _run_hook
    module = hook()
             ^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 120, in _hook
    return _find_and_load(name, import_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/torch/__init__.py", line 290, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 122, in _find_and_load_hook
    return _run_hook(name, _hook)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 96, in _run_hook
    module = hook()
             ^^^^^^
  File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 120, in _hook
    return _find_and_load(name, import_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: /home/tobi/.virtualenvs/deleteme_5267/lib/python3.12/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12
--- R Traceback
    ▆
 1. └─reticulate::import("torch")
 2.   └─reticulate:::py_module_import(module, convert = convert)
See `reticulate::py_last_error()$r_trace$full_call` for more details.

In case that's of any help.


t-kalinowski commented Aug 5, 2024

I am unable to reproduce locally.

Note that PyTorch can be installed a few different ways, depending on your environment. You may want to consult https://pytorch.org/get-started/locally/ and see if there is something that will work better for you than a bare pip install torch (e.g., pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124).
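
From R, that could look something like the following (a sketch; it assumes py_install() forwards pip_options to virtualenv_install(), and the index URL has to match your CUDA version):

# Sketch: install the CUDA 12.4 wheels from PyTorch's own index via reticulate
py_install(
  c("torch", "torchvision", "torchaudio"),
  pip = TRUE,
  pip_options = "--index-url https://download.pytorch.org/whl/cu124"
)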


Garbaz commented Aug 5, 2024

I do not think this has anything to do with torch in particular. The reason torch recommends using their bespoke PyPI repository has to do with driver/CUDA version incompatibilities and shouldn't change anything about the issue here. I will, however, try this for completeness.
