
[RFC] Add an LLM engine #1127

Status: Open. Wants to merge 2 commits into base: main.

Conversation

@JianyuZhan commented Aug 16, 2024

Motivation

Edited 8/18: now it's complete; see the conversation below for the new PR description.

This is not complete work; it is just a PoC and a request for comments.

This PR adds an LLM engine, addressing the roadmap item "Add APIs for using the inference engine in a single script without launching a separate server".

The demo usage is in examples/usage/llm_engine.py:

from sglang import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The capital of China is",
    "What is the meaning of life?",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="deepseek-ai/deepseek-llm-7b-chat")

outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for prompt, output in zip(prompts, outputs):
    print('===============================')
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Modification

It adds:

  1. a core Engine, which wraps the core logic of what the current server.launch_server() does, with added shutdown logic to gracefully bring down the ZMQ sockets in the TokenizerManager when the job finishes.
  2. The class SamplingParams is now exposed as an API. (GenerateReqInput holds a dict version of SamplingParams while the internal logic uses the class version, which means that exposing the class as an API requires a circuitous class-to-dict-to-class transform; this needs to be fixed later.)
  3. Along the way, I also added EngineArgs and made a new ServerArgs a thin wrapper around it (see sglang/srt/serving/engine_args.py and sglang/srt/serving/server_args.py in the commit), plus some config objects built from EngineArgs, like ModelConfig, ScheduleConfig, ParallelConfig, etc., mimicking vllm. This opens up an opportunity to clean up the internal passing of ServerArgs around many functions and to draw cleaner APIs for the different sub-components. I didn't make that modification yet (these files are added but have no effect on the server code logic for now) because it is quite intrusive to the current code base, hence this PR for RFC. A rough sketch of the intended split follows this list.
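
A rough sketch of the intended EngineArgs/ServerArgs relationship, assuming a dataclass-based design (whether the wrapper nests or flattens the engine args is an implementation detail; the field names below are illustrative guesses drawn from this thread, not the PR's actual definitions):

from dataclasses import dataclass, field, fields
from typing import Optional

@dataclass
class EngineArgs:
    # Engine-level arguments shared by the server path and the standalone engine path.
    model_path: str = "deepseek-ai/deepseek-llm-7b-chat"
    tensor_parallel_size: int = 1
    context_length: Optional[int] = None
    trust_remote_code: bool = False

@dataclass
class ServerArgs:
    # Server-only arguments layered on top of EngineArgs (host, port, API key, OpenAI API options).
    host: str = "127.0.0.1"
    port: int = 30000
    api_key: Optional[str] = None
    engine_args: EngineArgs = field(default_factory=EngineArgs)

    @classmethod
    def from_kwargs(cls, **kwargs):
        # Split keyword arguments between ServerArgs and EngineArgs so that
        # callers of ServerArgs stay EngineArgs-agnostic.
        engine_field_names = {f.name for f in fields(EngineArgs)}
        engine_kwargs = {k: v for k, v in kwargs.items() if k in engine_field_names}
        server_kwargs = {k: v for k, v in kwargs.items() if k not in engine_field_names}
        return cls(engine_args=EngineArgs(**engine_kwargs), **server_kwargs)

# Example: one flat set of kwargs, transparently split between the two.
args = ServerArgs.from_kwargs(port=8000, model_path="deepseek-ai/deepseek-llm-7b-chat", tensor_parallel_size=2)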

Checklist

  • Before submitting a PR for review, make sure it has passed verification in your local development environment at least.
  • Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  • Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  • Modify documentation as needed, such as docstrings or example tutorials.

@Ying1123 (Member) left a comment

Thanks for your contribution! It looks good at a high level. My review is still in progress, and other committers will also take a look.

Review thread on python/sglang/srt/managers/controller_multi.py (outdated, resolved).
@zhyncs (Member) commented Aug 17, 2024

@JianyuZhan Can you fully verify locally before committing now? I'm currently troubleshooting CI issues, which will be affected.

@JianyuZhan (Author) commented Aug 18, 2024

Hi @Ying1123, @zhyncs, this PR is now complete and has passed all CI tests (the previous e2e-test failure was due to a missing PYTHONPATH setting, so I added one in e2e-test.yaml and the test passed). It is kept rebased on the upstream/main branch and is ready for review.

This PR makes the following modifications:

  1. Added Engine and EngineArgs, in parallel with Server and ServerArgs (all in sglang/srt/serving/). ServerArgs is now a thin wrapper over EngineArgs, adding server-specific args such as host, port, api key, and the OpenAI API related options. ServerArgs transparently passes all args belonging to EngineArgs through to it, keeping users of ServerArgs EngineArgs-agnostic. Server is built on Engine, resulting in a succinct launch_server API and implementation.
  2. Based on 1, we now have two serving methods: the old server API, and the Engine API without running a server (see examples/usage/llm_engine.py). I therefore put them in a standalone folder, sglang/srt/serving/.
  3. Along the way, we can now get rid of ServerArgs in the internal APIs (managers/, model_executor/, etc.). Instead we introduce ModelConfig, ScheduleConfig, ParallelConfig, OptimizationConfig, and ObservabilityConfig (all created from EngineArgs at Engine creation time), and the internal APIs now communicate with each other via these *Config objects. This results in quite coherent interfaces in terms of component abstraction and dependencies; a minimal sketch follows this list.
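
For illustration only, a minimal sketch of how the per-component *Config objects could be derived from EngineArgs at Engine creation time (the ModelConfig/ParallelConfig field names are assumptions based on the description above, not the PR's actual code; ScheduleConfig, OptimizationConfig, and ObservabilityConfig would follow the same pattern):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfig:
    model_path: str
    context_length: Optional[int]
    trust_remote_code: bool

@dataclass
class ParallelConfig:
    tensor_parallel_size: int

class Engine:
    def __init__(self, engine_args):
        # Build per-component configs once at engine creation time, then hand each
        # subsystem (managers/, model_executor/, ...) only the config it needs,
        # instead of threading a full ServerArgs through every internal API.
        self.model_config = ModelConfig(
            model_path=engine_args.model_path,
            context_length=engine_args.context_length,
            trust_remote_code=engine_args.trust_remote_code,
        )
        self.parallel_config = ParallelConfig(
            tensor_parallel_size=engine_args.tensor_parallel_size,
        )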

@JianyuZhan (Author)

I have rebased onto the latest upstream/main branch, and it now passes all CI tests.

@merrymercy merrymercy changed the title [RFC]Add an LLM engine [RFC] Add an LLM engine Aug 20, 2024
@DragonFive commented Aug 29, 2024

@JianyuZhan Hi, I tried to test with your repo. I cloned the main branch and ran

pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu118/torch2.4/

but when I run the following test

from sglang import LLM, SamplingParams

it says

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 1
----> 1 from sglang import LLM, SamplingParams

ImportError: cannot import name 'LLM' from 'sglang' (unknown location)

How can I fix it? Looking forward to your help!

@JianyuZhan (Author)

@DragonFive, add "LLM" and "SamplingParams" to __all__ in sglang/__init__.py (they are imported in this __init__.py now but not exposed), or you can try from sglang.api import LLM, SamplingParams. The code lags behind upstream; I will rebase and re-push later.
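
For reference, the suggested workaround amounts to something like this in sglang/__init__.py (a sketch; the existing entries of __all__ are elided):

# sglang/__init__.py (sketch of the workaround described above)
from sglang.api import LLM, SamplingParams

__all__ = [
    # ...existing entries...
    "LLM",
    "SamplingParams",
]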

@DragonFive

It works fine for me, thanks for your contribution!

@jischein commented Aug 29, 2024

Running into the following issues (surfaced via tp_worker.py) when trying to query Llama 3.1 405B FP8 on an 8xH100 while setting tensor_parallel_size=8.
Note: requests to Llama 3.1 8B Instruct are successful (i.e. with tp=1, everything runs as intended)

llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
prompts = ["Hello, my name is"]
res = llm.generate(prompts)
[18:19:55 TP0] Decode batch. #running-req: 1, #token: 85, token usage: 0.00, gen throughput (token/s): 32.43, #queue-req: 0
[18:19:57 TP0] Decode batch. #running-req: 1, #token: 125, token usage: 0.00, gen throughput (token/s): 32.32, #queue-req: 0
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[18:19:57 TP2] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:9425: Connection reset by peer

Process Process-1:2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:9425: Connection reset by peer
[18:19:57 TP1] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:59799: Connection reset by peer

[18:19:57 TP4] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:43139

Process Process-1:1:
Process Process-1:4:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
Traceback (most recent call last):
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:59799: Connection reset by peer
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:43139
[18:19:57 TP6] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:37813

Process Process-1:6:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:37813
[18:19:57 TP3] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:52914

Process Process-1:3:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:52914
[18:19:57 TP5] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:63369

Process Process-1:5:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:63369
>>> [18:19:57 TP7] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:56545

Process Process-1:7:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:56545

@JianyuZhan (Author)

@jischein, thanks for testing. I don't have a multi-GPU environment to test with. Per my analysis, your error looks like the TP processes are not terminated properly in Engine::shutdown(). I tried to fix it; would you mind testing the new code I just pushed?

@feifei-111

This PR raises "AttributeError: 'Engine' object has no attribute 'tp_procs'" when doing inference with one GPU; self.tp_procs = None needs to be added in Engine.startup (see the sketch below).
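
A minimal sketch of that guard, assuming Engine only launches TP worker processes when tensor_parallel_size > 1 (the _launch_tp_workers helper is hypothetical):

class Engine:
    def startup(self):
        # Always define tp_procs, so single-GPU runs (which spawn no TP worker
        # processes) do not hit AttributeError later in shutdown().
        self.tp_procs = None
        if self.engine_args.tensor_parallel_size > 1:
            self.tp_procs = self._launch_tp_workers()  # hypothetical helper

    def shutdown(self):
        # Terminate worker processes only if they were actually started.
        for proc in self.tp_procs or []:
            proc.terminate()
            proc.join()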

@@ -195,9 +195,9 @@ def load_model(self):
         monkey_patch_vllm_qvk_linear_loader()

         self.dtype = self.vllm_model_config.dtype
-        if self.model_config.model_override_args is not None:
+        if self.model_config.model_overide_args is not None:
@jischein commented Sep 1, 2024

This should be model_override_args; I tried compiling the engine with tp=8 and got

(there is a typo; this doesn't match the fn signature in get_config)

>>> from sglang import LLM, SamplingParams
>>> llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/sglang/python/sglang/srt/serving/engine.py", line 190, in __init__
    self.llm_engine = Engine(engine_args)
  File "/home/ubuntu/sglang/python/sglang/srt/serving/engine.py", line 50, in __init__
    self.startup()
  File "/home/ubuntu/sglang/python/sglang/srt/serving/engine.py", line 89, in startup
    self.tokenizer_manager = TokenizerManager(self.engine_args)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 93, in __init__
    self.hf_config = get_config(
TypeError: get_config() got an unexpected keyword argument 'model_overide_args'
>>>

-            trust_remote_code=server_args.trust_remote_code,
-            model_override_args=model_override_args,
+            trust_remote_code=engine_args.trust_remote_code,
+            model_overide_args=engine_args.model_override_args,

same here; this should be model_override_args to match fn signature in get_config

-        self.port_args = port_args
-        self.model_override_args = model_override_args
+        self.engine_args = engine_args
+        self.model_overide_args = engine_args.model_override_args

here too

engine_args.model_path,
engine_args.trust_remote_code,
context_length=engine_args.context_length,
model_overide_args=engine_args.model_override_args,

here as well

@jischein commented Sep 1, 2024

@JianyuZhan unfortunately still running into errors after cleaning up the typos

llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
>>> prompts = ["Hi my name is"]
>>> res=llm.generate(prompts)
[17:53:56 TP0] Prefill batch. #new-seq: 1, #new-token: 5, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
[17:53:57 TP0] Decode batch. #running-req: 1, #token: 45, token usage: 0.00, gen throughput (token/s): 0.49, #queue-req: 0
[17:53:58 TP0] Decode batch. #running-req: 1, #token: 85, token usage: 0.00, gen throughput (token/s): 32.50, #queue-req: 0
[17:54:00 TP0] Decode batch. #running-req: 1, #token: 125, token usage: 0.00, gen throughput (token/s): 32.47, #queue-req: 0
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[17:54:00 TP1] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:51919: Connection reset by peer

Process Process-1:1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:51919: Connection reset by peer
[17:54:00 TP2] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:2179: Connection reset by peer

@jischein commented Sep 1, 2024

JianyuZhan#1 — @JianyuZhan this compiles / addresses the typo

@DragonFive

Changing all occurrences of 'model_overide_args' to 'model_override_args' in the repo works well.

@DragonFive

@JianyuZhan It ran well on Llama 3.1 8B before I upgraded sglang to v0.3.0; after that I encountered a confusing error:

10:46:19.553 [10:46:19 TP0] Exception in ControllerSingle:
10:46:19.553 Traceback (most recent call last):
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/controller_single.py", line 157, in start_controller_process
10:46:19.553     controller.loop_for_forward()
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/controller_single.py", line 98, in loop_for_forward
10:46:19.553     out_pyobjs = self.tp_server.exposed_step(recv_reqs)
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 243, in exposed_step
10:46:19.553     self.forward_step()
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553     return func(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 259, in forward_step
10:46:19.553     self.forward_prefill_batch(new_batch)
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 506, in forward_prefill_batch
10:46:19.553     sample_output, logits_output = self.model_runner.forward(
10:46:19.553   File "/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 591, in forward
10:46:19.553     return self.forward_extend(batch)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553     return func(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 555, in forward_extend
10:46:19.553     return self.model.forward(
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553     return func(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/models/llama.py", line 317, in forward
10:46:19.553     hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553     return self._call_impl(*args, **kwargs)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553     return forward_call(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/models/llama.py", line 282, in forward
10:46:19.553     hidden_states, residual = layer(
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553     return self._call_impl(*args, **kwargs)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553     return forward_call(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/models/llama.py", line 232, in forward
10:46:19.553     hidden_states = self.self_attn(
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553     return self._call_impl(*args, **kwargs)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553     return forward_call(*args, **kwargs)
10:46:19.554   File "/github_sglang/python/sglang/srt/models/llama.py", line 168, in forward
10:46:19.554     q, k = self.rotary_emb(positions, q, k)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.554     return self._call_impl(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.554     return forward_call(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/custom_op.py", line 14, in forward
10:46:19.554     return self._forward_method(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/rotary_embedding.py", line 216, in forward_cuda
10:46:19.554     ops.rotary_embedding(positions, query, key, self.head_size,
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 37, in wrapper
10:46:19.554     raise e
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 28, in wrapper
10:46:19.554     return fn(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 138, in rotary_embedding
10:46:19.554     torch.ops._C.rotary_embedding(positions, query, key, head_size,
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/torch/_ops.py", line 1170, in __getattr__
10:46:19.554     raise AttributeError(
10:46:19.554 AttributeError: '_OpNamespace' '_C' object has no attribute 'rotary_embedding'

@JianyuZhan (Author) commented Sep 6, 2024

@DragonFive I think it is because the vllm dependency was upgraded; you should update your local installation as well: pip install --upgrade pip, then pip install -e "python[all]". Check the install section in the README.

@yangky11 commented Sep 7, 2024

Hi, thank you for this PR. I'm looking forward to trying it out. I'm wondering if there is a plan to support asynchronous operations, similar to vllm's AsyncLLMEngine.

@zhyncs (Member) commented Sep 8, 2024


Hi @yangky11, maybe you can try this:

async def async_generate(
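
A minimal usage sketch, assuming async_generate is a coroutine reachable on the engine behind LLM (both the llm.llm_engine attribute path and the exact signature are assumptions; check the PR's engine.py for the real interface):

import asyncio

from sglang import LLM, SamplingParams

async def main():
    llm = LLM(model="deepseek-ai/deepseek-llm-7b-chat")
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Hypothetical: await the engine's async_generate coroutine directly,
    # instead of the blocking llm.generate() call used elsewhere in this thread.
    output = await llm.llm_engine.async_generate("Hello, my name is", sampling_params)
    print(output["text"])

asyncio.run(main())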

@jischein commented Sep 13, 2024

@JianyuZhan @zhyncs is this close to being merged? I would love to start using it.

@merrymercy merrymercy mentioned this pull request Sep 22, 2024