
[RFC] Add an LLM engine #1127

Status: Open. Wants to merge 2 commits into base: main.

Conversation

@JianyuZhan commented Aug 16, 2024

Motivation

Edited 8/18: now it's complete; see the conversation below for the new PR description.

This is not complete work; it is just a PoC and a request for comments.

This PR adds an LLM engine, addressing the roadmap item "Add APIs for using the inference engine in a single script without launching a separate server".

The demo usage is in examples/usage/llm_engine.py:

from sglang import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The capital of China is",
    "What is the meaning of life?",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="deepseek-ai/deepseek-llm-7b-chat")

outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for prompt, output in zip(prompts, outputs):
    print('===============================')
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Modification

It adds:

  1. a core Engine, which wraps the core logic of what the current server.launch_server() does, with added shutdown logic to gracefully bring down the ZMQ sockets in the TokenizerManager when the job finishes.
  2. The class SamplingParams is now exposed as an API. (GenerateReqInput holds a dict version of SamplingParams while the internal logic uses the class version, which means that exposing the class as an API requires a circuitous class-to-dict-to-class transform; this needs to be fixed later.)
  3. Along the way, I also added EngineArgs and made a new ServerArgs a thin wrapper around it (see sglang/srt/serving/engine_args.py and sglang/srt/serving/server_args.py in the commit), plus some config objects built from EngineArgs, like ModelConfig, ScheduleConfig, ParallelConfig, etc., mimicking vllm. This opens up an opportunity to clean up the internal passing of ServerArgs around many functions and to draw cleaner APIs for the different sub-components. I didn't make that modification yet (these files are added but have no effect on the server code logic for now) because it is quite intrusive to the current code base, hence this PR for RFC. A rough sketch of the intended split follows this list.
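
A rough sketch of the intended EngineArgs/ServerArgs relationship, assuming a dataclass-based design (whether the wrapper nests or flattens the engine args is an implementation detail; the field names below are illustrative guesses drawn from this thread, not the PR's actual definitions):

from dataclasses import dataclass, field, fields
from typing import Optional

@dataclass
class EngineArgs:
    # Engine-level arguments shared by the server path and the standalone engine path.
    model_path: str = "deepseek-ai/deepseek-llm-7b-chat"
    tensor_parallel_size: int = 1
    context_length: Optional[int] = None
    trust_remote_code: bool = False

@dataclass
class ServerArgs:
    # Server-only arguments layered on top of EngineArgs (host, port, API key, OpenAI API options).
    host: str = "127.0.0.1"
    port: int = 30000
    api_key: Optional[str] = None
    engine_args: EngineArgs = field(default_factory=EngineArgs)

    @classmethod
    def from_kwargs(cls, **kwargs):
        # Split keyword arguments between ServerArgs and EngineArgs so that
        # callers of ServerArgs stay EngineArgs-agnostic.
        engine_field_names = {f.name for f in fields(EngineArgs)}
        engine_kwargs = {k: v for k, v in kwargs.items() if k in engine_field_names}
        server_kwargs = {k: v for k, v in kwargs.items() if k not in engine_field_names}
        return cls(engine_args=EngineArgs(**engine_kwargs), **server_kwargs)

# Example: one flat set of kwargs, transparently split between the two.
args = ServerArgs.from_kwargs(port=8000, model_path="deepseek-ai/deepseek-llm-7b-chat", tensor_parallel_size=2)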

Checklist

  • Before submitting a PR for review, make sure it has passed verification in your local development environment at least.
  • Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  • Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  • Modify documentation as needed, such as docstrings or example tutorials.

@Ying1123 (Member) left a comment

Thanks for your contribution! It looks good at a high level. My review is still in progress, and other committers will also take a look.

Review thread on python/sglang/srt/managers/controller_multi.py (outdated, resolved).
@zhyncs (Member) commented Aug 17, 2024

@JianyuZhan Can you fully verify locally before committing now? I'm currently troubleshooting CI issues, which will be affected.

@JianyuZhan (Author) commented Aug 18, 2024

Hi @Ying1123, @zhyncs, this PR is now complete and has passed all CI tests (the previous e2e-test failure was due to a missing PYTHONPATH setting, so I added one in e2e-test.yaml and the test passed). It is kept rebased on the upstream/main branch and is ready for review.

This PR makes the following modifications:

  1. Added Engine and EngineArgs, in parallel with Server and ServerArgs (all in sglang/srt/serving/). ServerArgs is now a thin wrapper over EngineArgs, adding server-specific args such as host, port, api key, and the OpenAI API related options. ServerArgs transparently passes all args belonging to EngineArgs through to it, keeping users of ServerArgs EngineArgs-agnostic. Server is built on Engine, resulting in a succinct launch_server API and implementation.
  2. Based on 1, we now have two serving methods: the old server API, and the Engine API without running a server (see examples/usage/llm_engine.py). I therefore put them in a standalone folder, sglang/srt/serving/.
  3. Along the way, we can now get rid of ServerArgs in the internal APIs (managers/, model_executor/, etc.). Instead we introduce ModelConfig, ScheduleConfig, ParallelConfig, OptimizationConfig, and ObservabilityConfig (all created from EngineArgs at Engine creation time), and the internal APIs now communicate with each other via these *Config objects. This results in quite coherent interfaces in terms of component abstraction and dependencies; a minimal sketch follows this list.
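
For illustration only, a minimal sketch of how the per-component *Config objects could be derived from EngineArgs at Engine creation time (the ModelConfig/ParallelConfig field names are assumptions based on the description above, not the PR's actual code; ScheduleConfig, OptimizationConfig, and ObservabilityConfig would follow the same pattern):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfig:
    model_path: str
    context_length: Optional[int]
    trust_remote_code: bool

@dataclass
class ParallelConfig:
    tensor_parallel_size: int

class Engine:
    def __init__(self, engine_args):
        # Build per-component configs once at engine creation time, then hand each
        # subsystem (managers/, model_executor/, ...) only the config it needs,
        # instead of threading a full ServerArgs through every internal API.
        self.model_config = ModelConfig(
            model_path=engine_args.model_path,
            context_length=engine_args.context_length,
            trust_remote_code=engine_args.trust_remote_code,
        )
        self.parallel_config = ParallelConfig(
            tensor_parallel_size=engine_args.tensor_parallel_size,
        )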

@JianyuZhan (Author)

I have rebased onto the latest upstream/main branch, and it now passes all CI tests.

@merrymercy merrymercy changed the title [RFC]Add an LLM engine [RFC] Add an LLM engine Aug 20, 2024
@DragonFive commented Aug 29, 2024

@JianyuZhan Hi, I tried to test with your repo. I cloned the main branch and ran

pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu118/torch2.4/

but when I run the following test

from sglang import LLM, SamplingParams

it says

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 1
----> 1 from sglang import LLM, SamplingParams

ImportError: cannot import name 'LLM' from 'sglang' (unknown location)

How can I fix it? Looking forward to your help!

@JianyuZhan (Author)

@DragonFive, add "LLM" and "SamplingParams" to __all__ in sglang/__init__.py (they are imported in this __init__.py now but not exposed), or you can try from sglang.api import LLM, SamplingParams. The code lags behind upstream; I will rebase and re-push later.
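
For reference, the suggested workaround amounts to something like this in sglang/__init__.py (a sketch; the existing entries of __all__ are elided):

# sglang/__init__.py (sketch of the workaround described above)
from sglang.api import LLM, SamplingParams

__all__ = [
    # ...existing entries...
    "LLM",
    "SamplingParams",
]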

@DragonFive

It works fine for me, thanks for your contribution!

@jischein commented Aug 29, 2024

Running into the following issues (surfaced via tp_worker.py) when trying to query Llama 3.1 405B FP8 on an 8xH100 while setting tensor_parallel_size=8.
Note: requests to Llama 3.1 8B Instruct are successful (i.e. with tp=1, everything runs as intended)

llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
prompts = ["Hello, my name is"]
res = llm.generate(prompts)
[18:19:55 TP0] Decode batch. #running-req: 1, #token: 85, token usage: 0.00, gen throughput (token/s): 32.43, #queue-req: 0
[18:19:57 TP0] Decode batch. #running-req: 1, #token: 125, token usage: 0.00, gen throughput (token/s): 32.32, #queue-req: 0
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[18:19:57 TP2] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:9425: Connection reset by peer

Process Process-1:2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:9425: Connection reset by peer
[18:19:57 TP1] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:59799: Connection reset by peer

[18:19:57 TP4] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:43139

Process Process-1:1:
Process Process-1:4:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
Traceback (most recent call last):
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:59799: Connection reset by peer
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:43139
[18:19:57 TP6] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:37813

Process Process-1:6:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:37813
[18:19:57 TP3] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:52914

Process Process-1:3:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:52914
[18:19:57 TP5] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:63369

Process Process-1:5:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:63369
>>> [18:19:57 TP7] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:56545

Process Process-1:7:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.233.87.118]:56545

@JianyuZhan (Author)

@jischein, thanks for testing. I don't have a multi-GPU environment to test with. Per my analysis, your error looks like the TP processes are not terminated properly in Engine::shutdown(). I tried to fix it; would you mind testing the new code I just pushed?

@feifei-111

This PR raises "AttributeError: 'Engine' object has no attribute 'tp_procs'" when doing inference with one GPU; self.tp_procs = None needs to be added in Engine.startup (see the sketch below).
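
A minimal sketch of that guard, assuming Engine only launches TP worker processes when tensor_parallel_size > 1 (the _launch_tp_workers helper is hypothetical):

class Engine:
    def startup(self):
        # Always define tp_procs, so single-GPU runs (which spawn no TP worker
        # processes) do not hit AttributeError later in shutdown().
        self.tp_procs = None
        if self.engine_args.tensor_parallel_size > 1:
            self.tp_procs = self._launch_tp_workers()  # hypothetical helper

    def shutdown(self):
        # Terminate worker processes only if they were actually started.
        for proc in self.tp_procs or []:
            proc.terminate()
            proc.join()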

@@ -195,9 +195,9 @@ def load_model(self):
         monkey_patch_vllm_qvk_linear_loader()

         self.dtype = self.vllm_model_config.dtype
-        if self.model_config.model_override_args is not None:
+        if self.model_config.model_overide_args is not None:
@jischein commented Sep 1, 2024

This should be model_override_args; I tried compiling the engine with tp=8 and got

(there is a typo; this doesn't match the fn signature in get_config)

>>> from sglang import LLM, SamplingParams
>>> llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/sglang/python/sglang/srt/serving/engine.py", line 190, in __init__
    self.llm_engine = Engine(engine_args)
  File "/home/ubuntu/sglang/python/sglang/srt/serving/engine.py", line 50, in __init__
    self.startup()
  File "/home/ubuntu/sglang/python/sglang/srt/serving/engine.py", line 89, in startup
    self.tokenizer_manager = TokenizerManager(self.engine_args)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 93, in __init__
    self.hf_config = get_config(
TypeError: get_config() got an unexpected keyword argument 'model_overide_args'
>>>

-            trust_remote_code=server_args.trust_remote_code,
-            model_override_args=model_override_args,
+            trust_remote_code=engine_args.trust_remote_code,
+            model_overide_args=engine_args.model_override_args,

same here; this should be model_override_args to match fn signature in get_config

-        self.port_args = port_args
-        self.model_override_args = model_override_args
+        self.engine_args = engine_args
+        self.model_overide_args = engine_args.model_override_args

here too

engine_args.model_path,
engine_args.trust_remote_code,
context_length=engine_args.context_length,
model_overide_args=engine_args.model_override_args,

here as well

@jischein commented Sep 1, 2024

@JianyuZhan unfortunately still running into errors after cleaning up the typos

llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
>>> prompts = ["Hi my name is"]
>>> res=llm.generate(prompts)
[17:53:56 TP0] Prefill batch. #new-seq: 1, #new-token: 5, #cached-token: 0, cache hit rate: 0.00%, #running-req: 0, #queue-req: 0
[17:53:57 TP0] Decode batch. #running-req: 1, #token: 45, token usage: 0.00, gen throughput (token/s): 0.49, #queue-req: 0
[17:53:58 TP0] Decode batch. #running-req: 1, #token: 85, token usage: 0.00, gen throughput (token/s): 32.50, #queue-req: 0
[17:54:00 TP0] Decode batch. #running-req: 1, #token: 125, token usage: 0.00, gen throughput (token/s): 32.47, #queue-req: 0
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
[17:54:00 TP1] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:51919: Connection reset by peer

Process Process-1:1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:51919: Connection reset by peer
[17:54:00 TP2] Exception in run_tp_server:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 892, in run_tp_server
    recv_reqs = broadcast_recv_input(None, tp_rank, tp_cpu_group)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 938, in broadcast_recv_input
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/vllm-foundry/env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2213, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.233.87.118]:2179: Connection reset by peer

@jischein commented Sep 1, 2024

JianyuZhan#1 — @JianyuZhan this compiles / addresses the typo

@DragonFive

Changing all occurrences of 'model_overide_args' to 'model_override_args' in the repo works well.

@DragonFive

@JianyuZhan It ran well on Llama 3.1 8B before I upgraded sglang to v0.3.0; after that I encountered a confusing error:

10:46:19.553 [10:46:19 TP0] Exception in ControllerSingle:
10:46:19.553 Traceback (most recent call last):
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/controller_single.py", line 157, in start_controller_process
10:46:19.553     controller.loop_for_forward()
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/controller_single.py", line 98, in loop_for_forward
10:46:19.553     out_pyobjs = self.tp_server.exposed_step(recv_reqs)
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 243, in exposed_step
10:46:19.553     self.forward_step()
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553     return func(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 259, in forward_step
10:46:19.553     self.forward_prefill_batch(new_batch)
10:46:19.553   File "/github_sglang/python/sglang/srt/managers/tp_worker.py", line 506, in forward_prefill_batch
10:46:19.553     sample_output, logits_output = self.model_runner.forward(
10:46:19.553   File "/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 591, in forward
10:46:19.553     return self.forward_extend(batch)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553     return func(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/model_executor/model_runner.py", line 555, in forward_extend
10:46:19.553     return self.model.forward(
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
10:46:19.553     return func(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/models/llama.py", line 317, in forward
10:46:19.553     hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553     return self._call_impl(*args, **kwargs)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553     return forward_call(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/models/llama.py", line 282, in forward
10:46:19.553     hidden_states, residual = layer(
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553     return self._call_impl(*args, **kwargs)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553     return forward_call(*args, **kwargs)
10:46:19.553   File "/github_sglang/python/sglang/srt/models/llama.py", line 232, in forward
10:46:19.553     hidden_states = self.self_attn(
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.553     return self._call_impl(*args, **kwargs)
10:46:19.553   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.553     return forward_call(*args, **kwargs)
10:46:19.554   File "/github_sglang/python/sglang/srt/models/llama.py", line 168, in forward
10:46:19.554     q, k = self.rotary_emb(positions, q, k)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
10:46:19.554     return self._call_impl(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
10:46:19.554     return forward_call(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/custom_op.py", line 14, in forward
10:46:19.554     return self._forward_method(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/model_executor/layers/rotary_embedding.py", line 216, in forward_cuda
10:46:19.554     ops.rotary_embedding(positions, query, key, self.head_size,
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 37, in wrapper
10:46:19.554     raise e
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 28, in wrapper
10:46:19.554     return fn(*args, **kwargs)
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/vllm/_custom_ops.py", line 138, in rotary_embedding
10:46:19.554     torch.ops._C.rotary_embedding(positions, query, key, head_size,
10:46:19.554   File "/usr/local/lib/python3.9/site-packages/torch/_ops.py", line 1170, in __getattr__
10:46:19.554     raise AttributeError(
10:46:19.554 AttributeError: '_OpNamespace' '_C' object has no attribute 'rotary_embedding'

@JianyuZhan (Author) commented Sep 6, 2024

@DragonFive I think it is because the vllm dependency was upgraded; you should update your local installation as well: pip install --upgrade pip, then pip install -e "python[all]". Check the install section in the README.

@yangky11 commented Sep 7, 2024

Hi, thank you for this PR. I'm looking forward to trying it out. I'm wondering if there is a plan to support asynchronous operations, similar to vllm's AsyncLLMEngine.

@zhyncs (Member) commented Sep 8, 2024


Hi @yangky11, maybe you can try this:

async def async_generate(
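
A minimal usage sketch, assuming async_generate is a coroutine reachable on the engine behind LLM (both the llm.llm_engine attribute path and the exact signature are assumptions; check the PR's engine.py for the real interface):

import asyncio

from sglang import LLM, SamplingParams

async def main():
    llm = LLM(model="deepseek-ai/deepseek-llm-7b-chat")
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Hypothetical: await the engine's async_generate coroutine directly,
    # instead of the blocking llm.generate() call used elsewhere in this thread.
    output = await llm.llm_engine.async_generate("Hello, my name is", sampling_params)
    print(output["text"])

asyncio.run(main())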

@jischein commented Sep 13, 2024

@JianyuZhan @zhyncs is this close to being merged? I would love to start using it.

@merrymercy merrymercy mentioned this pull request Sep 22, 2024