
MoE torch compile #1497

Merged
merged 1 commit into sgl-project:main on Sep 24, 2024
Conversation

ispobock
Collaborator

@ispobock ispobock commented Sep 24, 2024

Motivation

Temporary workaround to make MoE compatible with torch compile, using a monkey patch.
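The PR itself is not shown here, but the general monkey-patch pattern it describes can be sketched as follows. All names below (`moe_ops`, `fused_moe`) are hypothetical stand-ins, not the actual sglang internals: the idea is to rebind a module-level attribute at import time so every call site picks up the compile-friendly implementation without touching the call sites themselves.

```python
# Hypothetical stand-in for the module/class holding the MoE kernel.
class moe_ops:
    @staticmethod
    def fused_moe(x):
        # Placeholder for the original op (here: double every element).
        return [v * 2 for v in x]

# Keep a handle to the original implementation so the patch can delegate.
_original_fused_moe = moe_ops.fused_moe

def fused_moe_compile_friendly(x):
    # In the real patch this would be a torch.compile-compatible rewrite;
    # in this sketch it simply delegates to the original implementation.
    return _original_fused_moe(x)

# The monkey patch: rebind the attribute in place.
moe_ops.fused_moe = staticmethod(fused_moe_compile_friendly)
```

Because only the attribute binding changes, the patch can be removed later by restoring `_original_fused_moe`, which is what makes it a reasonable temporary workaround.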

Bench Latency

python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-V2-Lite --disable-radix --trust-remote-code --input-len 128 --output-len 8 --batch 1 --enable-torch-compile --max-torch-compile-bs 1

# bs=1, w/o torch compile
Decode.  median latency: 0.00921 s, median throughput:    108.61 token/s
Total. latency:  0.101 s, throughput:   1352.80 token/s

# bs=1, w/ torch compile, skip moe (main)
Decode.  median latency: 0.00735 s, median throughput:    136.13 token/s
Total. latency:  0.086 s, throughput:   1574.70 token/s

# bs=1, w/ torch compile + moe (this PR)
Decode.  median latency: 0.00682 s, median throughput:    146.71 token/s
Total. latency:  0.083 s, throughput:   1632.26 token/s
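For reference, the decode-latency numbers above work out to the following speedups (simple arithmetic on the reported medians, bs=1):

```python
# Decode median latencies (seconds) from the benchmark log above.
baseline = 0.00921  # w/o torch compile
skip_moe = 0.00735  # torch compile, MoE skipped (main)
with_moe = 0.00682  # torch compile incl. MoE (this PR)

print(f"{baseline / with_moe:.2f}x vs. no compile")  # ~1.35x
print(f"{skip_moe / with_moe:.2f}x vs. main")        # ~1.08x
```

So compiling the MoE layer buys roughly another 8% decode speedup on top of the existing torch-compile path.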

Evaluation

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --port 30000 --trust-remote-code --host 0.0.0.0  --enable-torch-compile --max-torch-compile-bs 1 --max-running-requests 1
python3 benchmark/gsm8k/bench_sglang.py --num-questions 200

Accuracy: 0.825
Invalid: 0.000
Latency: 189.962 s
Output throughput: 136.559 token/s

@merrymercy merrymercy merged commit 8d4ed42 into sgl-project:main Sep 24, 2024
11 of 12 checks passed
@HaiShaw
Contributor

HaiShaw commented Sep 29, 2024

@ispobock, on v0.3.2 I see an issue if I add --quant fp8, as below:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/sgl-workspace/sglang/python/sglang/bench_latency.py", line 511, in <module>
[rank0]:     raise e
[rank0]:   File "/sgl-workspace/sglang/python/sglang/bench_latency.py", line 509, in <module>
[rank0]:     main(server_args, bench_args)
[rank0]:   File "/sgl-workspace/sglang/python/sglang/bench_latency.py", line 472, in main
[rank0]:     work_func(server_args, bench_args, 0)
[rank0]:   File "/sgl-workspace/sglang/python/sglang/bench_latency.py", line 354, in latency_test
[rank0]:     model_runner, tokenizer = load_model(server_args, tp_rank)
[rank0]:   File "/sgl-workspace/sglang/python/sglang/bench_latency.py", line 133, in load_model
[rank0]:     model_runner = ModelRunner(
[rank0]:   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 128, in __init__
[rank0]:     self.init_cuda_graphs()
[rank0]:   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 468, in init_cuda_graphs
[rank0]:     self.cuda_graph_runner = CudaGraphRunner(self)
[rank0]:   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 153, in __init__
[rank0]:     raise Exception(
[rank0]: Exception: Capture cuda graph failed: backend='inductor' raised:
[rank0]: RuntimeError: "_local_scalar_dense_cuda" not implemented for 'Float8_e4m3fn'
