
MoE torch compile #1497

Merged
merged 1 commit into sgl-project:main on Sep 24, 2024
Conversation

ispobock
Collaborator

@ispobock ispobock commented Sep 24, 2024

Motivation

Temporary workaround to make MoE compatible with torch compile, using a monkey patch.
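The PR itself is not shown here, but the general monkey-patch pattern it describes can be sketched as follows. All names below (`moe_ops`, `fused_moe`) are hypothetical stand-ins, not the actual sglang internals: the idea is to rebind a module-level attribute at import time so every call site picks up the compile-friendly implementation without touching the call sites themselves.

```python
# Hypothetical stand-in for the module/class holding the MoE kernel.
class moe_ops:
    @staticmethod
    def fused_moe(x):
        # Placeholder for the original op (here: double every element).
        return [v * 2 for v in x]

# Keep a handle to the original implementation so the patch can delegate.
_original_fused_moe = moe_ops.fused_moe

def fused_moe_compile_friendly(x):
    # In the real patch this would be a torch.compile-compatible rewrite;
    # in this sketch it simply delegates to the original implementation.
    return _original_fused_moe(x)

# The monkey patch: rebind the attribute in place.
moe_ops.fused_moe = staticmethod(fused_moe_compile_friendly)
```

Because only the attribute binding changes, the patch can be removed later by restoring `_original_fused_moe`, which is what makes it a reasonable temporary workaround.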

Bench Latency

python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-V2-Lite --disable-radix --trust-remote-code --input-len 128 --output-len 8 --batch 1 --enable-torch-compile --max-torch-compile-bs 1

# bs=1, w/o torch compile
Decode.  median latency: 0.00921 s, median throughput:    108.61 token/s
Total. latency:  0.101 s, throughput:   1352.80 token/s

# bs=1, w/ torch compile, skip moe (main)
Decode.  median latency: 0.00735 s, median throughput:    136.13 token/s
Total. latency:  0.086 s, throughput:   1574.70 token/s

# bs=1, w/ torch compile + moe (this PR)
Decode.  median latency: 0.00682 s, median throughput:    146.71 token/s
Total. latency:  0.083 s, throughput:   1632.26 token/s
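For reference, the decode-latency numbers above work out to the following speedups (simple arithmetic on the reported medians, bs=1):

```python
# Decode median latencies (seconds) from the benchmark log above.
baseline = 0.00921  # w/o torch compile
skip_moe = 0.00735  # torch compile, MoE skipped (main)
with_moe = 0.00682  # torch compile incl. MoE (this PR)

print(f"{baseline / with_moe:.2f}x vs. no compile")  # ~1.35x
print(f"{skip_moe / with_moe:.2f}x vs. main")        # ~1.08x
```

So compiling the MoE layer buys roughly another 8% decode speedup on top of the existing torch-compile path.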

Evaluation

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --port 30000 --trust-remote-code --host 0.0.0.0  --enable-torch-compile --max-torch-compile-bs 1 --max-running-requests 1
python3 benchmark/gsm8k/bench_sglang.py --num-questions 200

Accuracy: 0.825
Invalid: 0.000
Latency: 189.962 s
Output throughput: 136.559 token/s

@merrymercy merrymercy merged commit 8d4ed42 into sgl-project:main Sep 24, 2024
11 of 12 checks passed
@HaiShaw
Contributor

HaiShaw commented Sep 29, 2024

@ispobock, on v0.3.2 I see an issue if I add --quant fp8, as below:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/sgl-workspace/sglang/python/sglang/bench_latency.py", line 511, in <module>
[rank0]:     raise e
[rank0]:   File "/sgl-workspace/sglang/python/sglang/bench_latency.py", line 509, in <module>
[rank0]:     main(server_args, bench_args)
[rank0]:   File "/sgl-workspace/sglang/python/sglang/bench_latency.py", line 472, in main
[rank0]:     work_func(server_args, bench_args, 0)
[rank0]:   File "/sgl-workspace/sglang/python/sglang/bench_latency.py", line 354, in latency_test
[rank0]:     model_runner, tokenizer = load_model(server_args, tp_rank)
[rank0]:   File "/sgl-workspace/sglang/python/sglang/bench_latency.py", line 133, in load_model
[rank0]:     model_runner = ModelRunner(
[rank0]:   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 128, in __init__
[rank0]:     self.init_cuda_graphs()
[rank0]:   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 468, in init_cuda_graphs
[rank0]:     self.cuda_graph_runner = CudaGraphRunner(self)
[rank0]:   File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 153, in __init__
[rank0]:     raise Exception(
[rank0]: Exception: Capture cuda graph failed: backend='inductor' raised:
[rank0]: RuntimeError: "_local_scalar_dense_cuda" not implemented for 'Float8_e4m3fn'
