Releases: sgl-project/sglang
Release v0.3.0
Highlights
Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.
- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision
- Support for interleaved window attention and 2x longer context length in Gemma-2
- Chunked prefill is turned on by default (you can choose to separate or to mix prefill and decode).
- Added multi-GPU accuracy tests, performance tests, and a nightly accuracy test for more models.
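The sketch below shows one way to exercise these defaults: launch a server and send a request through the OpenAI-compatible endpoint. The launch command, the `--enable-torch-compile` flag, the port, and the model path are illustrative assumptions rather than part of this release note; see the blog post and `python -m sglang.launch_server --help` for the options available in your version.

```python
# A minimal sketch, not taken from the release itself: query a locally launched
# SGLang server through its OpenAI-compatible API. The launch command, the
# --enable-torch-compile flag, the port, and the model path are assumptions.
#
#   python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2-Lite \
#       --enable-torch-compile --port 30000
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; the server answers with whichever model it was launched with
    messages=[{"role": "user", "content": "Explain multi-head latent attention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```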
What's Changed
- update hyperparameter guide by @merrymercy in #1114
- ci: compatible with fork repo by @zhyncs in #1115
- fix: resolve Python.h header missing by @zhyncs in #1119
- Fix the deadlock in multi-node tp by @merrymercy in #1122
- Mixed style of chunked prefill by @hnyls2002 in #1013
- Fix port conflicts between local CI and runner CI. by @hnyls2002 in #1131
- Fix CI accuracy && time out limit by @hnyls2002 in #1133
- fix: use fp16 dtype for sm75 by @zhyncs in #1136
- Improve the code style: more comments and remove useless packages by @merrymercy in #1139
- Improve benchmark by @merrymercy in #1140
- Fix duplicated imports in hf_transformers_utils.py by @merrymercy in #1141
- fixed a typo by @min-xu-et in #1143
- [Docs] Add instruction for running on clouds and kubernetes with SkyPilot by @Michaelvll in #1144
- [Feat] Add support for optional start len of logprobs by @yichuan520030910320 in #1035
- Optimize MLA/GQA/MQA Triton decoding by @ispobock in #1138
- feat: allow streaming for multi-prompt and/or parallel sampling by @vhain in #1134
- Improve docs and warnings by @merrymercy in #1164
- [Feature] add disable-custom-all-reduce by @Xu-Chen in #1148
- misc: add hypervisor vendor by @zhyncs in #1165
- support /v1/health using a generation 1 token by @LucienShui in #1154
- fix: resolve README render by @zhyncs in #1166
- [Feat] Support update weights without restart server by @shanyu-sys in #1157
- Improve multi-node stability by @merrymercy in #1171
- fix: custom op fallback forward native when lower sm80 by @zhyncs in #1177
- [Feature] Add a function to convert sampling_params to kwargs by @gryffindor-rr in #1170
- Support min-p sampling by @intervitens in #1167
- [Docs] Fix rendering of details in README by @Michaelvll in #1179
- Improve code style of sampler by @hnyls2002 in #1168
- [Minor] Improve logging and rename the health check endpoint name by @merrymercy in #1180
- Fix broken penalty by @hnyls2002 in #1184
- Fix benchmark script by @Ying1123 in #1185
- [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. by @kcz358 in #1123
- feat: use gelu_tanh_and_mul by @zhyncs in #1193
- Cleanup readme, llava examples, usage examples and nccl init by @merrymercy in #1194
- Update README.md by @merrymercy in #1198
- [CI] Fix the problem of hf runner too slow by @Ying1123 in #1202
- [Fix] the issue of random order when input is a list by @Ying1123 in #1199
- Relax the assert in moe throughput test to fix the flaky CI by @merrymercy in #1207
- [Fix] Fixing the multi-images error for llava-onevision by @kcz358 in #1205
- Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model by @zhaochenyang20 in #1186
- [Minor] Improve the function organization in TokenizerManager & improve loggers by @merrymercy in #1208
- [Minor] Temporarily skip flaky test by @Ying1123 in #1209
- [CI] Fix the issue of unit test hanging by @Ying1123 in #1211
- Update CI workflows by @merrymercy in #1210
- Update CI runner docs by @merrymercy in #1213
- [Feature] Support fp8 e5m2 kv cache with flashinfer by @ispobock in #1204
- Update workflow files by @merrymercy in #1214
- improve the threshold and ports in tests by @wisclmy0611 in #1215
- [CI] Fix CI by @wisclmy0611 in #1217
- [Fix] Multi-images loading error by @kcz358 in #1218
- [Minor] improve CI and dependencies by @hnyls2002 in #1212
- [CI] Parallelize unit tests in CI by @wisclmy0611 in #1219
- Move sampler into CUDA graph by @hnyls2002 in #1201
- chore: bump v0.2.14 by @zhyncs in #1155
- [FEAT] JSON constrained support by @havetc in #1125
- Torch compile CI throughput test by @hnyls2002 in #1223
- [FEAT] Support batches cancel by @caiyueliang in #1222
- [Minor] add delete test and delete tmp file on ci server by @yichuan520030910320 in #1227
- [FIX] Wrong logger by @havetc in #1230
- feat: replace get_act_fn for gpt_bigcode by @zhyncs in #1231
- Fix readme by @ArtificialZeng in #1236
- Fix bench latency benchmark by @hnyls2002 in #1225
- [Minor] Add more type annotations by @merrymercy in #1237
- feat: support sm75 with FlashInfer v0.1.6 by @zhyncs in #1233
- Update README.md by @merrymercy in #1239
- hotfix: revert sampler CUDA Graph by @zhyncs in #1242
- Add sglang.bench_latency to CI by @merrymercy in #1243
- fix: increase max_new_tokens when testing generation models by @zhyncs in #1244
- feat: update GemmaRMSNorm by @zhyncs in #1232
- Fix llava on multi images by @merrymercy in #1247
- feat: replace GeluAndMul by @zhyncs in #1234
- fix: resolve qwen2 moe weight loader by @zhyncs in #1252
- chore: bump v0.2.14.post2 by @zhyncs in #1250
- make json_schema usable from gen by @qeternity in #1254
- fix data racing due to mutable reference using deepcopy by @xiezhq-hermann in #1255
- Sampler cudagraph by @hnyls2002 in #1253
- fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by @lxww302 in #1260
- Transpose mla weight offline by @ispobock in #1261
- EXAONE 3.0 Model Support by @Deepfocused in #1258
- Update README Support Exaone 3.0 by @Deepfocused in #1267
- Report median instead of mean in bench_latency.py by @merrymercy in #1269
- Allow more flexible assistant and system response by @BabyChouSr in #1256
- fix: resolve the fp8 bug introduced by vLLM 0.5.5 by @zhyncs in #1276
- [doc] fix quick start link by @ByronHsu in #1282
- Optimize the update flashinfer indices by @xiaobochen123 in #1262
- [CI] Add more multi-gpu tests by @merrymercy in #1280
- feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by @zhyncs in #1285
- [CI] merge all ci tests into one file by @merrymercy i...
Release v0.2.13
Highlights
- New Feature: Support window attention for Gemma-2 (#1056 #1090 #1112), enable chunked-prefill by default (#1040 #984), support all sampling penalties (#973)
- New Models: Support for the embedding model e5-mistral (#983 #987 #988 #997 #1014) with a comprehensive OpenAI-compatible API.
- Performance: Accelerated Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek V2 (#905).
- More CI Tests: Accuracy test (multiple benchmarks), unit test (APIs, model implementations), E2E test (high pressure test, performance test), MoE test
- Refactor and fix: More modular, better stability, use more kernels from flashinfer (#907)
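As an illustration of the new embedding support, the sketch below calls the OpenAI-compatible embeddings endpoint of a locally running server. The launch command, port, and model identifier are assumptions for illustration only.

```python
# A minimal sketch (assumptions noted): call the OpenAI-compatible embeddings
# endpoint of a server launched with the e5-mistral model, e.g.
#   python -m sglang.launch_server --model-path intfloat/e5-mistral-7b-instruct --port 30000
# The launch command, port, and model identifier are illustrative, not prescriptive.
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

result = client.embeddings.create(
    model="default",  # placeholder; the server embeds with the model it was launched with
    input=["chunked prefill", "window attention in Gemma-2"],
)
print(len(result.data), len(result.data[0].embedding))
```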
What's Changed
- fix: set env in runner by @zhyncs in #891
- docs: update setup runner by @zhyncs in #884
- misc: update cuda graph capture exception log by @zhyncs in #894
- chore: add multipart dep for fastapi by @zhyncs in #895
- [minor] fixed code formatting doc by @min-xu-et in #896
- Bump version to 0.2.9.post1 by @Ying1123 in #899
- Update the base image of the docker by @Ying1123 in #900
- Reorder CI unit tests. by @hnyls2002 in #908
- fixed an error handling in bench_latency.py by @min-xu-et in #904
- Add model accuracy test - step 1 by @Ying1123 in #866
- latency test enhancement - part 1 by @min-xu-et in #909
- Improve the structure of CI by @Ying1123 in #911
- fix: use e2e and unit test only for original repo or pr by @zhyncs in #912
- misc: add triton in check_env PACKAGE_LIST by @zhyncs in #914
- Support MLA for DeepSeek-V2 with Triton - step 1 by @ispobock in #905
- enhance latency test - part 2 by @min-xu-et in #915
- Make API Key OpenAI-compatible by @Ying1123 in #917
- Update hyperparameter_tuning.md by @Ying1123 in #918
- Fix CI && python3.8 compatible by @hnyls2002 in #920
- Support more OpenAI API test by @yichuan520030910320 in #916
- Bump version to 0.2.10 by @Ying1123 in #923
- latency test enhancement - final part by @min-xu-et in #921
- Test openai vision api by @Ying1123 in #925
- Test regex in vision api by @Ying1123 in #926
- Update README.md by @Ying1123 in #927
- Fix prompt len in parallel sampling by @yichuan520030910320 in #928
- docs: update README by @zhyncs in #935
- Remove leftover auth_token by @AidanCooper in #934
- Feat: add alternative choices selection methods by @AidanCooper in #835
- Fix union operator by @ispobock in #940
- Support multiple args options by @yichuan520030910320 in #941
- Fix stuck in `get_new_prefill_batch` by @hnyls2002 in #948
- Organize code (rename, movement) by @hnyls2002 in #953
- fix nsys cannot profile cuda kernel by @mpjlu in #957
- Add support for Batch API test by @yichuan520030910320 in #936
- Show more error messages for warmup errors by @Ying1123 in #932
- misc: update issue template by @zhyncs in #963
- misc: simplify test by @yichuan520030910320 in #964
- misc: add compute capability in check_env by @zhyncs in #965
- Make `req_pool_indices` on CPU by @hnyls2002 in #960
- misc: fix the req_to_token member change by @hnyls2002 in #967
- chore: update vllm to 0.5.4 by @zhyncs in #966
- chore: bump v0.2.11 by @zhyncs in #970
- Purge self-runner's pip cache weekly by @hnyls2002 in #975
- Run purge-cache only in sgl-project by @hnyls2002 in #976
- misc: correct the int data type for token ids and indices by @xiezhq-hermann in #969
- PrefillAdder abstraction by @hnyls2002 in #968
- RadixCache method adjust by @hnyls2002 in #977
- Adjust max prefix len by @hnyls2002 in #980
- #590 Increase default , track changes in examples and documentation by @foszto in #971
- [minor] Update type annotation in tokenizer_manager.py by @Ying1123 in #982
- Fix chunked prefill by @hnyls2002 in #984
- Add llama embedding modules [unreachable code] - step 1/3 by @Ying1123 in #983
- Add io struct for embedding models [unreachable code] - step 2/3 by @Ying1123 in #987
- Adjust `InputMetadata` and `ScheduleBatch` by @hnyls2002 in #981
- support more options about usage in stream mode by @yichuan520030910320 in #985
- Create contributor_guide.md by @Ying1123 in #992
- feat: frequency, min_new_tokens, presence, and repetition penalties by @vhain in #973
- Move torch.compile configs into cuda_graph_runner.py by @Ying1123 in #993
- Add e5-mistral embedding model - step 3/3 by @Ying1123 in #988
- test: negative value testing for frequency, presence penalizers by @vhain in #995
- support models from www.modelscope.cn by @liuyhwangyh in #994
- bugfix: penalizers to be merged before reqs by @vhain in #1001
- fix: resolve correctness_test issue by @zhyncs in #1002
- Minor bugfix on benchmark serving by @ywang96 in #1005
- Add openai embedding API by @Ying1123 in #997
- Add skip_tokenizer_init args. by @gryffindor-rr in #959
- Fix benchmark latency by @wisclmy0611 in #1007
- Some warnings to crash when CI by @hnyls2002 in #1009
- Reduce the overhead when cache is disabled by @hnyls2002 in #1010
- Support embedding input as a list by @Ying1123 in #1014
- misc: update test config by @zhyncs in #990
- fix: force max new tokens to be 1 for embedding request by @Ying1123 in #1019
- Clean up unit tests by @merrymercy in #1020
- Fix `input_ids` && rename to `fill_ids` by @hnyls2002 in #1021
- feat: use FlashInfer rmsnorm and silu by @zhyncs in #907
- misc: update issue template by @zhyncs in #1024
- Clean up readme and arguments of chunked prefill by @merrymercy in #1022
- Fix wrong assert by @hnyls2002 in #1028
- Improve type annotation by @merrymercy in #1029
- hotfix: add CustomOp abstraction by @zhyncs in #1027
- Fix the case where r.prefix_indices is None by @merrymercy in #1031
- Fix triton args init by @hnyls2002 in #1034
- Fix the case when max_new_tokens is too large by @merrymercy in #1025
- Test the case when max_new_tokens is very large by @merrymercy in #1038
- Fix the prefix indices by @hnyls2002 in #1037
- Improve end-to-end throughput test and its coverage by @merrymercy in #1039
- Delete the useless test/srt/test_throughput.py by @merrymercy in #1045
- minor: some potential bugs by @hnyls2002 in #1044
- Clean up the comments and names under python/sglang/srt/layers by @merrymercy in #1047
- fix...
Release v0.2.9
Highlights
- New feature: Chunked prefill (#800, #811)
- New models: Deepseek v2
- Performance improvement: vectorized logprob computation
- Accuracy fix: fix the double BOS problem in the chat template; move logits to float32; update flashinfer sampling kernels
- Feature fix: fixed many missing logprob-related features in the OpenAI API server
- CI/CD infra is now fully ready. The tests cover frontend, backend, accuracy, and performance tests.
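A minimal sketch of the logprob features touched in this release, using the OpenAI-compatible completions endpoint. The port and model name are placeholders, and whether `echo` is honored depends on the server version.

```python
# A minimal sketch: request token logprobs through the OpenAI-compatible
# completions endpoint. Port and model name are placeholders.
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="default",              # placeholder for the locally served model
    prompt="The capital of France is",
    max_tokens=4,
    logprobs=3,                   # top-3 alternatives per generated token
    echo=True,                    # also return logprobs for the prompt tokens, if supported
)
print(completion.choices[0].logprobs)
```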
What's Changed
- Deepseek v2 support by @hnyls2002 in #693
- Fix context length by @hnyls2002 in #757
- docs: update model support by @zhyncs in #760
- fix: not run workflows on fork repo by @zhyncs in #762
- Update supported models by @hnyls2002 in #763
- Fix TransformerTokenizer init for chatglm2 & 3 by @ispobock in #761
- [Minor] Improve the code style in TokenizerManager by @merrymercy in #767
- Update readme by @Ying1123 in #769
- feat: add fake tag by @zhyncs in #770
- Fix max_tokens for OpenAI chat completion API by @merrymercy in #766
- Fix max new tokens by @merrymercy in #772
- Move sampling logits to float32 by @merrymercy in #773
- minor refactor: move check server args to server_args.py by @wisclmy0611 in #774
- Fix return_log_probs with cuda graph by @merrymercy in #775
- Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs by @merrymercy in #776
- Allow disabling flashinfer sampling kernel by @merrymercy in #778
- Bump version to 0.2.6 by @merrymercy in #779
- fix: replace pillow with PIL in PACKAGE_LIST by @zhyncs in #781
- docs: init readthedocs support by @zhyncs in #783
- fix: init readthedocs support by @zhyncs in #784
- fix: exclude logo png in gitignore by @zhyncs in #785
- docs: update index by @zhyncs in #786
- Vectorize logprobs computation by @Ying1123 in #787
- docs: update README by @zhyncs in #788
- docs: make badges center by @zhyncs in #789
- chore: add copyright for srt by @zhyncs in #790
- Fix echo + lobprob for OpenAI API when the prompt is a list by @Ying1123 in #791
- Update README.md by @Ying1123 in #792
- Lazy-import third-party backends by @bgyoon in #794
- Fix lazy import location by @Ying1123 in #795
- Fix logging by @Ying1123 in #796
- Add role documentation, add system begin & end tokens by @objnf-dev in #793
- Chunked prefill support by @hnyls2002 in #797
- Revert "Chunked prefill support" by @Ying1123 in #799
- Chunked prefill by @hnyls2002 in #800
- fix: update flashinfer to 0.1.2 to fix sampling for cu118 by @zhyncs in #803
- Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" by @Ying1123 in #805
- feat: add chat template for internlm2-chat by @zhyncs in #802
- Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" by @Ying1123 in #806
- Add support for OpenAI API : offline batch(file) processing by @yichuan520030910320 in #699
- Organize public APIs by @hnyls2002 in #809
- Remove inf value for chunked prefill size by @hnyls2002 in #812
- Revert "Organize public APIs" by @Ying1123 in #815
- fix: use v0.2.5 for benchmark by @zhyncs in #814
- Fix LiteLLM kwargs by @qeternity in #817
- Code structure refactor by @hnyls2002 in #807
- docs: update README by @zhyncs in #819
- Fix streaming bug by @objnf-dev in #820
- feat: add runner by @zhyncs in #821
- feat: add pr e2e test by @zhyncs in #822
- Support disable_ignore_eos in bench_serving.py by @Ying1123 in #824
- Adjust default mem fraction to avoid OOM by @Ying1123 in #823
- Add awq_marlin by @Ying1123 in #826
- misc: update e2e test benchmark config by @zhyncs in #825
- misc: enable e2e test when push by @zhyncs in #828
- docs: add set up runner by @zhyncs in #829
- chore: bump v0.2.7 by @zhyncs in #830
- Add `--max-total-tokens` by @hnyls2002 in #840
- Fix List input bug by @yichuan520030910320 in #838
- Add req slots leaking check by @hnyls2002 in #842
- docs: update README.md by @eltociear in #843
- misc: update e2e test paths config by @zhyncs in #848
- chore: update flashinfer to v0.1.3 by @zhyncs in #850
- Fix llama for classification by @Ying1123 in #855
- Add troubleshooting doc by @Ying1123 in #856
- Fix #857 by @kaifronsdal in #858
- Add support for logprobs in OpenAI chat API by @yichuan520030910320 in #852
- Support chunked prefill when radix cache is disabled by @hnyls2002 in #811
- misc: update e2e test paths config by @zhyncs in #860
- Rename github workflows by @Ying1123 in #861
- misc: disable auto release by @zhyncs in #862
- misc: add cancel previous at e2e by @zhyncs in #864
- Add OpenAI backend to the CI test by @Ying1123 in #869
- Fix openai CI tests by @Ying1123 in #870
- misc: use pip cache purge and add unit test ci by @zhyncs in #871
- misc: update unit test config by @zhyncs in #873
- Fix unit tests for the frontend language part by @Ying1123 in #872
- bump to 0.2.8 by @Ying1123 in #877
- Make scripts under `/test/srt` as unit tests by @Ying1123 in #875
- Update runner docs by @hnyls2002 in #876
- Improve the coverage of the openai api server test by @Ying1123 in #878
- Implement served_model_name to customize model id when use local mode… by @dionren in #749
- Update runner docs by @hnyls2002 in #879
- Add more unit tests to CI by @Ying1123 in #880
- Add accuracy test to CI: MMLU by @Ying1123 in #882
- Update workflow name by @Ying1123 in #883
- Fix the double BOS problem in the HF chat template by @Ying1123 in #888
- Add benchmark: HumanEval by @Ying1123 in #889
- Increase openai client limit by @Ying1123 in #886
- Bump version to v0.2.9 by @Ying1123 in #890
New Contributors
- @bgyoon made their first contribution in #794
- @objnf-dev made their first contribution in #793
- @kaifronsdal made their first contribution in #858
- @dionren made their first contribution in #749
Full Changelog: v0.2.5...v0.2.9
Release v0.2.5
Highlights
- We recently released a blog post. Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. SGLang consistently outperforms vLLM, achieving up to 3.1x higher throughput on Llama-70B, and it often matches or sometimes outperforms TensorRT-LLM.
- We have now automated the release processes for PyPI, Docker, and GitHub releases using GitHub workflows. Previously, because releases were not automated, GitHub tags were not updated in time, leading to a jump from v0.2.0 directly to v0.2.5.
- We welcome everyone to try https://github.com/sgl-project/sglang and to participate actively in the community, including but not limited to issues, PRs, and discussions. Cheers!
Release v0.2.0
Highlights
- We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest blog.
- New models: Llama3 405B, Deepseek MoE, InternLM, GPTBigCode, Mistral-Nemo
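For readers who want to reproduce the performance comparison, the sketch below runs the serving benchmark against an already-launched server. The module name `sglang.bench_serving` and its flags are assumptions based on the benchmark scripts referenced in this changelog; check the repository's `benchmark/` directory or the blog for the exact entry point.

```python
# A minimal sketch (module name and flags are assumptions): run the serving
# benchmark against an already running SGLang server.
import subprocess

subprocess.run(
    [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",      # assumed flag: which serving backend to benchmark
        "--num-prompts", "1000",    # assumed flag: number of sampled requests
    ],
    check=True,
)
```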
What's Changed
- Optimize mem indices mangement by @hnyls2002 in #619
- Unify index operations by @hnyls2002 in #620
- Simplify mem state by @wisclmy0611 in #623
- Improve tensor parallel performance by @Ying1123 in #625
- Bump version to 0.1.21 by @Ying1123 in #626
- Fix model forward grad by @hnyls2002 in #628
- Update docker file by @Ying1123 in #629
- Disable NCCL_NVLS by default by @Ying1123 in #631
- Add qwen2 tie word embedding by @yileld in #630
- Add support for VertexAI safety settings by @AidanCooper in #624
- Fix vertexai by @hnyls2002 in #633
- Reduce docker size by @hnyls2002 in #632
- clean up step function by @Ying1123 in #635
- feat: support internlm2 by @zhyncs in #636
- misc: add pre-commit config by @zhyncs in #637
- misc: add issue and pr template by @zhyncs in #638
- Flashinfer sample kernel by @hnyls2002 in #617
- Move `global_server_args_dict` by @hnyls2002 in #642
- Increase the capacity of the memory pool by @Ying1123 in #643
- feat: add check_env by @zhyncs in #645
- Remove the dependency of rpyc by @wisclmy0611 in #646
- misc: rm rpyc from PACKAGE_LIST by @zhyncs in #649
- fix: set ulimit -n 65535 by @zhyncs in #647
- feat: add lint workflow by @zhyncs in #648
- fix: resolve lint error by @zhyncs in #650
- Remove useless variables in infer_batch.py by @Ying1123 in #651
- Detokenize incrementally when streaming by @hnyls2002 in #653
- `TokenizerManager.context_len` should inherit from `server_args.conte… by @shrirajh in #654
- Remove cached triton launcher by @merrymercy in #656
- perf: reduce ttft and itl with stream_interval 1 by @zhyncs in #658
- feat: add benchmark serving by @zhyncs in #657
- refactor model loader [unreachable code]: initial refactor by @Ying1123 in #655
- misc: update SGLang package description by @zhyncs in #659
- Update Readme by @Ying1123 in #660
- feat: update check env by @zhyncs in #661
- Improve docs by @Ying1123 in #662
- Add benchmark instructions by @Ying1123 in #663
- Fix jump forward when streaming by @hnyls2002 in #665
- Fix kill process util by @ispobock in #666
- Add support for OpenAI API parallel sampling by @yichuan520030910320 in #640
- Update OpenAI API by @wisclmy0611 in #667
- Temporary fix invalid sample results by @hnyls2002 in #668
- Support random dataset in bench_serving.py by @merrymercy in #669
- Revert "Temporary fix invalid sample results" by @hnyls2002 in #673
- refactor model loader: initial refactor by @Ying1123 in #664
- Fix cuda graph with flashinfer by @merrymercy in #675
- Tmp fix illegal sample by @hnyls2002 in #676
- Update version to 0.1.22 by @Ying1123 in #677
- Fallback when sampling failed by @ispobock in #678
- feat: support TRT LLM benchmark and multiple benchmarks by @zhyncs in #670
- Decouple kv by @hnyls2002 in #679
- Support gpt-bigcode model class by @hnyls2002 in #681
- support non-streaming benchmark by @merrymercy in #682
- Fix StreamExecutor.fork() losing the current role start index. by @max99x in #684
- feat: update bench serving by @zhyncs in #685
- misc: update output file logic by @zhyncs in #686
- Allow disabling streaming in bench by @merrymercy in #687
- docs: update README by @zhyncs in #688
- Support Deepseek MoE Model by @hnyls2002 in #689
- misc: recommend to use chat model for benchmark by @zhyncs in #690
- Support Mistral-Nemo by @ispobock in #691
- docs: update README by @zhyncs in #692
- fix: update bench serving by @zhyncs in #694
- misc: update output token logic by @zhyncs in #695
- Tune params by @Ying1123 in #696
- Fix trt benchmark by @Ying1123 in #697
- misc: fix typo by @zhyncs in #698
- Fix flashinfer by @Ying1123 in #700
- Fix hf config loading by @ispobock in #702
- Use min new token ratio at start by @hnyls2002 in #701
- feat: add e2e latency by @zhyncs in #704
- Update vllm version to support llama3.1 by @Ying1123 in #705
- bump version to 0.1.23 by @Ying1123 in #706
- Reduce hardcoded logic of kernel usage by @wisclmy0611 in #707
- Fix multi-node deadlock by @merrymercy in #709
- Auto adjust new ratio by @hnyls2002 in #708
- Fix prefill size by @Ying1123 in #711
- docs: update README by @zhyncs in #712
- docs: update doc by @zhyncs in #713
- fix: llama 3.1 405b fp8 by @zhyncs in #714
- misc: update doc by @zhyncs in #715
- Improve benchmark scripts by @Ying1123 in #717
- Bump version to 0.1.24 by @Ying1123 in #718
- docs: update supported models by @zhyncs in #719
- docs: update comment by @zhyncs in #721
- chore: add close inactive issues workflow by @zhyncs in #722
- misc: update build instruction by @zhyncs in #724
- fix: fp8 config by @Ying1123 in #723
- Fix dockerfile and triton cache manager by @hnyls2002 in #720
- chore: bump v0.1.25 by @zhyncs in #725
- fix: resolve the logo display issue on the PyPI page by @zhyncs in #726
- misc: update bug issue template by @zhyncs in #727
- Revert "fix: fp8 config" by @Ying1123 in #728
- Fix bugs (fp8 checkpoints, triton cache manager) by @Ying1123 in #729
- Bump version to 0.2.0 by @Ying1123 in #730
New Contributors
- @yileld made their first contribution in #630
- @AidanCooper made their first contribution in #624
- @zhyncs made their first contribution in #636
- @shrirajh made their first contribution in #654
- @yichuan520030910320 made their first contribution in https://github.com/...
Release v0.1.20
Highlights
- Enable CUDA graph by default. It brings 1.5x - 2x speedup for small batch size decoding (#612)
- Model support: Gemma2, minicpm, Qwen2 MoE
- Docker support (#217)
- Various latency optimizations
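A minimal sketch of how the CUDA-graph speedup could be measured with the offline latency benchmark. All flag names here are assumptions; verify them with `python -m sglang.bench_latency --help` on your installed version.

```python
# A minimal sketch (all flag names are assumptions): compare small-batch decoding
# latency with CUDA graphs on (the new default) and off.
import subprocess

base = [
    "python", "-m", "sglang.bench_latency",
    "--model-path", "meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    "--batch-size", "1",
    "--input-len", "128",
    "--output-len", "256",
]

subprocess.run(base, check=True)                             # CUDA graph enabled by default
subprocess.run(base + ["--disable-cuda-graph"], check=True)  # assumed flag to turn it off
```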
What's Changed
- Add docker file by @Ying1123 in #588
- Add Gemma2 by @Ying1123 in #592
- Format by @Ying1123 in #593
- Fix Llava model by @wisclmy0611 in #594
- Add `--enable-p2p-check` option by @hnyls2002 in #599
- Fix streaming by @hnyls2002 in #600
- Reduce number of workspaces for flashinfer by @wisclmy0611 in #601
- add `LogitsMetadata` by @hnyls2002 in #604
- add minicpm support by @Titan-p in #602
- Make sglang compat with vllm 0.5.1 by @M0gician in #598
- Add Qwen2 MoE support by @M0gician in #603
- Update chat template for qwen and yi-1.5. by @for-just-we in #530
- [Feat] Expose logprob options to `sgl.gen` API by @huyiwen in #503
- Fix bench latency by @merrymercy in #607
- Code clean up: Remove deprecated prefill move InputMetadata to infer_batch.py by @merrymercy in #609
- Clean up the usage of flashinfer by @merrymercy in #610
- Cleanup attention backend: flashinfer and triton by @merrymercy in #611
- Enable cuda graph by default by @merrymercy in #612
- Improve benchmark scripts & fix llava by @merrymercy in #613
- Memorypool chunked prefetch by @hnyls2002 in #614
- Improve benchmark scripts by @merrymercy in #615
- Fix memory pool index error by @Ying1123 in #616
- Bump version to 0.1.20 by @merrymercy in #618
New Contributors
- @wisclmy0611 made their first contribution in #594
- @Titan-p made their first contribution in #586
- @M0gician made their first contribution in #598
- @for-just-we made their first contribution in #530
Full Changelog: v0.1.18...v0.1.20
Release v0.1.18
Highlights
- 2x large batch prefill improvement with the new flashinfer kernels #579
- Multi-node tensor parallelism #550
- New model support: ChatGLM #516
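A hedged sketch of launching one node of a two-node tensor-parallel deployment. Every flag name below is an assumption for illustration; consult the server-argument documentation of this release for the options it actually supports.

```python
# A hedged sketch: launch one node of a two-node tensor-parallel deployment.
# All flags below are assumptions, not confirmed options of this release.
import subprocess

subprocess.run(
    [
        "python", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model
        "--tp-size", "16",                     # assumed: total tensor-parallel degree across nodes
        "--nnodes", "2",                       # assumed: number of participating nodes
        "--node-rank", "0",                    # assumed: this node's rank (run rank 1 on the other host)
        "--dist-init-addr", "10.0.0.1:50000",  # assumed: rendezvous address of node rank 0
    ],
    check=True,
)
```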
What's Changed
- Fix missing numpy dependency in pyproject.toml by @fpreiss in #524
- Fix RAG nb, parea setup (parea -> parea-ai) by @fpreiss in #525
- [Minor] Correct Optional type hints in api by @fpreiss in #526
- Add ChatGLM Model Support by @Qubitium in #516
- Fix Regression: Disable p2p for 4090 by @ZX-ModelCloud in #531
- Decode Incrementally by @hnyls2002 in #517
- Fix dependency by @merrymercy in #538
- Fix dependency & crash issues by @Ying1123 in #539
- Higher priority for user input of max_prefill_tokens & format by @Ying1123 in #540
- Add disk cache for loading ShareGPT dataset. by @hnyls2002 in #542
- Fix tp worker only checking req[0] for stream by @Qubitium in #546
- Fix the Jump-Forward with Chinese by @hnyls2002 in #551
- Update fused_moe by @merrymercy in #553
- Multi-node Tensor Parallelism by @Ying1123 in #550
- Update flashinfer to 0.0.5 by @merrymercy in #554
- Follow-up fixes for flashinfer 0.0.5 by @merrymercy in #556
- Fix latency benchmark by @hnyls2002 in #557
- Clean up logits processor by @merrymercy in #558
- Update test_flashinfer by @hnyls2002 in #560
- Allow running with vllm==0.4.3 by @merrymercy in #561
- Add a new arguments log_level_http to control the HTTP logging by @merrymercy in #563
- Add sglang.bench_latency for offline benchmark by @merrymercy in #564
- Warmup cublas by @merrymercy in #566
- Increase the number of thread limitation for tp worker managers. by @merrymercy in #567
- Update readme by @merrymercy in #568
- Expose dtype argument by @merrymercy in #569
- Update benchmark script by @Ying1123 in #571
- Minor fix in compiler & format by @ZackZeng999 in #545
- Update run_batch interface and max_prefill_tokens by @Ying1123 in #574
- Fix flashinfer version by @PanJason in #576
- [BugFix] gemma loading weights "lm_head.weight" key error by @dhgarcia in #577
- Turn on flashinfer by default by @Ying1123 in #578
- fix the broken server args by @hnyls2002 in #585
- 2x performance improvement for large prefill & Fix workspace conflicts by @Ying1123 in #579
New Contributors
- @fpreiss made their first contribution in #524
- @ZackZeng999 made their first contribution in #545
- @PanJason made their first contribution in #576
- @dhgarcia made their first contribution in #577
Full Changelog: v0.1.17...v0.1.18
Release v0.1.17
Highlights
- Add data parallelism #480
- Add speculative execution for OpenAI API #250
- Update vllm to v0.4.3 for new quantization features #511
- Better error handling (#457, #449, #514)
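A minimal sketch of the frontend pattern that the OpenAI speculative-execution feature targets. The `num_api_spec_tokens` knob is an assumption based on the rename noted in the changelog below; the rest follows the documented `sgl` API.

```python
# A minimal sketch of the frontend pattern that OpenAI speculative execution
# targets. The num_api_spec_tokens keyword is an assumption; the rest is the
# documented sgl frontend API.
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

# Speculative execution would be configured on the backend (assumed keyword):
# sgl.OpenAI("gpt-3.5-turbo", num_api_spec_tokens=64)
sgl.set_default_backend(sgl.OpenAI("gpt-3.5-turbo"))

state = qa.run(question="What is data parallelism?")
print(state["answer"])
```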
What's Changed
- [Feat] Add llava qwen, llava mistral by @kcz358 in #419
- Format code by @hnyls2002 in #441
- Add finish_reason to OpenAI API by @mgerstgrasser in #446
- Simplify port allocation by @merrymercy in #447
- Add PUT for generate api by @Ying1123 in #448
- Improve error handling & abort disconnected requests by @merrymercy in #449
- Fix the broken `--disable-radix-cache` by @hnyls2002 in #451
- openai chat speculative execution by @ChuyueSun in #250
- Fix openai speculative execution by @Ying1123 in #456
- Abort disconnected requests by @merrymercy in #457
- Rename api_num_spec_tokens -> num_api_spec_tokens by @merrymercy in #458
- Use model loader from vllm by @merrymercy in #459
- port fp8 mixtral by @merrymercy in #460
- fix test bug in srt_llava_next_test.py by @bingwork in #470
- Add the instruction link to the LLaVA-NeXT-Video at README by @ZhangYuanhan-AI in #463
- Improve logging & add logit cap by @merrymercy in #471
- Optimize retract by @hnyls2002 in #440
- Add benchmark scripts by @Ying1123 in #476
- [Feat/Fix] Refactoring Llava models into single file by @Luodian in #475
- Improve benchmark scripts & rename some scripts by @merrymercy in #477
- Improve benchmark scripts & add more models by @merrymercy in #484
- Support data parallelism (static) by @Ying1123 in #480
- Make the server random by default by @merrymercy in #488
- Revert "Make the server random by default" by @Ying1123 in #492
- update the script: examples/usage/llava_video/srt_example_llava_v.sh by @ZhangYuanhan-AI in #491
- Make the server random by default by @merrymercy in #493
- Update vllm to v0.4.3 by @merrymercy in #511
- remove redundant pad_input_ids function by @amosyou in #500
- Litellm Backend by @huyiwen in #502
- Fix rid state map leak + Refractor .finished by @Qubitium in #505
- Crash the server when error or OOM happens by @merrymercy in #514
- Update version to 0.1.17 by @merrymercy in #515
New Contributors
- @kcz358 made their first contribution in #419
- @mgerstgrasser made their first contribution in #446
- @bingwork made their first contribution in #470
- @amosyou made their first contribution in #500
- @huyiwen made their first contribution in #502
Full Changelog: v0.1.16...v0.1.17
v0.1.16
Highlights
- Support more models: DBRX, Command-R, Gemma
- Support llava-video (#423, https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)
- Cache performance improvements (#418, #364)
- Marlin quantization kernels
- Many bug fixes
- Update dependencies to be compatible with their latest versions
What's Changed
- Fix Runtime missing some ServerArgs options by @Qubitium in #281
- adding the triton docker build minimal example by @amirarsalan90 in #242
- Fix flashinfer >= 0.0.3 compat by @Qubitium in #282
- Fix Incorrect CURL Request Example in README by @amirarsalan90 in #287
- enable marlin kernels by @qeternity in #286
- Fix env (docker) compat due to file usage by @Qubitium in #288
- Fix marlin model loading compat with autogptq by @Liurl21 in #290
- Fix outlines-0.0.35 incompatibility by @ZhouGongZaiShi in #291
- [Fix/Potential Bugs] Can not correctly import models in python/sglang/srt/models by @Luodian in #311
- Use Anthropic messages API by @janimo in #304
- Add StableLM model. by @janimo in #301
- Support oai in benchmark/mmlu by @merrymercy in #323
- Update version to v0.1.14 by @merrymercy in #324
- Cleanup codebase: removed unnecessary code/logic by @Qubitium in #298
- Update dependencies by @janimo in #326
- Openrouter usage example by @janimo in #327
- `model_rpc` style improvement by @hnyls2002 in #293
- `model_runner` simplify by @hnyls2002 in #329
- Logprobs Refactor by @hnyls2002 in #331
- `DBRX` support by @hnyls2002 in #337
- Add support for new autogptq quant_config.checkpoint_format by @Qubitium in #332
- Fix llava parallelism/fork bug by @lockon-n in #315
- Eliminate 2 gpu ops during sampling when logit_bias is zero by @hnyls2002 in #343
- Revert "Eliminate 2 gpu ops during sampling when logit_bias is zero" by @hnyls2002 in #345
- Eliminate 2 gpu ops during sampling when logit_bias is zero by @Qubitium in #338
- Add timeout to get_meta_info by @SimoneRaponi in #346
- Fix typos in infer_batch.py by @tom-doerr in #354
- Time cost utils by @hnyls2002 in #355
- Update README.md by @eltociear in #358
- support `command-r` by @ZhouXingg in #369
- Fix issue #367 – System message not supported for Anthropic (anthropic.BadRequestError) by @fronx in #368
- Update model support in readme by @Ying1123 in #370
- Optimize radix tree matching by @ispobock in #364
- Reduce overhead when `fork(1)` by @hnyls2002 in #375
- llama3 instruct template by @qeternity in #372
- add `.isort.cfg` by @hnyls2002 in #378
- Revert removing the unused imports by @hnyls2002 in #385
- Benchmark Updates by @hnyls2002 in #382
- Improve performance when running with full parallel by @hnyls2002 in #394
- Minor: style improvement of radix_cache and memory_pool by @hnyls2002 in #395
- Format Benchmark Code by @hnyls2002 in #399
- Fix chatml template by @merrymercy in #406
- Adding RAG tracing & eval cookbook using Parea by @joschkabraun in #390
- SamplingParams add "spaces_between_special_tokens" argument by @ZhouXingg in #392
- Organize Benchmark by @hnyls2002 in #381
- Add Cohere Command R chat template by @noah-kim-theori in #411
- Fix `sync()` when `fork(1)` by @hnyls2002 in #412
- Include finish reason in meta info response by @qeternity in #415
- Make public APIs more standard. by @hnyls2002 in #416
- Compat with latest VLLM 0.4.2 main + fork.number rename + Flashinfer 0.0.4 by @Qubitium in #380
- Optimize the memory usage of logits processor by @merrymercy in #420
- Clean up by @merrymercy in #422
- Fix logit processor bugs by @merrymercy in #427
- Minor fix for the import path by @merrymercy in #428
- Move openai api server into a separate file by @merrymercy in #429
- Fix flashinfer by @merrymercy in #430
- Update version to 0.1.15 by @merrymercy in #431
- Misc fixes by @merrymercy in #432
- Allow `input_ids` in the input of the `/generate` endpoint by @lolipopshock in #363
- Improve error handling by @merrymercy in #433
- Cache optimizations by @hnyls2002 in #418
- Update readme by @merrymercy in #434
- Raise errors for prompts that are too long by @merrymercy in #436
- support llava video by @ZhangYuanhan-AI in #426
- Fix streaming by @merrymercy in #437
- Update version to 0.1.16 by @merrymercy in #438
New Contributors
- @Qubitium made their first contribution in #281
- @amirarsalan90 made their first contribution in #242
- @Liurl21 made their first contribution in #290
- @ZhouGongZaiShi made their first contribution in #291
- @Luodian made their first contribution in #311
- @janimo made their first contribution in #304
- @lockon-n made their first contribution in #315
- @SimoneRaponi made their first contribution in #346
- @tom-doerr made their first contribution in #354
- @ZhouXingg made their first contribution in #369
- @fronx made their first contribution in #368
- @ispobock made their first contribution in #364
- @joschkabraun made their first contribution in #390
- @noah-kim-theori made their first contribution in #411
- @lolipopshock made their first contribution in #363
- @ZhangYuanhan-AI made their first contribution in #426
Full Changelog: v0.1.13...v0.1.16
Release v0.1.13
Highlights
- Gemma Support by @hnyls2002 in #256
- Add Together and AzureOpenAI examples by @merrymercy in #184
What's Changed
- correct a mistake on the README.md by @yaya-sy in #182
- correct reference dtype openai.py by @yaya-sy in #181
- Add Together and AzureOpenAI examples by @merrymercy in #184
- Fix server launch for jupyter notebook by @merrymercy in #186
- Refactor decoding logprob and add completion_tokens_wo_jump_forward by @comaniac in #189
- Pin outlines version by @comaniac in #196
- Adjust outlines version. by @hnyls2002 in #200
- Update README.md by @eltociear in #207
- Added the ability to Modify the Context Length by @psych0v0yager in #210
- Fix logprobs with logprob_start_len by @comaniac in #193
- Support outlines > 0.0.31 by @comaniac in #219
- Fix stop str merging by @hnyls2002 in #225
- Fix interpreter.py `get_var(var_name)` in text iter when `stream` is not enabled by @exceedzhang in #198
- fix chatml template by @qeternity in #195
- Upload `agent_calls.jsonl` download link by @hnyls2002 in #226
- Fix addr reuse in check_port by @hnyls2002 in #253
- Add SSL Cert Functionality by @nivibilla in #224
- Refactor ChatTemplate for Enhanced Clarity and Efficiency by @cubxxw in #201
- Add `set_var` to interpreter.py by @1024th in #263
- Add logo by @merrymercy in #275
- Fix qwen config by @hnyls2002 in #261
- replace skip_embed with input_embeds by @TideDra in #222
- Gemma Support by @hnyls2002 in #256
- Improve gemma and documentations by @merrymercy in #278
- Organize `server_args` by @hnyls2002 in #277
- Add Support for API Key Authentication by @alessiodallapiazza in #230
- Fix RuntimeEndpoint by @merrymercy in #279
- Update version to v0.1.13 by @merrymercy in #280
New Contributors
- @psych0v0yager made their first contribution in #210
- @exceedzhang made their first contribution in #198
- @qeternity made their first contribution in #195
- @cubxxw made their first contribution in #201
- @1024th made their first contribution in #263
- @TideDra made their first contribution in #222
- @alessiodallapiazza made their first contribution in #230
Full Changelog: v0.1.12...v0.1.13