Releases: sgl-project/sglang
Release v0.3.0
Highlights
Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.
- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision
- Support for interleaved window attention and 2x longer context length in Gemma-2
- Chunked prefill is turned on by default (you can choose to separate or to mix prefill and decode).
- Added multi-GPU accuracy tests, performance tests, and a nightly accuracy test for more models.
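The sketch below shows one way to exercise these defaults: launch a server and send a request through the OpenAI-compatible endpoint. The launch command, the `--enable-torch-compile` flag, the port, and the model path are illustrative assumptions rather than part of this release note; see the blog post and `python -m sglang.launch_server --help` for the options available in your version.

```python
# A minimal sketch, not taken from the release itself: query a locally launched
# SGLang server through its OpenAI-compatible API. The launch command, the
# --enable-torch-compile flag, the port, and the model path are assumptions.
#
#   python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2-Lite \
#       --enable-torch-compile --port 30000
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; the server answers with whichever model it was launched with
    messages=[{"role": "user", "content": "Explain multi-head latent attention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```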
What's Changed
- update hyperparameter guide by @merrymercy in #1114
- ci: compatible with fork repo by @zhyncs in #1115
- fix: resolve Python.h header missing by @zhyncs in #1119
- Fix the deadlock in multi-node tp by @merrymercy in #1122
- Mixed style of chunked prefill by @hnyls2002 in #1013
- Fix port conflicts between local CI and runner CI. by @hnyls2002 in #1131
- Fix CI accuracy && time out limit by @hnyls2002 in #1133
- fix: use fp16 dtype for sm75 by @zhyncs in #1136
- Improve the code style: more comments and remove useless packages by @merrymercy in #1139
- Improve benchmark by @merrymercy in #1140
- Fix duplicated imports in hf_transformers_utils.py by @merrymercy in #1141
- fixed a typo by @min-xu-et in #1143
- [Docs] Add instruction for running on clouds and kubernetes with SkyPilot by @Michaelvll in #1144
- [Feat] Add support for optional start len of logprobs by @yichuan520030910320 in #1035
- Optimize MLA/GQA/MQA Triton decoding by @ispobock in #1138
- feat: allow streaming for multi-prompt and/or parallel sampling by @vhain in #1134
- Improve docs and warnings by @merrymercy in #1164
- [Feature] add disable-custom-all-reduce by @Xu-Chen in #1148
- misc: add hypervisor vendor by @zhyncs in #1165
- support /v1/health using a generation 1 token by @LucienShui in #1154
- fix: resolve README render by @zhyncs in #1166
- [Feat] Support update weights without restart server by @shanyu-sys in #1157
- Improve multi-node stability by @merrymercy in #1171
- fix: custom op fallback forward native when lower sm80 by @zhyncs in #1177
- [Feature] Add a function to convert sampling_params to kwargs by @gryffindor-rr in #1170
- Support min-p sampling by @intervitens in #1167
- [Docs] Fix rendering of details in README by @Michaelvll in #1179
- Improve code style of sampler by @hnyls2002 in #1168
- [Minor] Improve logging and rename the health check endpoint name by @merrymercy in #1180
- Fix broken penalty by @hnyls2002 in #1184
- Fix benchmark script by @Ying1123 in #1185
- [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. by @kcz358 in #1123
- feat: use gelu_tanh_and_mul by @zhyncs in #1193
- Cleanup readme, llava examples, usage examples and nccl init by @merrymercy in #1194
- Update README.md by @merrymercy in #1198
- [CI] Fix the problem of hf runner too slow by @Ying1123 in #1202
- [Fix] the issue of random order when input is a list by @Ying1123 in #1199
- Relax the assert in moe throughput test to fix the flaky CI by @merrymercy in #1207
- [Fix] Fixing the multi-images error for llava-onevision by @kcz358 in #1205
- Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model by @zhaochenyang20 in #1186
- [Minor] Improve the function organization in TokenizerManager & improve loggers by @merrymercy in #1208
- [Minor] Temporarily skip flaky test by @Ying1123 in #1209
- [CI] Fix the issue of unit test hanging by @Ying1123 in #1211
- Update CI workflows by @merrymercy in #1210
- Update CI runner docs by @merrymercy in #1213
- [Feature] Support fp8 e5m2 kv cache with flashinfer by @ispobock in #1204
- Update workflow files by @merrymercy in #1214
- improve the threshold and ports in tests by @wisclmy0611 in #1215
- [CI] Fix CI by @wisclmy0611 in #1217
- [Fix] Multi-images loading error by @kcz358 in #1218
- [Minor] improve CI and dependencies by @hnyls2002 in #1212
- [CI] Parallelize unit tests in CI by @wisclmy0611 in #1219
- Move sampler into CUDA graph by @hnyls2002 in #1201
- chore: bump v0.2.14 by @zhyncs in #1155
- [FEAT] JSON constrained support by @havetc in #1125
- Torch compile CI throughput test by @hnyls2002 in #1223
- [FEAT] Support batches cancel by @caiyueliang in #1222
- [Minor] add delete test and delete tmp file on ci server by @yichuan520030910320 in #1227
- [FIX] Wrong logger by @havetc in #1230
- feat: replace get_act_fn for gpt_bigcode by @zhyncs in #1231
- Fix readme by @ArtificialZeng in #1236
- Fix bench latency benchmark by @hnyls2002 in #1225
- [Minor] Add more type annotations by @merrymercy in #1237
- feat: support sm75 with FlashInfer v0.1.6 by @zhyncs in #1233
- Update README.md by @merrymercy in #1239
- hotfix: revert sampler CUDA Graph by @zhyncs in #1242
- Add sglang.bench_latency to CI by @merrymercy in #1243
- fix: increase max_new_tokens when testing generation models by @zhyncs in #1244
- feat: update GemmaRMSNorm by @zhyncs in #1232
- Fix llava on multi images by @merrymercy in #1247
- feat: replace GeluAndMul by @zhyncs in #1234
- fix: resolve qwen2 moe weight loader by @zhyncs in #1252
- chore: bump v0.2.14.post2 by @zhyncs in #1250
- make json_schema usable from gen by @qeternity in #1254
- fix data racing due to mutable reference using deepcopy by @xiezhq-hermann in #1255
- Sampler cudagraph by @hnyls2002 in #1253
- fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by @lxww302 in #1260
- Transpose mla weight offline by @ispobock in #1261
- EXAONE 3.0 Model Support by @Deepfocused in #1258
- Update README Support Exaone 3.0 by @Deepfocused in #1267
- Report median instead of mean in bench_latency.py by @merrymercy in #1269
- Allow more flexible assistant and system response by @BabyChouSr in #1256
- fix: resolve the fp8 bug introduced by vLLM 0.5.5 by @zhyncs in #1276
- [doc] fix quick start link by @ByronHsu in #1282
- Optimize the update flashinfer indices by @xiaobochen123 in #1262
- [CI] Add more multi-gpu tests by @merrymercy in #1280
- feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by @zhyncs in #1285
- [CI] merge all ci tests into one file by @merrymercy i...
Release v0.2.13
Highlights
- New Feature: Support window attention for Gemma-2 (#1056 #1090 #1112), enable chunked-prefill by default (#1040 #984), support all sampling penalties (#973)
- New Models: Support for the embedding model e5-mistral (#983 #987 #988 #997 #1014) with a comprehensive OpenAI-compatible API.
- Performance: Accelerated Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek V2 (#905).
- More CI Tests: Accuracy test (multiple benchmarks), unit test (APIs, model implementations), E2E test (high pressure test, performance test), MoE test
- Refactor and fix: More modular, better stability, use more kernels from flashinfer (#907)
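As an illustration of the new embedding support, the sketch below calls the OpenAI-compatible embeddings endpoint of a locally running server. The launch command, port, and model identifier are assumptions for illustration only.

```python
# A minimal sketch (assumptions noted): call the OpenAI-compatible embeddings
# endpoint of a server launched with the e5-mistral model, e.g.
#   python -m sglang.launch_server --model-path intfloat/e5-mistral-7b-instruct --port 30000
# The launch command, port, and model identifier are illustrative, not prescriptive.
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

result = client.embeddings.create(
    model="default",  # placeholder; the server embeds with the model it was launched with
    input=["chunked prefill", "window attention in Gemma-2"],
)
print(len(result.data), len(result.data[0].embedding))
```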
What's Changed
- fix: set env in runner by @zhyncs in #891
- docs: update setup runner by @zhyncs in #884
- misc: update cuda graph capture exception log by @zhyncs in #894
- chore: add multipart dep for fastapi by @zhyncs in #895
- [minor] fixed code formatting doc by @min-xu-et in #896
- Bump version to 0.2.9.post1 by @Ying1123 in #899
- Update the base image of the docker by @Ying1123 in #900
- Reorder CI unit tests. by @hnyls2002 in #908
- fixed an error handling in bench_latency.py by @min-xu-et in #904
- Add model accuracy test - step 1 by @Ying1123 in #866
- latency test enhancement - part 1 by @min-xu-et in #909
- Improve the structure of CI by @Ying1123 in #911
- fix: use e2e and unit test only for original repo or pr by @zhyncs in #912
- misc: add triton in check_env PACKAGE_LIST by @zhyncs in #914
- Support MLA for DeepSeek-V2 with Triton - step 1 by @ispobock in #905
- enhance latency test - part 2 by @min-xu-et in #915
- Make API Key OpenAI-compatible by @Ying1123 in #917
- Update hyperparameter_tuning.md by @Ying1123 in #918
- Fix CI && python3.8 compatible by @hnyls2002 in #920
- Support more OpenAI API test by @yichuan520030910320 in #916
- Bump version to 0.2.10 by @Ying1123 in #923
- latency test enhancement - final part by @min-xu-et in #921
- Test openai vision api by @Ying1123 in #925
- Test regex in vision api by @Ying1123 in #926
- Update README.md by @Ying1123 in #927
- Fix prompt len in parallel sampling by @yichuan520030910320 in #928
- docs: update README by @zhyncs in #935
- Remove leftover auth_token by @AidanCooper in #934
- Feat: add alternative choices selection methods by @AidanCooper in #835
- Fix union operator by @ispobock in #940
- Support multiple args options by @yichuan520030910320 in #941
- Fix stuck in `get_new_prefill_batch` by @hnyls2002 in #948
- Organize code (rename, movement) by @hnyls2002 in #953
- fix nsys cannot profile cuda kernel by @mpjlu in #957
- Add support for Batch API test by @yichuan520030910320 in #936
- Show more error messages for warmup errors by @Ying1123 in #932
- misc: update issue template by @zhyncs in #963
- misc: simplify test by @yichuan520030910320 in #964
- misc: add compute capability in check_env by @zhyncs in #965
- Make `req_pool_indices` on CPU by @hnyls2002 in #960
- misc: fix the req_to_token member change by @hnyls2002 in #967
- chore: update vllm to 0.5.4 by @zhyncs in #966
- chore: bump v0.2.11 by @zhyncs in #970
- Purge self-runner's pip cache weekly by @hnyls2002 in #975
- Run purge-cache only in sgl-project by @hnyls2002 in #976
- misc: correct the int data type for token ids and indices by @xiezhq-hermann in #969
- PrefillAdder abstraction by @hnyls2002 in #968
- RadixCache method adjust by @hnyls2002 in #977
- Adjust max prefix len by @hnyls2002 in #980
- #590 Increase default , track changes in examples and documentation by @foszto in #971
- [minor] Update type annotation in tokenizer_manager.py by @Ying1123 in #982
- Fix chunked prefill by @hnyls2002 in #984
- Add llama embedding modules [unreachable code] - step 1/3 by @Ying1123 in #983
- Add io struct for embedding models [unreachable code] - step 2/3 by @Ying1123 in #987
- Adjust `InputMetadata` and `ScheduleBatch` by @hnyls2002 in #981
- support more options about usage in stream mode by @yichuan520030910320 in #985
- Create contributor_guide.md by @Ying1123 in #992
- feat: frequency, min_new_tokens, presence, and repetition penalties by @vhain in #973
- Move torch.compile configs into cuda_graph_runner.py by @Ying1123 in #993
- Add e5-mistral embedding model - step 3/3 by @Ying1123 in #988
- test: negative value testing for frequency, presence penalizers by @vhain in #995
- support models from www.modelscope.cn by @liuyhwangyh in #994
- bugfix: penalizers to be merged before reqs by @vhain in #1001
- fix: resolve correctness_test issue by @zhyncs in #1002
- Minor bugfix on benchmark serving by @ywang96 in #1005
- Add openai embedding API by @Ying1123 in #997
- Add skip_tokenizer_init args. by @gryffindor-rr in #959
- Fix benchmark latency by @wisclmy0611 in #1007
- Some warnings to crash when CI by @hnyls2002 in #1009
- Reduce the overhead when cache is disabled by @hnyls2002 in #1010
- Support embedding input as a list by @Ying1123 in #1014
- misc: update test config by @zhyncs in #990
- fix: force max new tokens to be 1 for embedding request by @Ying1123 in #1019
- Clean up unit tests by @merrymercy in #1020
- Fix `input_ids` && rename to `fill_ids` by @hnyls2002 in #1021
- feat: use FlashInfer rmsnorm and silu by @zhyncs in #907
- misc: update issue template by @zhyncs in #1024
- Clean up readme and arguments of chunked prefill by @merrymercy in #1022
- Fix wrong assert by @hnyls2002 in #1028
- Improve type annotation by @merrymercy in #1029
- hotfix: add CustomOp abstraction by @zhyncs in #1027
- Fix the case where r.prefix_indices is None by @merrymercy in #1031
- Fix triton args init by @hnyls2002 in #1034
- Fix the case when max_new_tokens is too large by @merrymercy in #1025
- Test the case when max_new_tokens is very large by @merrymercy in #1038
- Fix the prefix indices by @hnyls2002 in #1037
- Improve end-to-end throughput test and its coverage by @merrymercy in #1039
- Delete the useless test/srt/test_throughput.py by @merrymercy in #1045
- minor: some potential bugs by @hnyls2002 in #1044
- Clean up the comments and names under python/sglang/srt/layers by @merrymercy in #1047
- fix...
Release v0.2.9
Highlights
- New feature: Chunked prefill (#800, #811)
- New models: Deepseek v2
- Performance improvement: vectorized logprob computation
- Accuracy fix: fix the double BOS problem in the chat template; move logits to float32; update flashinfer sampling kernels
- Feature fix: fixed many missing logprob-related features in the OpenAI API server
- CI/CD infra is now fully ready. The tests cover frontend, backend, accuracy, and performance tests.
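A minimal sketch of the logprob features touched in this release, using the OpenAI-compatible completions endpoint. The port and model name are placeholders, and whether `echo` is honored depends on the server version.

```python
# A minimal sketch: request token logprobs through the OpenAI-compatible
# completions endpoint. Port and model name are placeholders.
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="default",              # placeholder for the locally served model
    prompt="The capital of France is",
    max_tokens=4,
    logprobs=3,                   # top-3 alternatives per generated token
    echo=True,                    # also return logprobs for the prompt tokens, if supported
)
print(completion.choices[0].logprobs)
```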
What's Changed
- Deepseek v2 support by @hnyls2002 in #693
- Fix context length by @hnyls2002 in #757
- docs: update model support by @zhyncs in #760
- fix: not run workflows on fork repo by @zhyncs in #762
- Update supported models by @hnyls2002 in #763
- Fix TransformerTokenizer init for chatglm2 & 3 by @ispobock in #761
- [Minor] Improve the code style in TokenizerManager by @merrymercy in #767
- Update readme by @Ying1123 in #769
- feat: add fake tag by @zhyncs in #770
- Fix max_tokens for OpenAI chat completion API by @merrymercy in #766
- Fix max new tokens by @merrymercy in #772
- Move sampling logits to float32 by @merrymercy in #773
- minor refactor: move check server args to server_args.py by @wisclmy0611 in #774
- Fix return_log_probs with cuda graph by @merrymercy in #775
- Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs by @merrymercy in #776
- Allow disabling flashinfer sampling kernel by @merrymercy in #778
- Bump version to 0.2.6 by @merrymercy in #779
- fix: replace pillow with PIL in PACKAGE_LIST by @zhyncs in #781
- docs: init readthedocs support by @zhyncs in #783
- fix: init readthedocs support by @zhyncs in #784
- fix: exclude logo png in gitignore by @zhyncs in #785
- docs: update index by @zhyncs in #786
- Vectorize logprobs computation by @Ying1123 in #787
- docs: update README by @zhyncs in #788
- docs: make badges center by @zhyncs in #789
- chore: add copyright for srt by @zhyncs in #790
- Fix echo + lobprob for OpenAI API when the prompt is a list by @Ying1123 in #791
- Update README.md by @Ying1123 in #792
- Lazy-import third-party backends by @bgyoon in #794
- Fix lazy import location by @Ying1123 in #795
- Fix logging by @Ying1123 in #796
- Add role documentation, add system begin & end tokens by @objnf-dev in #793
- Chunked prefill support by @hnyls2002 in #797
- Revert "Chunked prefill support" by @Ying1123 in #799
- Chunked prefill by @hnyls2002 in #800
- fix: update flashinfer to 0.1.2 to fix sampling for cu118 by @zhyncs in #803
- Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" by @Ying1123 in #805
- feat: add chat template for internlm2-chat by @zhyncs in #802
- Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" by @Ying1123 in #806
- Add support for OpenAI API : offline batch(file) processing by @yichuan520030910320 in #699
- Organize public APIs by @hnyls2002 in #809
- Remove inf value for chunked prefill size by @hnyls2002 in #812
- Revert "Organize public APIs" by @Ying1123 in #815
- fix: use v0.2.5 for benchmark by @zhyncs in #814
- Fix LiteLLM kwargs by @qeternity in #817
- Code structure refactor by @hnyls2002 in #807
- docs: update README by @zhyncs in #819
- Fix streaming bug by @objnf-dev in #820
- feat: add runner by @zhyncs in #821
- feat: add pr e2e test by @zhyncs in #822
- Support disable_ignore_eos in bench_serving.py by @Ying1123 in #824
- Adjust default mem fraction to avoid OOM by @Ying1123 in #823
- Add awq_marlin by @Ying1123 in #826
- misc: update e2e test benchmark config by @zhyncs in #825
- misc: enable e2e test when push by @zhyncs in #828
- docs: add set up runner by @zhyncs in #829
- chore: bump v0.2.7 by @zhyncs in #830
- Add `--max-total-tokens` by @hnyls2002 in #840
- Fix List input bug by @yichuan520030910320 in #838
- Add req slots leaking check by @hnyls2002 in #842
- docs: update README.md by @eltociear in #843
- misc: update e2e test paths config by @zhyncs in #848
- chore: update flashinfer to v0.1.3 by @zhyncs in #850
- Fix llama for classification by @Ying1123 in #855
- Add troubleshooting doc by @Ying1123 in #856
- Fix #857 by @kaifronsdal in #858
- Add support for logprobs in OpenAI chat API by @yichuan520030910320 in #852
- Support chunked prefill when radix cache is disabled by @hnyls2002 in #811
- misc: update e2e test paths config by @zhyncs in #860
- Rename github workflows by @Ying1123 in #861
- misc: disable auto release by @zhyncs in #862
- misc: add cancel previous at e2e by @zhyncs in #864
- Add OpenAI backend to the CI test by @Ying1123 in #869
- Fix openai CI tests by @Ying1123 in #870
- misc: use pip cache purge and add unit test ci by @zhyncs in #871
- misc: update unit test config by @zhyncs in #873
- Fix unit tests for the frontend language part by @Ying1123 in #872
- bump to 0.2.8 by @Ying1123 in #877
- Make scripts under `/test/srt` as unit tests by @Ying1123 in #875
- Update runner docs by @hnyls2002 in #876
- Improve the coverage of the openai api server test by @Ying1123 in #878
- Implement served_model_name to customize model id when use local mode… by @dionren in #749
- Update runner docs by @hnyls2002 in #879
- Add more unit tests to CI by @Ying1123 in #880
- Add accuracy test to CI: MMLU by @Ying1123 in #882
- Update workflow name by @Ying1123 in #883
- Fix the double BOS problem in the HF chat template by @Ying1123 in #888
- Add benchmark: HumanEval by @Ying1123 in #889
- Increase openai client limit by @Ying1123 in #886
- Bump version to v0.2.9 by @Ying1123 in #890
New Contributors
- @bgyoon made their first contribution in #794
- @objnf-dev made their first contribution in #793
- @kaifronsdal made their first contribution in #858
- @dionren made their first contribution in #749
Full Changelog: v0.2.5...v0.2.9
Release v0.2.5
Highlights
- We recently released a blog post. Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. SGLang consistently outperforms vLLM, achieving up to 3.1x higher throughput on Llama-70B, and it often matches or sometimes outperforms TensorRT-LLM.
- We have now automated the release processes for PyPI, Docker, and GitHub releases using GitHub workflows. Previously, because releases were not automated, GitHub tags were not updated in time, leading to a jump from v0.2.0 directly to v0.2.5.
- We welcome everyone to try https://github.com/sgl-project/sglang and to participate actively in the community, including but not limited to issues, PRs, and discussions. Cheers!
Release v0.2.0
Highlights
- We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest blog.
- New models: Llama3 405B, Deepseek MoE, InternLM, GPTBigCode, Mistral-Nemo
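For readers who want to reproduce the performance comparison, the sketch below runs the serving benchmark against an already-launched server. The module name `sglang.bench_serving` and its flags are assumptions based on the benchmark scripts referenced in this changelog; check the repository's `benchmark/` directory or the blog for the exact entry point.

```python
# A minimal sketch (module name and flags are assumptions): run the serving
# benchmark against an already running SGLang server.
import subprocess

subprocess.run(
    [
        "python", "-m", "sglang.bench_serving",
        "--backend", "sglang",      # assumed flag: which serving backend to benchmark
        "--num-prompts", "1000",    # assumed flag: number of sampled requests
    ],
    check=True,
)
```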
What's Changed
- Optimize mem indices mangement by @hnyls2002 in #619
- Unify index operations by @hnyls2002 in #620
- Simplify mem state by @wisclmy0611 in #623
- Improve tensor parallel performance by @Ying1123 in #625
- Bump version to 0.1.21 by @Ying1123 in #626
- Fix model forward grad by @hnyls2002 in #628
- Update docker file by @Ying1123 in #629
- Disable NCCL_NVLS by default by @Ying1123 in #631
- Add qwen2 tie word embedding by @yileld in #630
- Add support for VertexAI safety settings by @AidanCooper in #624
- Fix vertexai by @hnyls2002 in #633
- Reduce docker size by @hnyls2002 in #632
- clean up step function by @Ying1123 in #635
- feat: support internlm2 by @zhyncs in #636
- misc: add pre-commit config by @zhyncs in #637
- misc: add issue and pr template by @zhyncs in #638
- Flashinfer sample kernel by @hnyls2002 in #617
- Move `global_server_args_dict` by @hnyls2002 in #642
- Increase the capacity of the memory pool by @Ying1123 in #643
- feat: add check_env by @zhyncs in #645
- Remove the dependency of rpyc by @wisclmy0611 in #646
- misc: rm rpyc from PACKAGE_LIST by @zhyncs in #649
- fix: set ulimit -n 65535 by @zhyncs in #647
- feat: add lint workflow by @zhyncs in #648
- fix: resolve lint error by @zhyncs in #650
- Remove useless variables in infer_batch.py by @Ying1123 in #651
- Detokenize incrementally when streaming by @hnyls2002 in #653
- `TokenizerManager.context_len` should inherit from `server_args.conte… by @shrirajh in #654
- Remove cached triton launcher by @merrymercy in #656
- perf: reduce ttft and itl with stream_interval 1 by @zhyncs in #658
- feat: add benchmark serving by @zhyncs in #657
- refactor model loader [unreachable code]: initial refactor by @Ying1123 in #655
- misc: update SGLang package description by @zhyncs in #659
- Update Readme by @Ying1123 in #660
- feat: update check env by @zhyncs in #661
- Improve docs by @Ying1123 in #662
- Add benchmark instructions by @Ying1123 in #663
- Fix jump forward when streaming by @hnyls2002 in #665
- Fix kill process util by @ispobock in #666
- Add support for OpenAI API parallel sampling by @yichuan520030910320 in #640
- Update OpenAI API by @wisclmy0611 in #667
- Temporary fix invalid sample results by @hnyls2002 in #668
- Support random dataset in bench_serving.py by @merrymercy in #669
- Revert "Temporary fix invalid sample results" by @hnyls2002 in #673
- refactor model loader: initial refactor by @Ying1123 in #664
- Fix cuda graph with flashinfer by @merrymercy in #675
- Tmp fix illegal sample by @hnyls2002 in #676
- Update version to 0.1.22 by @Ying1123 in #677
- Fallback when sampling failed by @ispobock in #678
- feat: support TRT LLM benchmark and multiple benchmarks by @zhyncs in #670
- Decouple kv by @hnyls2002 in #679
- Support gpt-bigcode model class by @hnyls2002 in #681
- support non-streaming benchmark by @merrymercy in #682
- Fix StreamExecutor.fork() losing the current role start index. by @max99x in #684
- feat: update bench serving by @zhyncs in #685
- misc: update output file logic by @zhyncs in #686
- Allow disabling streaming in bench by @merrymercy in #687
- docs: update README by @zhyncs in #688
- Support Deepseek MoE Model by @hnyls2002 in #689
- misc: recommend to use chat model for benchmark by @zhyncs in #690
- Support Mistral-Nemo by @ispobock in #691
- docs: update README by @zhyncs in #692
- fix: update bench serving by @zhyncs in #694
- misc: update output token logic by @zhyncs in #695
- Tune params by @Ying1123 in #696
- Fix trt benchmark by @Ying1123 in #697
- misc: fix typo by @zhyncs in #698
- Fix flashinfer by @Ying1123 in #700
- Fix hf config loading by @ispobock in #702
- Use min new token ratio at start by @hnyls2002 in #701
- feat: add e2e latency by @zhyncs in #704
- Update vllm version to support llama3.1 by @Ying1123 in #705
- bump version to 0.1.23 by @Ying1123 in #706
- Reduce hardcoded logic of kernel usage by @wisclmy0611 in #707
- Fix multi-node deadlock by @merrymercy in #709
- Auto adjust new ratio by @hnyls2002 in #708
- Fix prefill size by @Ying1123 in #711
- docs: update README by @zhyncs in #712
- docs: update doc by @zhyncs in #713
- fix: llama 3.1 405b fp8 by @zhyncs in #714
- misc: update doc by @zhyncs in #715
- Improve benchmark scripts by @Ying1123 in #717
- Bump version to 0.1.24 by @Ying1123 in #718
- docs: update supported models by @zhyncs in #719
- docs: update comment by @zhyncs in #721
- chore: add close inactive issues workflow by @zhyncs in #722
- misc: update build instruction by @zhyncs in #724
- fix: fp8 config by @Ying1123 in #723
- Fix dockerfile and triton cache manager by @hnyls2002 in #720
- chore: bump v0.1.25 by @zhyncs in #725
- fix: resolve the logo display issue on the PyPI page by @zhyncs in #726
- misc: update bug issue template by @zhyncs in #727
- Revert "fix: fp8 config" by @Ying1123 in #728
- Fix bugs (fp8 checkpoints, triton cache manager) by @Ying1123 in #729
- Bump version to 0.2.0 by @Ying1123 in #730
New Contributors
- @yileld made their first contribution in #630
- @AidanCooper made their first contribution in #624
- @zhyncs made their first contribution in #636
- @shrirajh made their first contribution in #654
- @yichuan520030910320 made their first contribution in https://github.com/...
Release v0.1.20
Highlights
- Enable CUDA graph by default. It brings 1.5x - 2x speedup for small batch size decoding (#612)
- Model support: Gemma2, minicpm, Qwen2 MoE
- Docker support (#217)
- Various latency optimizations
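A minimal sketch of how the CUDA-graph speedup could be measured with the offline latency benchmark. All flag names here are assumptions; verify them with `python -m sglang.bench_latency --help` on your installed version.

```python
# A minimal sketch (all flag names are assumptions): compare small-batch decoding
# latency with CUDA graphs on (the new default) and off.
import subprocess

base = [
    "python", "-m", "sglang.bench_latency",
    "--model-path", "meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    "--batch-size", "1",
    "--input-len", "128",
    "--output-len", "256",
]

subprocess.run(base, check=True)                             # CUDA graph enabled by default
subprocess.run(base + ["--disable-cuda-graph"], check=True)  # assumed flag to turn it off
```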
What's Changed
- Add docker file by @Ying1123 in #588
- Add Gemma2 by @Ying1123 in #592
- Format by @Ying1123 in #593
- Fix Llava model by @wisclmy0611 in #594
- Add `--enable-p2p-check` option by @hnyls2002 in #599
- Fix streaming by @hnyls2002 in #600
- Reduce number of workspaces for flashinfer by @wisclmy0611 in #601
- add `LogitsMetadata` by @hnyls2002 in #604
- add minicpm support by @Titan-p in #602
- Make sglang compat with vllm 0.5.1 by @M0gician in #598
- Add Qwen2 MoE support by @M0gician in #603
- Update chat template for qwen and yi-1.5. by @for-just-we in #530
- [Feat] Expose logprob options to `sgl.gen` API by @huyiwen in #503
- Fix bench latency by @merrymercy in #607
- Code clean up: Remove deprecated prefill move InputMetadata to infer_batch.py by @merrymercy in #609
- Clean up the usage of flashinfer by @merrymercy in #610
- Cleanup attention backend: flashinfer and triton by @merrymercy in #611
- Enable cuda graph by default by @merrymercy in #612
- Improve benchmark scripts & fix llava by @merrymercy in #613
- Memorypool chunked prefetch by @hnyls2002 in #614
- Improve benchmark scripts by @merrymercy in #615
- Fix memory pool index error by @Ying1123 in #616
- Bump version to 0.1.20 by @merrymercy in #618
New Contributors
- @wisclmy0611 made their first contribution in #594
- @Titan-p made their first contribution in #586
- @M0gician made their first contribution in #598
- @for-just-we made their first contribution in #530
Full Changelog: v0.1.18...v0.1.20
Release v0.1.18
Highlights
- 2x large batch prefill improvement with the new flashinfer kernels #579
- Multi-node tensor parallelism #550
- New model support: ChatGLM #516
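A hedged sketch of launching one node of a two-node tensor-parallel deployment. Every flag name below is an assumption for illustration; consult the server-argument documentation of this release for the options it actually supports.

```python
# A hedged sketch: launch one node of a two-node tensor-parallel deployment.
# All flags below are assumptions, not confirmed options of this release.
import subprocess

subprocess.run(
    [
        "python", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model
        "--tp-size", "16",                     # assumed: total tensor-parallel degree across nodes
        "--nnodes", "2",                       # assumed: number of participating nodes
        "--node-rank", "0",                    # assumed: this node's rank (run rank 1 on the other host)
        "--dist-init-addr", "10.0.0.1:50000",  # assumed: rendezvous address of node rank 0
    ],
    check=True,
)
```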
What's Changed
- Fix missing numpy dependency in pyproject.toml by @fpreiss in #524
- Fix RAG nb, parea setup (parea -> parea-ai) by @fpreiss in #525
- [Minor] Correct Optional type hints in api by @fpreiss in #526
- Add ChatGLM Model Support by @Qubitium in #516
- Fix Regression: Disable p2p for 4090 by @ZX-ModelCloud in #531
- Decode Incrementally by @hnyls2002 in #517
- Fix dependency by @merrymercy in #538
- Fix dependency & crash issues by @Ying1123 in #539
- Higher priority for user input of max_prefill_tokens & format by @Ying1123 in #540
- Add disk cache for loading ShareGPT dataset. by @hnyls2002 in #542
- Fix tp worker only checking req[0] for stream by @Qubitium in #546
- Fix the Jump-Forward with Chinese by @hnyls2002 in #551
- Update fused_moe by @merrymercy in #553
- Multi-node Tensor Parallelism by @Ying1123 in #550
- Update flashinfer to 0.0.5 by @merrymercy in #554
- Follow-up fixes for flashinfer 0.0.5 by @merrymercy in #556
- Fix latency benchmark by @hnyls2002 in #557
- Clean up logits processor by @merrymercy in #558
- Update test_flashinfer by @hnyls2002 in #560
- Allow running with vllm==0.4.3 by @merrymercy in #561
- Add a new arguments log_level_http to control the HTTP logging by @merrymercy in #563
- Add sglang.bench_latency for offline benchmark by @merrymercy in #564
- Warmup cublas by @merrymercy in #566
- Increase the number of thread limitation for tp worker managers. by @merrymercy in #567
- Update readme by @merrymercy in #568
- Expose dtype argument by @merrymercy in #569
- Update benchmark script by @Ying1123 in #571
- Minor fix in compiler & format by @ZackZeng999 in #545
- Update run_batch interface and max_prefill_tokens by @Ying1123 in #574
- Fix flashinfer version by @PanJason in #576
- [BugFix] gemma loading weights "lm_head.weight" key error by @dhgarcia in #577
- Turn on flashinfer by default by @Ying1123 in #578
- fix the broken server args by @hnyls2002 in #585
- 2x performance improvement for large prefill & Fix workspace conflicts by @Ying1123 in #579
New Contributors
- @fpreiss made their first contribution in #524
- @ZackZeng999 made their first contribution in #545
- @PanJason made their first contribution in #576
- @dhgarcia made their first contribution in #577
Full Changelog: v0.1.17...v0.1.18
Release v0.1.17
Highlights
- Add data parallelism #480
- Add speculative execution for OpenAI API #250
- Update vllm to v0.4.3 for new quantization features #511
- Better error handling (#457, #449, #514)
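A minimal sketch of the frontend pattern that the OpenAI speculative-execution feature targets. The `num_api_spec_tokens` knob is an assumption based on the rename noted in the changelog below; the rest follows the documented `sgl` API.

```python
# A minimal sketch of the frontend pattern that OpenAI speculative execution
# targets. The num_api_spec_tokens keyword is an assumption; the rest is the
# documented sgl frontend API.
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

# Speculative execution would be configured on the backend (assumed keyword):
# sgl.OpenAI("gpt-3.5-turbo", num_api_spec_tokens=64)
sgl.set_default_backend(sgl.OpenAI("gpt-3.5-turbo"))

state = qa.run(question="What is data parallelism?")
print(state["answer"])
```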
What's Changed
- [Feat] Add llava qwen, llava mistral by @kcz358 in #419
- Format code by @hnyls2002 in #441
- Add finish_reason to OpenAI API by @mgerstgrasser in #446
- Simplify port allocation by @merrymercy in #447
- Add PUT for generate api by @Ying1123 in #448
- Improve error handling & abort disconnected requests by @merrymercy in #449
- Fix the broken `--disable-radix-cache` by @hnyls2002 in #451
- openai chat speculative execution by @ChuyueSun in #250
- Fix openai speculative execution by @Ying1123 in #456
- Abort disconnected requests by @merrymercy in #457
- Rename api_num_spec_tokens -> num_api_spec_tokens by @merrymercy in #458
- Use model loader from vllm by @merrymercy in #459
- port fp8 mixtral by @merrymercy in #460
- fix test bug in srt_llava_next_test.py by @bingwork in #470
- Add the instruction link to the LLaVA-NeXT-Video at README by @ZhangYuanhan-AI in #463
- Improve logging & add logit cap by @merrymercy in #471
- Optimize retract by @hnyls2002 in #440
- Add benchmark scripts by @Ying1123 in #476
- [Feat/Fix] Refactoring Llava models into single file by @Luodian in #475
- Improve benchmark scripts & rename some scripts by @merrymercy in #477
- Improve benchmark scripts & add more models by @merrymercy in #484
- Support data parallelism (static) by @Ying1123 in #480
- Make the server random by default by @merrymercy in #488
- Revert "Make the server random by default" by @Ying1123 in #492
- update the script: examples/usage/llava_video/srt_example_llava_v.sh by @ZhangYuanhan-AI in #491
- Make the server random by default by @merrymercy in #493
- Update vllm to v0.4.3 by @merrymercy in #511
- remove redundant pad_input_ids function by @amosyou in #500
- Litellm Backend by @huyiwen in #502
- Fix rid state map leak + Refractor .finished by @Qubitium in #505
- Crash the server when error or OOM happens by @merrymercy in #514
- Update version to 0.1.17 by @merrymercy in #515
New Contributors
- @kcz358 made their first contribution in #419
- @mgerstgrasser made their first contribution in #446
- @bingwork made their first contribution in #470
- @amosyou made their first contribution in #500
- @huyiwen made their first contribution in #502
Full Changelog: v0.1.16...v0.1.17
v0.1.16
Highlights
- Support more models: DBRX, Command-R, Gemma
- Support llava-video (#423, https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)
- Cache performance improvements (#418, #364)
- Marlin quantization kernels
- Many bug fixes
- Update dependencies to be compatible with their latest versions
What's Changed
- Fix Runtime missing some ServerArgs options by @Qubitium in #281
- adding the triton docker build minimal example by @amirarsalan90 in #242
- Fix flashinfer >= 0.0.3 compat by @Qubitium in #282
- Fix Incorrect CURL Request Example in README by @amirarsalan90 in #287
- enable marlin kernels by @qeternity in #286
- Fix env (docker) compat due to file usage by @Qubitium in #288
- Fix marlin model loading compat with autogptq by @Liurl21 in #290
- Fix outlines-0.0.35 incompatibility by @ZhouGongZaiShi in #291
- [Fix/Potential Bugs] Can not correctly import models in python/sglang/srt/models by @Luodian in #311
- Use Anthropic messages API by @janimo in #304
- Add StableLM model. by @janimo in #301
- Support oai in benchmark/mmlu by @merrymercy in #323
- Update version to v0.1.14 by @merrymercy in #324
- Cleanup codebase: removed unnecessary code/logic by @Qubitium in #298
- Update dependencies by @janimo in #326
- Openrouter usage example by @janimo in #327
- `model_rpc` style improvement by @hnyls2002 in #293
- `model_runner` simplify by @hnyls2002 in #329
- Logprobs Refactor by @hnyls2002 in #331
- `DBRX` support by @hnyls2002 in #337
- Add support for new autogptq quant_config.checkpoint_format by @Qubitium in #332
- Fix llava parallelism/fork bug by @lockon-n in #315
- Eliminate 2 gpu ops during sampling when logit_bias is zero by @hnyls2002 in #343
- Revert "Eliminate 2 gpu ops during sampling when logit_bias is zero" by @hnyls2002 in #345
- Eliminate 2 gpu ops during sampling when logit_bias is zero by @Qubitium in #338
- Add timeout to get_meta_info by @SimoneRaponi in #346
- Fix typos in infer_batch.py by @tom-doerr in #354
- Time cost utils by @hnyls2002 in #355
- Update README.md by @eltociear in #358
- support `command-r` by @ZhouXingg in #369
- Fix issue #367 – System message not supported for Anthropic (anthropic.BadRequestError) by @fronx in #368
- Update model support in readme by @Ying1123 in #370
- Optimize radix tree matching by @ispobock in #364
- Reduce overhead when `fork(1)` by @hnyls2002 in #375
- llama3 instruct template by @qeternity in #372
- add `.isort.cfg` by @hnyls2002 in #378
- Revert removing the unused imports by @hnyls2002 in #385
- Benchmark Updates by @hnyls2002 in #382
- Improve performance when running with full parallel by @hnyls2002 in #394
- Minor: style improvement of radix_cache and memory_pool by @hnyls2002 in #395
- Format Benchmark Code by @hnyls2002 in #399
- Fix chatml template by @merrymercy in #406
- Adding RAG tracing & eval cookbook using Parea by @joschkabraun in #390
- SamplingParams add "spaces_between_special_tokens" argument by @ZhouXingg in #392
- Organize Benchmark by @hnyls2002 in #381
- Add Cohere Command R chat template by @noah-kim-theori in #411
- Fix `sync()` when `fork(1)` by @hnyls2002 in #412
- Include finish reason in meta info response by @qeternity in #415
- Make public APIs more standard. by @hnyls2002 in #416
- Compat with latest VLLM 0.4.2 main + fork.number rename + Flashinfer 0.0.4 by @Qubitium in #380
- Optimize the memory usage of logits processor by @merrymercy in #420
- Clean up by @merrymercy in #422
- Fix logit processor bugs by @merrymercy in #427
- Minor fix for the import path by @merrymercy in #428
- Move openai api server into a separate file by @merrymercy in #429
- Fix flashinfer by @merrymercy in #430
- Update version to 0.1.15 by @merrymercy in #431
- Misc fixes by @merrymercy in #432
- Allow `input_ids` in the input of the `/generate` endpoint by @lolipopshock in #363
- Improve error handling by @merrymercy in #433
- Cache optimizations by @hnyls2002 in #418
- Update readme by @merrymercy in #434
- Raise errors for prompts that are too long by @merrymercy in #436
- support llava video by @ZhangYuanhan-AI in #426
- Fix streaming by @merrymercy in #437
- Update version to 0.1.16 by @merrymercy in #438
New Contributors
- @Qubitium made their first contribution in #281
- @amirarsalan90 made their first contribution in #242
- @Liurl21 made their first contribution in #290
- @ZhouGongZaiShi made their first contribution in #291
- @Luodian made their first contribution in #311
- @janimo made their first contribution in #304
- @lockon-n made their first contribution in #315
- @SimoneRaponi made their first contribution in #346
- @tom-doerr made their first contribution in #354
- @ZhouXingg made their first contribution in #369
- @fronx made their first contribution in #368
- @ispobock made their first contribution in #364
- @joschkabraun made their first contribution in #390
- @noah-kim-theori made their first contribution in #411
- @lolipopshock made their first contribution in #363
- @ZhangYuanhan-AI made their first contribution in #426
Full Changelog: v0.1.13...v0.1.16
Release v0.1.13
Highlights
- Gemma Support by @hnyls2002 in #256
- Add Together and AzureOpenAI examples by @merrymercy in #184
What's Changed
- correct a mistake on the README.md by @yaya-sy in #182
- correct reference dtype openai.py by @yaya-sy in #181
- Add Together and AzureOpenAI examples by @merrymercy in #184
- Fix server launch for jupyter notebook by @merrymercy in #186
- Refactor decoding logprob and add completion_tokens_wo_jump_forward by @comaniac in #189
- Pin outlines version by @comaniac in #196
- Adjust outlines version. by @hnyls2002 in #200
- Update README.md by @eltociear in #207
- Added the ability to Modify the Context Length by @psych0v0yager in #210
- Fix logprobs with logprob_start_len by @comaniac in #193
- Support outlines > 0.0.31 by @comaniac in #219
- Fix stop str merging by @hnyls2002 in #225
- Fix interpreter.py `get_var(var_name)` in text iter when `stream` is not enabled by @exceedzhang in #198
- fix chatml template by @qeternity in #195
- Upload `agent_calls.jsonl` download link by @hnyls2002 in #226
- Fix addr reuse in check_port by @hnyls2002 in #253
- Add SSL Cert Functionality by @nivibilla in #224
- Refactor ChatTemplate for Enhanced Clarity and Efficiency by @cubxxw in #201
- Add `set_var` to interpreter.py by @1024th in #263
- Add logo by @merrymercy in #275
- Fix qwen config by @hnyls2002 in #261
- replace skip_embed with input_embeds by @TideDra in #222
- Gemma Support by @hnyls2002 in #256
- Improve gemma and documentations by @merrymercy in #278
- Organize `server_args` by @hnyls2002 in #277
- Add Support for API Key Authentication by @alessiodallapiazza in #230
- Fix RuntimeEndpoint by @merrymercy in #279
- Update version to v0.1.13 by @merrymercy in #280
New Contributors
- @psych0v0yager made their first contribution in #210
- @exceedzhang made their first contribution in #198
- @qeternity made their first contribution in #195
- @cubxxw made their first contribution in #201
- @1024th made their first contribution in #263
- @TideDra made their first contribution in #222
- @alessiodallapiazza made their first contribution in #230
Full Changelog: v0.1.12...v0.1.13