Skip to content

Release v0.3.0

Latest
Compare
Choose a tag to compare
@Ying1123 Ying1123 released this 19 Sep 10:09
· 110 commits to main since this release
5ab9418

Highlights

Checkout the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ to find detailed instructions and descriptions for the items below.

  • Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
  • Up to 1.5x lower latency with torch.compile on small batch sizes
  • Support for interleaved text and multi-image/video in LLaVA-OneVision
  • Support for interleaved window attention and 2x longer context length in Gemma-2
  • Chunked prefill is turned on by default (You can choose separate or mix prefill and decode).
  • Add multi-GPU accuracy, performance test, and nightly accuracy test for more models.

What's Changed

New Contributors

Full Changelog: v0.2.13...v0.3.0