
LMDeploy Release v0.6.0

Released by @lvhan028 on 13 Sep 03:12

Highlights

  • Optimize W4A16 quantized model inference by implementing GEMM in TurboMind Engine (an inference sketch follows the Before/After example below)
    • Add GPTQ-INT4 inference
    • Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
  • Refactor PytorchEngine
    • Employ CUDA graphs to boost inference performance by about 30%
    • Support more models on the Huawei Ascend platform
  • Upgrade GenerationConfig (a usage sketch also follows below)
    • Support min_p sampling
    • Make do_sample=False the default option
    • Remove EngineGenerationConfig and merge it into GenerationConfig
  • Support guided decoding
  • Distinguish between the name of the deployed model and the name of the model's chat template
    Before:
lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name customized_chat_template.json

After:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name "the served model name" \
    --chat-template customized_chat_template.json
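
As a companion to the W4A16 highlight above, here is a minimal inference sketch. The model path is a placeholder for your own GPTQ-INT4 checkpoint, and it assumes model_format='gptq' selects the new GPTQ-INT4 path (use 'awq' for AWQ W4A16 models):

from lmdeploy import pipeline, TurbomindEngineConfig

# Placeholder path: point it at your GPTQ-INT4 checkpoint.
engine_config = TurbomindEngineConfig(model_format='gptq')
pipe = pipeline('/the/path/of/your/gptq-int4/model',
                backend_config=engine_config)
print(pipe(['Hello, please introduce yourself']))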

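The upgraded GenerationConfig can be exercised as follows; a minimal sketch using the pipeline API, showing the new min_p field and the do_sample flag that now defaults to False:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('/the/path/of/your/awesome/model')
# do_sample defaults to False (greedy decoding); enable it explicitly
# to activate the samplers, including the newly supported min_p.
gen_config = GenerationConfig(do_sample=True,
                              temperature=0.8,
                              top_p=0.95,
                              min_p=0.05)
print(pipe(['Hello, please introduce yourself'], gen_config=gen_config))
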
Breaking Changes

  • TurboMind model converter. Please re-convert the models if you use this feature
  • EngineGenerationConfig is removed. Please use GenerationConfig instead (see the migration sketch after this list)
  • Chat template. Please use --chat-template to specify it instead of --model-name
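
A minimal migration sketch for the EngineGenerationConfig removal, assuming your code previously imported the class from lmdeploy.messages:

# Before (<= v0.5.3):
# from lmdeploy.messages import EngineGenerationConfig
# gen_config = EngineGenerationConfig(max_new_tokens=256, top_p=0.9)

# After (v0.6.0): engine-level options are merged into GenerationConfig.
from lmdeploy import GenerationConfig
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.9)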

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • enable running VLM with the pytorch engine in gradio by @RunningLeon in #2256
  • fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
  • Fix internvl2 template and update docs by @irexyc in #2292
  • fix the issue of missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
  • Fix the way to get "quantization_config" from the model's configuration by @lvhan028 in #2325
  • fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
  • Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
  • Fix the logic of updating engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362
  • fix cache position for pytorch engine by @RunningLeon in #2388
  • Fix wrong batch order in /v1/completions by @AllentDan in #2395
  • Fix some issues encountered by modelscope and community by @irexyc in #2428
  • fix llama3 rotary in pytorch engine by @grimoire in #2444
  • fix tensors on different devices when deploying MiniCPM-V-2_6 with tensor parallelism by @irexyc in #2454
  • fix MultinomialSampling operator builder by @grimoire in #2460
  • Fix initialization of runtime_min_p by @irexyc in #2461
  • fix Windows compile error by @zhyncs in #2303
  • fix: follow up #2303 by @zhyncs in #2307

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.3...v0.6.0