
LMDeploy Release v0.6.0

Released by @lvhan028 on 13 Sep 03:12

Highlights

  • Optimize W4A16 quantized model inference by implementing GEMM in TurboMind Engine (an inference sketch follows the Before/After example below)
    • Add GPTQ-INT4 inference
    • Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
  • Refactor PytorchEngine
    • Employ CUDA graphs to boost inference performance by about 30%
    • Support more models on the Huawei Ascend platform
  • Upgrade GenerationConfig (a usage sketch also follows below)
    • Support min_p sampling
    • Make do_sample=False the default option
    • Remove EngineGenerationConfig and merge it into GenerationConfig
  • Support guided decoding
  • Distinguish between the name of the deployed model and the name of the model's chat template
    Before:
lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name customized_chat_template.json

After:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name "the served model name" \
    --chat-template customized_chat_template.json
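
As a companion to the W4A16 highlight above, here is a minimal inference sketch. The model path is a placeholder for your own GPTQ-INT4 checkpoint, and it assumes model_format='gptq' selects the new GPTQ-INT4 path (use 'awq' for AWQ W4A16 models):

from lmdeploy import pipeline, TurbomindEngineConfig

# Placeholder path: point it at your GPTQ-INT4 checkpoint.
engine_config = TurbomindEngineConfig(model_format='gptq')
pipe = pipeline('/the/path/of/your/gptq-int4/model',
                backend_config=engine_config)
print(pipe(['Hello, please introduce yourself']))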

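The upgraded GenerationConfig can be exercised as follows; a minimal sketch using the pipeline API, showing the new min_p field and the do_sample flag that now defaults to False:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('/the/path/of/your/awesome/model')
# do_sample defaults to False (greedy decoding); enable it explicitly
# to activate the samplers, including the newly supported min_p.
gen_config = GenerationConfig(do_sample=True,
                              temperature=0.8,
                              top_p=0.95,
                              min_p=0.05)
print(pipe(['Hello, please introduce yourself'], gen_config=gen_config))
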
Breaking Changes

  • TurboMind model converter. Please re-convert the models if you use this feature
  • EngineGenerationConfig is removed. Please use GenerationConfig instead (see the migration sketch after this list)
  • Chat template. Please use --chat-template to specify it instead of --model-name
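
A minimal migration sketch for the EngineGenerationConfig removal, assuming your code previously imported the class from lmdeploy.messages:

# Before (<= v0.5.3):
# from lmdeploy.messages import EngineGenerationConfig
# gen_config = EngineGenerationConfig(max_new_tokens=256, top_p=0.9)

# After (v0.6.0): engine-level options are merged into GenerationConfig.
from lmdeploy import GenerationConfig
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.9)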

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • enable running VLM with the pytorch engine in gradio by @RunningLeon in #2256
  • fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
  • Fix internvl2 template and update docs by @irexyc in #2292
  • fix the issue of missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
  • Fix the way to get "quantization_config" from the model's configuration by @lvhan028 in #2325
  • fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
  • Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
  • Fix the logic of updating engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362
  • fix cache position for pytorch engine by @RunningLeon in #2388
  • Fix wrong batch order in /v1/completions by @AllentDan in #2395
  • Fix some issues encountered by modelscope and community by @irexyc in #2428
  • fix llama3 rotary in pytorch engine by @grimoire in #2444
  • fix tensors on different devices when deploying MiniCPM-V-2_6 with tensor parallelism by @irexyc in #2454
  • fix MultinomialSampling operator builder by @grimoire in #2460
  • Fix initialization of runtime_min_p by @irexyc in #2461
  • fix Windows compile error by @zhyncs in #2303
  • fix: follow up #2303 by @zhyncs in #2307

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.3...v0.6.0