Releases: InternLM/lmdeploy

LMDeploy Release V0.3.0

03 Apr 01:55
4822fba

Highlight

  • Refactor attention and optimize GQA (#1258, #1307, #1116), achieving 22+ RPS for internlm2-7b and 16+ RPS for internlm2-20b, about 1.8x faster than vLLM
  • Support new models, including Qwen1.5-MoE (#1372), DBRX (#1367), and DeepSeek-VL (#1335), all usable through the pipeline API as sketched below
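
The newly supported models can be served through the existing pipeline API. Below is a minimal sketch; the Hugging Face model id is an assumption chosen for illustration, and any of the models listed above can be passed the same way.

from lmdeploy import pipeline

# Hypothetical repo id for one of the newly supported model families;
# substitute the model you actually want to deploy.
pipe = pipeline('Qwen/Qwen1.5-MoE-A2.7B-Chat')
response = pipe('hi, please introduce yourself')
print(response)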

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

Full Changelog: v0.2.6...v0.3.0

LMDeploy Release V0.2.6

19 Mar 02:43
b69e717

Highlight

Support the vision-language model (VLM) inference pipeline and serving.
Currently, the following models are supported: Qwen-VL-Chat, the LLaVA series (v1.5 and v1.6), and Yi-VL.

  • VLM Inference Pipeline
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Build a VLM pipeline from a Hugging Face model id
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

# A (prompt, image) tuple forms a single multimodal request
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)

Please refer to the detailed guide in the LMDeploy documentation.

  • VLM serving with the OpenAI-compatible server (a client sketch follows this list)
lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 8000
  • VLM serving with Gradio
lmdeploy serve gradio liuhaotian/llava-v1.6-vicuna-7b --server-port 6006
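
Once the api_server above is running, it can be queried with any OpenAI-compatible client. Below is a minimal sketch using the openai Python package; the localhost URL, placeholder API key, and vision-style message format are assumptions for illustration, not part of this release note.

from openai import OpenAI

# Point the client at the locally started api_server (assumed port 8000)
client = OpenAI(api_key='none', base_url='http://localhost:8000/v1')

# Ask the server which model it is serving
model_name = client.models.list().data[0].id

# Send a text prompt together with an image URL in the OpenAI vision format
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
)
print(response.choices[0].message.content)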

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.2.5...v0.2.6

LMDeploy Release V0.2.5

05 Mar 08:39
c5f4014

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.2.4...v0.2.5

LMDeploy Release V0.2.4

22 Feb 03:44
24ea5dc

What's Changed

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

Full Changelog: v0.2.3...v0.2.4

LMDeploy Release V0.2.3

06 Feb 06:14
2831dc2

What's Changed

🚀 Features

💥 Improvements

  • Remove caching tokenizer.json by @grimoire in #1074
  • Refactor get_logger to remove the dependency of MMLogger from mmengine by @yinfan98 in #1064
  • Use TM_LOG_LEVEL environment variable first by @zhyncs in #1071
  • Speed up the initialization of w8a8 model for torch engine by @yinfan98 in #1088
  • Make logging.logger's behavior consistent with MMLogger by @irexyc in #1092
  • Remove owned_session for torch engine by @grimoire in #1097
  • Unify engine initialization in pipeline by @irexyc in #1085
  • Add skip_special_tokens in GenerationConfig by @grimoire in #1091 (see the sketch after this list)
  • Use default stop words for turbomind backend in pipeline by @irexyc in #1119
  • Add input_token_len to Response and update Response document by @AllentDan in #1115
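
A minimal sketch of the GenerationConfig and Response changes above, assuming the pipeline call signature of this release (a gen_config keyword and a Response object with text and input_token_len fields):

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2-chat-7b')

# skip_special_tokens controls whether special tokens are stripped from
# the decoded output (assumed default: True)
gen_config = GenerationConfig(max_new_tokens=128, skip_special_tokens=True)
response = pipe('hi, please introduce yourself', gen_config=gen_config)

print(response.text)             # generated text
print(response.input_token_len)  # length of the prompt in tokens (#1115)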

🐞 Bug fixes

  • Fix the fast tokenizer swallowing the prefix space when there are too many whitespaces by @AllentDan in #992
  • Fix turbomind CUDA runtime error "invalid argument" by @zhyncs in #1100
  • Add safety check for incremental decode by @AllentDan in #1094
  • Fix device type of get_ppl for turbomind by @RunningLeon in #1093
  • Fix pipeline init turbomind from workspace by @irexyc in #1126
  • Add dependency version check and fix ignore_eos logic by @grimoire in #1099
  • Change configuration_internlm.py to configuration_internlm2.py by @HIT-cwh in #1129

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.2.2...v0.2.3

LMDeploy Release V0.2.2

31 Jan 09:57
4a28f12

Highlight

  • The TurboMind engine's allocation strategy for the k/v cache has changed. The parameter cache_max_entry_count now means the proportion of free GPU memory rather than total GPU memory, and its default value is 0.8. This helps prevent OOM issues (see the configuration sketch after this list).
  • The pipeline API supports streaming inference. You may give it a try!
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2-chat-7b')
# stream_infer yields partial responses as tokens are generated
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
  • Add API key and SSL support to api_server
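
A minimal sketch of tuning the new k/v cache ratio, assuming the TurbomindEngineConfig API exported by this release:

from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count is now the fraction of *free* GPU memory reserved
# for the k/v cache (default 0.8); lower it if the GPU is shared with
# other processes.
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe('hello'))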

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.2.1...v0.2.2

LMDeploy Release V0.2.1

19 Jan 10:38
e96e2b4

What's Changed

💥 Improvements

🐞 Bug fixes

📚 Documentations

  • Add a guide about installation on CUDA 12+ platforms by @lvhan028 in #988

🌐 Other

Full Changelog: v0.2.0...v0.2.1

LMDeploy Release V0.2.0

17 Jan 02:00
b319dce

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.1.0...v0.2.0

LMDeploy Release V0.1.0

18 Dec 12:10
477f2db

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.0.14...v0.1.0

LMDeploy Release V0.1.0a2

06 Dec 06:50
fddad30

What's Changed

💥 Improvements

  • Unify prefill & decode passes by @lzhangzz in #775
  • Add CUDA 12.1 build check CI by @irexyc in #782
  • Automatically upload the CUDA 12.1 Python package to the release when a new tag is created by @irexyc in #784
  • Report the inference benchmark of models with different sizes by @lvhan028 in #794
  • Add chat template for Yi by @AllentDan in #779

🐞 Bug fixes

  • Fix early-exit condition in attention kernel by @lzhangzz in #788
  • Fix missed arguments when benchmarking static inference performance by @lvhan028 in #787
  • Fix extra colon in InternLMChat7B template by @C1rN09 in #796
  • Fix local kv head num by @lvhan028 in #806

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.1.0a1...v0.1.0a2