Smaug support #212

Merged: 4 commits into octoml:batch-serving, Feb 16, 2024

Conversation

@masahi (Member) commented on Feb 15, 2024:

It turned out that upstream MLC never supported multi-GPU for models that use a bias before / after attention, so I needed to define a sharding func for the bias.

@sunggg @vinx13
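
For context, here is a minimal sketch of the kind of transform a bias sharding func performs when the corresponding output dimension is sharded. The name `shard_1d_bias` and the numpy-based signature are illustrative assumptions, not the exact code added in this PR.

```python
import numpy as np

def shard_1d_bias(bias: np.ndarray, num_shards: int, rank: int) -> np.ndarray:
    """Illustrative sketch: give each shard the slice of a 1D bias that matches
    its slice of the weight's sharded output dimension.

    A fused QKV bias would additionally need its q/k/v segments regrouped per
    shard, the same way the rows of the fused QKV weight are regrouped.
    """
    assert bias.ndim == 1 and bias.shape[0] % num_shards == 0
    return np.split(bias, num_shards)[rank]
```

The output projection bias is deliberately left out of this treatment, for the reason discussed in the inline comments below.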

$ /opt/bin/cuda-reserve.py  --num-gpus 2 python serve/tests/test_engine.py --local-id  Smaug-72B-v0.1-q0f16-presharded-2gpu --max-num-batched-tokens 512       
2024-02-15 06:15:16 [info     ] StagingInferenceEngine.start   [mlc_serve.engine.staging_engine] func_name=start lineno=89 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=299061
2024-02-15 06:15:19 [info     ] Loading parameters from dist/Smaug-72B-v0.1-q0f16-presharded-2gpu. [mlc_serve.model.tvm_model] func_name=get_tvm_model lineno=70 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/model/tvm_model.py process=299259
2024-02-15 06:16:00 [info     ] Running memory profiling.      [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=559 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/model/tvm_model.py process=299259
2024-02-15 06:16:01 [info     ] Using 63 cache blocks.         [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=588 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/model/tvm_model.py process=299259
2024-02-15 06:16:01 [info     ] Allocated KV cache blocks.     [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=612 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/model/tvm_model.py process=299259
2024-02-15 06:16:01 [info     ] Model is initalized.           [mlc_serve.engine.staging_engine_worker] func_name=run_generation_loop_worker lineno=361 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/engine/staging_engine_worker.py process=299259
...
Prompt = 'Hello, my name is'
Generated 0-th sample = ' John. I'm a boy. I'm ten years old. I'm tall and thin. I'

Prompt = 'The capital of France is'
Generated 0-th sample = ' Paris. (对划线部分提问)
____ the capital of France?
考查特殊疑问句.'
(The Chinese in this sample reads roughly "ask a question about the underlined part" and "this tests special questions".)

Prompt = 'The president of the United States is a powerful man. But he can also be'
Generated 0-th sample = ' a lonely man. He can't just go out and have a quiet dinner with his family. He'

Prompt = 'The future of AI is full of promise. But we need to carefully'
Generated 0-th sample = ' consider the ethical implications of AI and ensure that it is developed and used responsibly.'

@@ -112,6 +131,7 @@ def moe_shard_gate_up_weight_scale(weight: relax.TensorStructInfo):

return {
"shard_qkv": shard_qkv_weight_scale,
"shard_qkv_bias": shard_bias,
"shard_mlp_k": shard_k_weight_scale,
"shard_o_proj_k": shard_k_weight_scale,
@masahi (Member, Author) commented on Feb 15, 2024:

I don't understand why the bias for the output projection must not be sharded. Initially I sharded it as well, but the result was incorrect. Then I remembered that the 1D scale for the output projection in FT quantization must also not be sharded (https://github.com/mlc-ai/mlc-llm/blob/main/mlc_llm/relax_model/commons.py#L316-L320), so I skipped the bias shard for the output proj and it worked.

@vinx13 (Member) commented on Feb 15, 2024:

If the sharding is done along the reduction dimension, the bias doesn't need to be sharded; instead, the bias needs to be added after the all-reduce, or divided by num_shards.
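
To make that concrete, here is a minimal self-contained numpy sketch of the argument. The two-shard setup, the shapes, and the plain Python sum standing in for the all-reduce are illustrative assumptions, not the engine's actual code path.

```python
import numpy as np

# When the o_proj weight is sharded along its reduction (input) dimension,
# each GPU produces a partial matmul result that is summed by an all-reduce.
# Adding the full bias on every shard therefore adds it num_shards times;
# the bias must be added once after the all-reduce, or pre-divided by num_shards.

rng = np.random.default_rng(0)
num_shards = 2
hidden = 8
x = rng.standard_normal(hidden)            # activation entering the projection
w = rng.standard_normal((hidden, hidden))  # projection weight
b = rng.standard_normal(hidden)            # projection bias

reference = x @ w + b  # single-GPU result

# Shard the reduction dimension of both the activation and the weight.
x_shards = np.split(x, num_shards)
w_shards = np.split(w, num_shards, axis=0)

# Wrong: each shard adds the full bias, so the bias is counted num_shards times.
wrong = sum(xs @ ws + b for xs, ws in zip(x_shards, w_shards))

# Right (option 1): add the bias once, after the "all-reduce" (the sum here).
right_after_allreduce = sum(xs @ ws for xs, ws in zip(x_shards, w_shards)) + b

# Right (option 2): each shard adds bias / num_shards before the all-reduce.
right_divided = sum(xs @ ws + b / num_shards for xs, ws in zip(x_shards, w_shards))

assert not np.allclose(wrong, reference)
assert np.allclose(right_after_allreduce, reference)
assert np.allclose(right_divided, reference)
```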

@sunggg (Member) left a review:

Thanks for the quick addition, @masahi!

@sunggg merged commit abe93a1 into octoml:batch-serving on Feb 16, 2024
1 check passed