Smaug support #212

Merged: 4 commits into octoml:batch-serving, Feb 16, 2024

Conversation

@masahi (Member) commented on Feb 15, 2024:

It turned out that upstream MLC never supported multi-GPU for models that use a bias before / after attention, so I needed to define a sharding func for the bias.

@sunggg @vinx13
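
For context, here is a minimal sketch of the kind of transform a bias sharding func performs when the corresponding output dimension is sharded. The name `shard_1d_bias` and the numpy-based signature are illustrative assumptions, not the exact code added in this PR.

```python
import numpy as np

def shard_1d_bias(bias: np.ndarray, num_shards: int, rank: int) -> np.ndarray:
    """Illustrative sketch: give each shard the slice of a 1D bias that matches
    its slice of the weight's sharded output dimension.

    A fused QKV bias would additionally need its q/k/v segments regrouped per
    shard, the same way the rows of the fused QKV weight are regrouped.
    """
    assert bias.ndim == 1 and bias.shape[0] % num_shards == 0
    return np.split(bias, num_shards)[rank]
```

The output projection bias is deliberately left out of this treatment, for the reason discussed in the inline comments below.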

$ /opt/bin/cuda-reserve.py  --num-gpus 2 python serve/tests/test_engine.py --local-id  Smaug-72B-v0.1-q0f16-presharded-2gpu --max-num-batched-tokens 512       
2024-02-15 06:15:16 [info     ] StagingInferenceEngine.start   [mlc_serve.engine.staging_engine] func_name=start lineno=89 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/engine/staging_engine.py process=299061
2024-02-15 06:15:19 [info     ] Loading parameters from dist/Smaug-72B-v0.1-q0f16-presharded-2gpu. [mlc_serve.model.tvm_model] func_name=get_tvm_model lineno=70 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/model/tvm_model.py process=299259
2024-02-15 06:16:00 [info     ] Running memory profiling.      [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=559 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/model/tvm_model.py process=299259
2024-02-15 06:16:01 [info     ] Using 63 cache blocks.         [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=588 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/model/tvm_model.py process=299259
2024-02-15 06:16:01 [info     ] Allocated KV cache blocks.     [mlc_serve.model.tvm_model] func_name=init_tvm_model lineno=612 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/model/tvm_model.py process=299259
2024-02-15 06:16:01 [info     ] Model is initalized.           [mlc_serve.engine.staging_engine_worker] func_name=run_generation_loop_worker lineno=361 pathname=/home/masahi/projects/dev/mlc-llm/serve/mlc_serve/engine/staging_engine_worker.py process=299259
...
Prompt = 'Hello, my name is'
Generated 0-th sample = ' John. I'm a boy. I'm ten years old. I'm tall and thin. I'

Prompt = 'The capital of France is'
Generated 0-th sample = ' Paris. (对划线部分提问)
____ the capital of France?
考查特殊疑问句.'
(The Chinese in this sample reads roughly "ask a question about the underlined part" and "this tests special questions".)

Prompt = 'The president of the United States is a powerful man. But he can also be'
Generated 0-th sample = ' a lonely man. He can't just go out and have a quiet dinner with his family. He'

Prompt = 'The future of AI is full of promise. But we need to carefully'
Generated 0-th sample = ' consider the ethical implications of AI and ensure that it is developed and used responsibly.'

@@ -112,6 +131,7 @@ def moe_shard_gate_up_weight_scale(weight: relax.TensorStructInfo):

return {
"shard_qkv": shard_qkv_weight_scale,
"shard_qkv_bias": shard_bias,
"shard_mlp_k": shard_k_weight_scale,
"shard_o_proj_k": shard_k_weight_scale,
@masahi (Member, Author) commented on Feb 15, 2024:

I don't understand why the bias for the output projection must not be sharded. Initially I sharded it as well, but the result was incorrect. Then I remembered that the 1D scale for the output projection in FT quantization must also not be sharded (https://github.com/mlc-ai/mlc-llm/blob/main/mlc_llm/relax_model/commons.py#L316-L320), so I skipped the bias shard for the output proj and it worked.

@vinx13 (Member) commented on Feb 15, 2024:

If the sharding is done along the reduction dimension, the bias doesn't need to be sharded; instead, the bias needs to be added after the all-reduce, or divided by num_shards.
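
To make that concrete, here is a minimal self-contained numpy sketch of the argument. The two-shard setup, the shapes, and the plain Python sum standing in for the all-reduce are illustrative assumptions, not the engine's actual code path.

```python
import numpy as np

# When the o_proj weight is sharded along its reduction (input) dimension,
# each GPU produces a partial matmul result that is summed by an all-reduce.
# Adding the full bias on every shard therefore adds it num_shards times;
# the bias must be added once after the all-reduce, or pre-divided by num_shards.

rng = np.random.default_rng(0)
num_shards = 2
hidden = 8
x = rng.standard_normal(hidden)            # activation entering the projection
w = rng.standard_normal((hidden, hidden))  # projection weight
b = rng.standard_normal(hidden)            # projection bias

reference = x @ w + b  # single-GPU result

# Shard the reduction dimension of both the activation and the weight.
x_shards = np.split(x, num_shards)
w_shards = np.split(w, num_shards, axis=0)

# Wrong: each shard adds the full bias, so the bias is counted num_shards times.
wrong = sum(xs @ ws + b for xs, ws in zip(x_shards, w_shards))

# Right (option 1): add the bias once, after the "all-reduce" (the sum here).
right_after_allreduce = sum(xs @ ws for xs, ws in zip(x_shards, w_shards)) + b

# Right (option 2): each shard adds bias / num_shards before the all-reduce.
right_divided = sum(xs @ ws + b / num_shards for xs, ws in zip(x_shards, w_shards))

assert not np.allclose(wrong, reference)
assert np.allclose(right_after_allreduce, reference)
assert np.allclose(right_divided, reference)
```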

@sunggg (Member) left a review:

Thanks for the quick addition, @masahi!

@sunggg merged commit abe93a1 into octoml:batch-serving on Feb 16, 2024
1 check passed