
Flex scheduler #1142

Open · wants to merge 5 commits into main
Conversation

@yukavio yukavio commented Aug 18, 2024

Motivation

Implement a better dispatch scheduler for DP (data-parallel) mode, which dispatches new requests based on the remaining resources of each inference process. This helps the server achieve better TTFT at high request rates compared to the round-robin algorithm.

Modification

Checklist

  • Before submitting a PR for review, make sure it has at least passed verification in your local development environment.
  • Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  • Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  • Modify documentation as needed, such as docstrings or example tutorials.

@merrymercy
Contributor

Hi @yukavio, can you briefly describe the context of this PR?

@yukavio
Author

yukavio commented Aug 26, 2024

Hi @yukavio, can you briefly describe the context of this PR?

I am trying to implement a load balancing strategy that is better than round robin when using Data Parallel.
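To illustrate the idea, here is a minimal, hypothetical sketch (not SGLang's actual implementation; all names are illustrative) contrasting round-robin dispatch with a load-aware policy that routes each new request to the DP worker with the fewest outstanding tokens:

```python
import itertools


class RoundRobinDispatcher:
    """Baseline: cycle through workers in a fixed order, ignoring load."""

    def __init__(self, num_workers):
        self._cycle = itertools.cycle(range(num_workers))

    def pick(self, loads):
        # loads is ignored; workers are chosen strictly in turn.
        return next(self._cycle)


class LeastLoadedDispatcher:
    """Resources-aware: pick the worker with the fewest outstanding
    tokens, a rough proxy for remaining KV-cache capacity."""

    def pick(self, loads):
        # Return the index of the first worker with the minimum load.
        return min(range(len(loads)), key=loads.__getitem__)


loads = [120, 35, 480, 35]  # outstanding tokens per DP worker (made up)
print(LeastLoadedDispatcher().pick(loads))  # -> 1
```

The real scheduler would need per-worker load reporting (or estimation) from the inference processes, but the routing decision reduces to a choice like the one above.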

@merrymercy
Contributor

merrymercy commented Aug 26, 2024

If you can provide some context, high-level descriptions, and performance numbers, it can help us understand this PR better.

@yukavio
Author

yukavio commented Aug 27, 2024

If you can provide some context, high-level descriptions, and performance numbers, it can help us understand this PR better.

No problem. The overall strategy is not yet finalized, and I am still working on further iterative optimization. I will provide an overall description and the corresponding performance data once it is complete.

@yukavio
Author

yukavio commented Aug 30, 2024

We implemented a resources-aware scheduler for DP mode, which can be enabled with --load-balance-method resources_aware. We tested its performance against the round-robin algorithm on an 8*A100 (40G) node.
The benchmark command: python3 -m sglang.bench_serving --backend sglang --dataset-name random --tokenizer Qwen/Qwen2-7B --model Qwen/Qwen2-7B --random-output-len 1024 --random-input-len 4096 --random-range-ratio 0.5 --seed 1234 --num-prompts 20000 --request-rate 16.0

round_robin:
[benchmark screenshot: round_robin]

resources_aware:
[benchmark screenshot: resources_aware]

And if we change the num-continue-decode-step argument from 10 to 1 in the global config, we get better TTFT:
[benchmark screenshot: resources_aware2]

@zhyncs
Member

zhyncs commented Aug 30, 2024

Hi @yukavio Nice work! Could you resolve the conflicts? Thanks.

@zhyncs
Member

zhyncs commented Aug 30, 2024

@yukavio Also, it would be better to benchmark Llama 3.1 8B Instruct and Llama 3.1 70B Instruct. Thanks.

@yukavio
Author

yukavio commented Aug 30, 2024

@yukavio Also, it would be better to benchmark Llama 3.1 8B Instruct and Llama 3.1 70B Instruct. Thanks.

@zhyncs I can provide performance comparison results on Llama 3.1 8B later. However, the improvements in this PR are mainly based on DP. But for a 70B model, it requires multiple 8-GPU machines to start enough DP workers to conduct this test. For me, it's difficult to gather that many machines for testing.

@zhyncs
Member

zhyncs commented Aug 30, 2024

But for a 70B model, it requires multiple 8-GPU machines to start enough DP workers to conduct this test.

@Ying1123 May you help take a look? Thanks.

# """A scheduler which dispatch """


class ControllerMultiFlex:
revert the name change

@merrymercy merrymercy left a comment

please resolve the conflicts, fix the failed test cases, and add a new test case for data parallelism.

@@ -0,0 +1 @@
python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-1.8B-Chat --host 0.0.0.0 --port 8080 --mem-fraction-static 0.6 --chunked-prefill-size 512

no need to check this in?

@yukavio
Author

yukavio commented Sep 2, 2024

please resolve the conflicts, fix the failed test cases, and add a new test case for data parallelism.

I ran into some problems when merging the upstream branch and am trying to fix them. I will push a fix commit later and add a test case for it.

@yukavio
Author

yukavio commented Sep 12, 2024

please resolve the conflicts, fix the failed test cases, and add a new test case for data parallelism.

We hit memory problems after merging the latest main branch; details are in #1405. It looks like the latest main branch has some memory-management issues that did not occur in the older version.

@merrymercy
Contributor

#1405 (comment)

5 participants