Inference tasks and milestones #11

Open
jlamypoirier opened this issue Jan 26, 2023 · 0 comments

jlamypoirier commented Jan 26, 2023

[WIP]
We want to achieve and demonstrate state-of-the-art inference throughputs and latencies for our models. Here is a list of milestones and tasks. These are not necessarily in order, we can (and should) already look into the later milestones.

Milestone 1: Make a starter implementation of MQA and add it to BigCode transformers. Agreeing on a common implementation will be crucial for the next steps. (bigcode-project/transformers#4)

  • Task 1.1: Implement a GPT2 model with MHA and MQA within BigCode transformers. We should keep support for MHA so we can compare against an equally optimized implementation (@bigximik, @jlamypoirier, @mayank31398); see the MQA sketch after this list.
  • Task 1.2: Add basic profiling support to our benchmarking code (@jlamypoirier). Profiling and misc #10
  • Task 1.3: Validate, profile, and add simple optimizations to the model at the ~1B scale, e.g. SantaCoder (@jlamypoirier).
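
For reference, a minimal sketch of the MHA vs. MQA difference the Task 1.1 implementation needs to capture (shapes and names here are illustrative, not the actual bigcode-project/transformers code):

```python
import torch

def attention(q, k, v):
    # q: (batch, heads, seq, head_dim); k and v either match or broadcast over the head dim.
    scores = torch.matmul(q, k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

batch, heads, seq, head_dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq, head_dim)

# MHA: every query head has its own key/value head, so the KV cache scales with `heads`.
k_mha = torch.randn(batch, heads, seq, head_dim)
v_mha = torch.randn(batch, heads, seq, head_dim)
out_mha = attention(q, k_mha, v_mha)

# MQA: a single key/value head is shared by all query heads, shrinking the KV cache
# (and its memory traffic) by a factor of `heads`, which is the main inference win.
k_mqa = torch.randn(batch, 1, seq, head_dim)  # broadcasts over the head dimension
v_mqa = torch.randn(batch, 1, seq, head_dim)
out_mqa = attention(q, k_mqa, v_mqa)

print(out_mha.shape, out_mqa.shape)  # both: torch.Size([2, 8, 16, 64])
```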

Milestone 2: Turn our starter implementation into a strong baseline.

Milestone 3: Scaling up

  • Task 3.1: Look into alternative libraries (semi-optional)
    • Try inference with Megatron
    • Add MQA support to DeepSpeed
    • Other suggestions?
  • Task 3.2: Collaborate with the training team to determine our scaling needs and the target model configurations.
  • Task 3.3: Add support for tensor model parallelism (see the sketch after this list). This will likely involve an alternative library. It will be needed to reduce latency for bigger models, and possibly for memory depending on the target model size and hardware (we can fit roughly 40B parameters in fp16 on an A100).
  
  • Task 3.4: Optimize for bigger models.
  • Task 3.5: Benchmark the bigger models.
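
As a rough illustration of what Task 3.3 involves, here is a minimal sketch of Megatron-style tensor parallelism for a transformer MLP block. It assumes a torch.distributed process group with one rank per GPU is already initialized; the real implementation would come from whichever library we adopt in Task 3.1.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Splits the output dimension across ranks; each rank computes a slice of the activations."""
    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        assert out_features % world_size == 0
        self.linear = nn.Linear(in_features, out_features // world_size)

    def forward(self, x):
        return self.linear(x)  # no communication needed on the way in

class RowParallelLinear(nn.Module):
    """Splits the input dimension across ranks; partial outputs are summed with an all-reduce."""
    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        assert in_features % world_size == 0
        self.linear = nn.Linear(in_features // world_size, out_features)

    def forward(self, x_shard):
        partial = self.linear(x_shard)
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # one all-reduce per MLP block
        return partial

class ParallelMLP(nn.Module):
    """Transformer MLP with its weights split across `world_size` GPUs."""
    def __init__(self, hidden, world_size):
        super().__init__()
        self.up = ColumnParallelLinear(hidden, 4 * hidden, world_size)
        self.down = RowParallelLinear(4 * hidden, hidden, world_size)

    def forward(self, x):
        # Each rank holds 1/world_size of the weights, so models that don't fit
        # on one GPU in fp16 can be spread across several, and latency drops
        # because the matmuls are smaller per rank.
        return self.down(torch.nn.functional.gelu(self.up(x)))
```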

Milestone 4: Deployment

  • Task 4.1: Optimize end-to-end model performance
    • Optimize tokenization
    • Optimize decoding
    • Run them asynchronously whenever possible, i.e. in parallel with GPU ops for other batches (see the sketch after this list)
  • Task 4.2: Use a fast inference server (HF inference, BigScience inference, DeepSpeed Inference, NVIDIA Triton?)
  • Task 4.3: Integrate our optimized model into HF transformers. [WIP] Adding GPT2 with Multi Query Attention huggingface/transformers#21253
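
To illustrate the asynchronous point in Task 4.1, CPU-side tokenization and decoding can be overlapped with GPU work using an ordinary thread pool; `tokenize`, `forward` and `decode` below are placeholders for the real tokenizer, model call and detokenization step, not actual library APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def serve_batches(batches, tokenize, forward, decode):
    """Overlap CPU-side tokenization and decoding with GPU forward passes."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = pool.submit(tokenize, batches[0])
        decoded = []
        for i in range(len(batches)):
            inputs = pending.result()
            if i + 1 < len(batches):
                # Tokenize the next batch on the CPU while the GPU runs this one.
                pending = pool.submit(tokenize, batches[i + 1])
            outputs = forward(inputs)                     # GPU-bound work
            decoded.append(pool.submit(decode, outputs))  # detokenize off the hot path
        return [d.result() for d in decoded]
```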