[WIP]
We want to achieve and demonstrate state-of-the-art inference throughput and latency for our models. Below is a list of milestones and tasks. They are not strictly ordered; we can (and should) already start looking into the later milestones.
Milestone 1: Make a starter implementation of MQA and add it to BigCode transformers. Agreeing on a common implementation will be crucial for the next steps. (bigcode-project/transformers#4)
Task 1.1: Implement a GPT2 model with both MHA and MQA within BigCode transformers. We should keep MHA support so we can compare against an equally optimized baseline (@bigximik, @jlamypoirier, @mayank31398). A minimal sketch of the MHA/MQA difference follows this milestone's tasks.
Task 1.3: Validate, profile and add simple optimizations for our model for a ~1B model such as SantaCoder (@jlamypoirier).
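For reference, here is a minimal sketch of MQA in plain PyTorch. It is illustrative only, not the BigCode transformers implementation: it omits causal masking, the KV cache, and dropout, and all names are placeholders.

```python
import torch
from torch import nn


class MultiQueryAttention(nn.Module):
    """Minimal MQA: per-head queries, a single shared K/V head.

    Illustrative only: omits causal masking, the KV cache, dropout,
    and the fused projections a production implementation would use.
    """

    def __init__(self, hidden_size: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = hidden_size // n_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        # K and V project to one head instead of n_heads: this is the whole
        # trick, and it shrinks the KV cache by a factor of n_heads.
        self.kv_proj = nn.Linear(hidden_size, 2 * self.head_dim)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)
        k, v = k.unsqueeze(1), v.unsqueeze(1)  # (b, 1, s, head_dim)
        # The single K/V head broadcasts across all query heads.
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        out = torch.softmax(scores, dim=-1) @ v
        return self.out_proj(out.transpose(1, 2).reshape(b, s, -1))


x = torch.randn(2, 16, 512)
print(MultiQueryAttention(512, 8)(x).shape)  # torch.Size([2, 16, 512])
```

Setting `n_heads` to the number of query heads while keeping a single K/V head is what makes the per-token KV cache `n_heads` times smaller than MHA, which is the main inference win.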
Milestone 2: Turn our starter implementation into a strong baseline.
Task 2.1: Verify our MQA implementation for correctness (one possible check is sketched after this milestone's tasks).
Task 2.2: Add complete support for SantaCoder models. The released checkpoints use a different version of the code, so some changes will be needed. We will also need to adapt our benchmarking code.
Task 2.3: Collaborate with the evaluation team to ensure a common codebase.
Task 2.5: After the other steps, benchmark inference.
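For Task 2.1, one cheap invariant to test: MQA computed via broadcasting must exactly match an MHA computation in which the shared K/V head is explicitly replicated to every query head. A minimal sketch (comparing generated text against the released SantaCoder checkpoints would be a complementary end-to-end check):

```python
import torch

torch.manual_seed(0)
b, n_heads, s, hd = 2, 8, 16, 64
q = torch.randn(b, n_heads, s, hd)
k = torch.randn(b, 1, s, hd)  # shared MQA key head
v = torch.randn(b, 1, s, hd)  # shared MQA value head


def sdpa(q, k, v):
    scores = q @ k.transpose(-2, -1) / hd ** 0.5
    return torch.softmax(scores, dim=-1) @ v


# MQA via broadcasting must equal plain MHA with the shared K/V head
# explicitly replicated to every query head.
out_mqa = sdpa(q, k, v)
out_ref = sdpa(q, k.expand(b, n_heads, s, hd), v.expand(b, n_heads, s, hd))
torch.testing.assert_close(out_mqa, out_ref)
print("MQA matches the replicated-KV reference")
```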
Milestone 3: Scaling up
Task 3.1: Look into alternative libraries (semi-optional)
Try inference with Megatron
Add MQA support to DeepSpeed
Other suggestions?
Task 3.2: Collaborate with the training team to determine our scaling needs and the target model configurations.
Task 3.3: Add support for tensor model parallelism, which will likely require an alternative library. It is needed to reduce latency for the bigger models, and possibly for memory, depending on the target model size and hardware (in fp16, at 2 bytes per parameter, a single 80GB A100 tops out around 40B parameters). See the head-sharding sketch after this milestone's tasks.
Task 3.4: Optimize for bigger models.
Task 3.5: Benchmark the bigger models.
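For Task 3.3, a convenient property of MQA is that the query heads shard naturally across ranks, while the single K/V head is small enough to replicate (or recompute) on every rank. The following is a single-process simulation of that head sharding; a real implementation would use torch.distributed, with a gather or all-reduce around the output projection. It only checks the sharding arithmetic, not an actual multi-GPU setup.

```python
import torch

b, n_heads, s, hd, world_size = 2, 8, 16, 64, 4
q = torch.randn(b, n_heads, s, hd)
k = torch.randn(b, 1, s, hd)  # the single K/V head, replicated on every rank
v = torch.randn(b, 1, s, hd)


def sdpa(q, k, v):
    scores = q @ k.transpose(-2, -1) / hd ** 0.5
    return torch.softmax(scores, dim=-1) @ v


# Each "rank" owns n_heads // world_size query heads and computes them
# independently against the shared K/V; torch.cat stands in for the
# cross-rank gather that would precede the output projection.
shards = [sdpa(q_shard, k, v) for q_shard in q.chunk(world_size, dim=1)]
out_tp = torch.cat(shards, dim=1)

torch.testing.assert_close(out_tp, sdpa(q, k, v))
print("sharded attention matches the single-device result")
```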
Milestone 4: Deployment
Task 4.1: Optimize end-to-end model performance
Optimize tokenization
Optimize decoding
Run them asynchronously whenever possible, i.e., in parallel with GPU ops for other batches (a minimal sketch follows this list)
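The last point can be prototyped with a single worker thread: while the main thread is blocked inside generation for batch i, the worker tokenizes batch i+1 (fast Rust tokenizers release the GIL, so the two genuinely overlap). A minimal sketch, assuming HF-style `model.generate` / `tokenizer` entry points; the names are stand-ins, not our final API:

```python
from concurrent.futures import ThreadPoolExecutor


def pipelined_generate(model, tokenizer, batches, **gen_kwargs):
    """Tokenize batch i+1 on the CPU while the GPU generates batch i."""

    def tokenize(texts):
        return tokenizer(texts, return_tensors="pt", padding=True)

    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(tokenize, batches[0])
        for i in range(len(batches)):
            inputs = pending.result().to(model.device)
            if i + 1 < len(batches):
                # Submitted before generate(), so it overlaps with GPU work.
                pending = pool.submit(tokenize, batches[i + 1])
            outputs = model.generate(**inputs, **gen_kwargs)
            # Decoding could be handed to the pool in the same way.
            yield tokenizer.batch_decode(outputs, skip_special_tokens=True)
```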
Task 4.2: Use a fast inference server (HF inference, BigScience inference, DeepSpeed inference, NVIDIA Triton?)