
[draft]GQA MLFloat16 cpu #22102

Draft · wants to merge 4 commits into main

Conversation

wangyems (Member) commented Sep 16, 2024

Description

TODO: call mlas_gemm() inside GemmEx<MLFloat16, ThreadPool>()

Motivation and Context

  }

  if (max < 0.0f) {
    max = 0.0f;
  }

  for (int i = 0; i < D; i++) {
-   y[i] = expf(x[i] - max);
+   y[i] = static_cast<T>(expf(static_cast<float>(x[i]) - max));
Contributor commented:

When T is float16, it will overflow easily.
I think we cannot do the softmax in place for float16; a float buffer is needed.

wangyems (Member, author) replied:

Shouldn't expf(x[i] - max) be in (0, 1]?

Contributor replied:

Right, it will not overflow.
But I think we need to keep the intermediate data as float/double to preserve accuracy. Every cast from float to half loses precision, and the loss accumulates when we compute the sum below.
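
To make the suggestion concrete, here is a minimal sketch (not the PR's code) of a softmax that keeps every intermediate value in a float scratch buffer and casts back to T only on the final store. The names x, y, and D follow the snippet above; the buffer allocation is simplified for illustration.

#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Sketch: softmax over D elements of half-precision type T (e.g. MLFloat16),
// accumulating in float so no rounding error builds up in the sum.
template <typename T>
void SoftmaxWithFloatBuffer(const T* x, T* y, int D) {
  std::vector<float> tmp(D);  // the float buffer suggested in the review

  // Max in float for numerical stability (clamped to 0 as in the snippet above).
  float max = std::numeric_limits<float>::lowest();
  for (int i = 0; i < D; i++) {
    max = std::max(max, static_cast<float>(x[i]));
  }
  if (max < 0.0f) {
    max = 0.0f;
  }

  // expf(x[i] - max) lies in (0, 1], so it cannot overflow; sum it in float.
  float sum = 0.0f;
  for (int i = 0; i < D; i++) {
    tmp[i] = expf(static_cast<float>(x[i]) - max);
    sum += tmp[i];
  }

  // Normalize in float; the only half-precision rounding is the final cast.
  for (int i = 0; i < D; i++) {
    y[i] = static_cast<T>(tmp[i] / sum);
  }
}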

template <>
void GemmEx<MLFloat16, ThreadPool>(CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, ptrdiff_t, ptrdiff_t, ptrdiff_t,
                                   MLFloat16, const MLFloat16*, int, const MLFloat16*, int, MLFloat16,
                                   MLFloat16*, int, ThreadPool*) {
tianleiwu (Contributor) commented Sep 17, 2024:

Could we support the following?

  • A FP16, B FP16, C FP32 -- for QxK, so we can use FP32 for Softmax
  • A FP32, B FP16, C FP16 -- for SxV
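
As a rough illustration of the first variant (and not an existing MLAS API), a naive row-major, non-transposed reference of A FP16 x B FP16 -> C FP32 might look like the following. An MLFloat16-style explicit conversion to float is assumed, and a real implementation would dispatch to an optimized kernel rather than this triple loop.

#include <cstddef>

// Naive reference: C (float, M x N) = alpha * A (fp16, M x K) * B (fp16, K x N) + beta * C.
// Accumulating in float means the QxK scores can feed an FP32 softmax directly.
template <typename Half>  // e.g. MLFloat16; assumed explicitly convertible to float
void GemmHalfHalfFloat(std::ptrdiff_t M, std::ptrdiff_t N, std::ptrdiff_t K,
                       float alpha, const Half* A, const Half* B,
                       float beta, float* C) {
  for (std::ptrdiff_t m = 0; m < M; ++m) {
    for (std::ptrdiff_t n = 0; n < N; ++n) {
      float acc = 0.0f;  // float accumulator: no fp16 rounding inside the dot product
      for (std::ptrdiff_t k = 0; k < K; ++k) {
        acc += static_cast<float>(A[m * K + k]) * static_cast<float>(B[k * N + n]);
      }
      // Follow BLAS convention: do not read C when beta is zero.
      C[m * N + n] = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * C[m * N + n];
    }
  }
}

The second variant (A FP32, B FP16, C FP16 for SxV) would be symmetric, with only the final store cast down to fp16.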

Member replied:

We don't have fp16 matmul implemented on CPU. I am thinking it may be better to convert the input and the KV cache to fp16 in GQAAttention.
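
For context, one possible stop-gap for the stubbed GemmEx<MLFloat16, ThreadPool> specialization above, until a native fp16 kernel exists, is to upcast to float and delegate to the float path. This is only a sketch: it assumes an existing GemmEx<float, ThreadPool> overload, MLFloat16's explicit float conversions, and tight leading dimensions, and it ignores allocator and transpose details.

#include <vector>

// Sketch: fp16 GemmEx that falls back to the existing float GEMM.
// Assumes row-major, no transpose, and tight leading dimensions
// (lda == K, ldb == N, ldc == N) so the conversion loops stay simple.
template <>
void GemmEx<MLFloat16, ThreadPool>(CBLAS_TRANSPOSE trans_a, CBLAS_TRANSPOSE trans_b,
                                   ptrdiff_t M, ptrdiff_t N, ptrdiff_t K,
                                   MLFloat16 alpha, const MLFloat16* A, int lda,
                                   const MLFloat16* B, int ldb, MLFloat16 beta,
                                   MLFloat16* C, int ldc, ThreadPool* tp) {
  // Upcast inputs (and C, in case beta != 0) into float scratch buffers.
  std::vector<float> a_fp32(M * K), b_fp32(K * N), c_fp32(M * N);
  for (ptrdiff_t i = 0; i < M * K; ++i) a_fp32[i] = static_cast<float>(A[i]);
  for (ptrdiff_t i = 0; i < K * N; ++i) b_fp32[i] = static_cast<float>(B[i]);
  for (ptrdiff_t i = 0; i < M * N; ++i) c_fp32[i] = static_cast<float>(C[i]);

  // Delegate to the float specialization (assumed to exist).
  GemmEx<float, ThreadPool>(trans_a, trans_b, M, N, K,
                            static_cast<float>(alpha), a_fp32.data(), lda,
                            b_fp32.data(), ldb, static_cast<float>(beta),
                            c_fp32.data(), ldc, tp);

  // Downcast the result; this final cast is the only fp16 rounding step.
  for (ptrdiff_t i = 0; i < M * N; ++i) C[i] = static_cast<MLFloat16>(c_fp32[i]);
}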
