[draft] GQA MLFloat16 cpu #22102
base: main
Conversation
```diff
  }

  if (max < 0.0f) {
    max = 0.0f;
  }

  for (int i = 0; i < D; i++) {
-   y[i] = expf(x[i] - max);
+   y[i] = static_cast<T>(expf(static_cast<float>(x[i]) - max));
```
When T is float16, it will overflow easily. I think we cannot do the softmax in place for float16; a float buffer is needed.
Shouldn't expf(x[i] - max) always belong to (0, 1]?
Right, it will not overflow. I still think we need to keep the intermediate data as float/double to preserve accuracy: every cast from float to half loses precision, and that loss accumulates when we compute the sum below.
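For reference, a minimal sketch of the float-buffer variant being discussed (the function name is hypothetical and this is not the PR's implementation; the clamp of max to 0.0f and the static_cast conversions mirror the kernel in the diff above):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch only: softmax for half-precision data using a float scratch
// buffer, so the exp values and their running sum are never rounded
// down to half precision until the final store. T is expected to be
// MLFloat16 here; it must support static_cast to/from float.
template <typename T>
void SoftmaxWithFloatBuffer(const T* x, T* y, int D) {
  std::vector<float> buf(D);  // float scratch buffer

  // Max-reduction in float, clamped to 0.0f like the kernel above.
  float max = 0.0f;
  for (int i = 0; i < D; i++) {
    max = std::max(max, static_cast<float>(x[i]));
  }

  // exp(x - max) lies in (0, 1], so it cannot overflow; the sum is
  // accumulated in float to avoid repeated float->half rounding loss.
  float sum = 0.0f;
  for (int i = 0; i < D; i++) {
    buf[i] = expf(static_cast<float>(x[i]) - max);
    sum += buf[i];
  }

  // One float -> half rounding per element, at the very end.
  for (int i = 0; i < D; i++) {
    y[i] = static_cast<T>(buf[i] / sum);
  }
}
```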
```cpp
template <>
void GemmEx<MLFloat16, ThreadPool>(CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, ptrdiff_t, ptrdiff_t, ptrdiff_t,
                                   MLFloat16, const MLFloat16*, int, const MLFloat16*, int, MLFloat16,
                                   MLFloat16*, int, ThreadPool*) {
```
Could we support the following:
- A FP16, B FP16, C FP32 -- for QxK, so we can then use FP32 for the softmax
- A FP32, B FP16, C FP16 -- for SxV
We don't have an fp16 matmul implemented on CPU. I am thinking it may be better to convert the input and KV cache to fp32 in GQAAttention.
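To make that conversion route concrete, here is a minimal sketch (not the PR's code): widen both fp16 operands to fp32 once, call the fp32 GemmEx specialization that already exists on CPU, and narrow the output back. The function name HalfGemmViaFloat is hypothetical; it assumes row-major, non-transposed, packed operands (lda = K, ldb = ldc = N), takes alpha/beta as float for simplicity (the real specialization's stub above takes MLFloat16), and the header paths reflect my reading of the onnxruntime tree:

```cpp
#include <cstddef>
#include <vector>

#include "core/util/math.h"          // GemmEx<float, ThreadPool>, CBLAS_TRANSPOSE
#include "core/framework/float16.h"  // MLFloat16

using onnxruntime::MLFloat16;
using onnxruntime::concurrency::ThreadPool;

// Sketch only: run an fp16 GEMM through the existing fp32 path.
void HalfGemmViaFloat(std::ptrdiff_t M, std::ptrdiff_t N, std::ptrdiff_t K,
                      float alpha, const MLFloat16* A, const MLFloat16* B,
                      float beta, MLFloat16* C, ThreadPool* pool) {
  // Widen the half operands (and C, in case beta != 0) to float once.
  std::vector<float> a(M * K), b(K * N), c(M * N);
  for (std::ptrdiff_t i = 0; i < M * K; ++i) a[i] = static_cast<float>(A[i]);
  for (std::ptrdiff_t i = 0; i < K * N; ++i) b[i] = static_cast<float>(B[i]);
  for (std::ptrdiff_t i = 0; i < M * N; ++i) c[i] = static_cast<float>(C[i]);

  // fp32 GEMM: the specialization that is already implemented on CPU.
  onnxruntime::math::GemmEx<float, ThreadPool>(
      CblasNoTrans, CblasNoTrans, M, N, K,
      alpha, a.data(), static_cast<int>(K),
      b.data(), static_cast<int>(N),
      beta, c.data(), static_cast<int>(N), pool);

  // Narrow the result back to half: one rounding per output element.
  for (std::ptrdiff_t i = 0; i < M * N; ++i) C[i] = static_cast<MLFloat16>(c[i]);
}
```

The same shape also covers the mixed-precision variants suggested above: for QxK one would simply skip the final narrowing and hand the fp32 C directly to the softmax.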
Description
TODO: call mlas_gemm() inside GemmEx<MLFloat16, ThreadPool>()
Motivation and Context