[draft] GQA MLFloat16 cpu #22102
base: main
Conversation
```diff
  }

  if (max < 0.0f) {
    max = 0.0f;
  }

  for (int i = 0; i < D; i++) {
-   y[i] = expf(x[i] - max);
+   y[i] = static_cast<T>(expf(static_cast<float>(x[i]) - max));
```
When T is float16, it will overflow easily. I think we cannot do the softmax in place for float16; a float buffer is needed.
Shouldn't expf(x[i] - max) always belong to (0, 1]?
Right, it will not overflow. I still think we need to keep the intermediate data as float/double to preserve accuracy: every cast from float to half loses precision, and that loss accumulates when we compute the sum below.
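For reference, a minimal sketch of the float-buffer variant being discussed (the function name is hypothetical and this is not the PR's implementation; the clamp of max to 0.0f and the static_cast conversions mirror the kernel in the diff above):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch only: softmax for half-precision data using a float scratch
// buffer, so the exp values and their running sum are never rounded
// down to half precision until the final store. T is expected to be
// MLFloat16 here; it must support static_cast to/from float.
template <typename T>
void SoftmaxWithFloatBuffer(const T* x, T* y, int D) {
  std::vector<float> buf(D);  // float scratch buffer

  // Max-reduction in float, clamped to 0.0f like the kernel above.
  float max = 0.0f;
  for (int i = 0; i < D; i++) {
    max = std::max(max, static_cast<float>(x[i]));
  }

  // exp(x - max) lies in (0, 1], so it cannot overflow; the sum is
  // accumulated in float to avoid repeated float->half rounding loss.
  float sum = 0.0f;
  for (int i = 0; i < D; i++) {
    buf[i] = expf(static_cast<float>(x[i]) - max);
    sum += buf[i];
  }

  // One float -> half rounding per element, at the very end.
  for (int i = 0; i < D; i++) {
    y[i] = static_cast<T>(buf[i] / sum);
  }
}
```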
```cpp
template <>
void GemmEx<MLFloat16, ThreadPool>(CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, ptrdiff_t, ptrdiff_t, ptrdiff_t,
                                   MLFloat16, const MLFloat16*, int, const MLFloat16*, int, MLFloat16,
                                   MLFloat16*, int, ThreadPool*) {
```
Could we support the following:
- A FP16, B FP16, C FP32 -- for QxK, so we can then use FP32 for the softmax
- A FP32, B FP16, C FP16 -- for SxV
We don't have an fp16 matmul implemented on CPU. I am thinking it may be better to convert the input and KV cache to fp32 in GQAAttention.
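To make that conversion route concrete, here is a minimal sketch (not the PR's code): widen both fp16 operands to fp32 once, call the fp32 GemmEx specialization that already exists on CPU, and narrow the output back. The function name HalfGemmViaFloat is hypothetical; it assumes row-major, non-transposed, packed operands (lda = K, ldb = ldc = N), takes alpha/beta as float for simplicity (the real specialization's stub above takes MLFloat16), and the header paths reflect my reading of the onnxruntime tree:

```cpp
#include <cstddef>
#include <vector>

#include "core/util/math.h"          // GemmEx<float, ThreadPool>, CBLAS_TRANSPOSE
#include "core/framework/float16.h"  // MLFloat16

using onnxruntime::MLFloat16;
using onnxruntime::concurrency::ThreadPool;

// Sketch only: run an fp16 GEMM through the existing fp32 path.
void HalfGemmViaFloat(std::ptrdiff_t M, std::ptrdiff_t N, std::ptrdiff_t K,
                      float alpha, const MLFloat16* A, const MLFloat16* B,
                      float beta, MLFloat16* C, ThreadPool* pool) {
  // Widen the half operands (and C, in case beta != 0) to float once.
  std::vector<float> a(M * K), b(K * N), c(M * N);
  for (std::ptrdiff_t i = 0; i < M * K; ++i) a[i] = static_cast<float>(A[i]);
  for (std::ptrdiff_t i = 0; i < K * N; ++i) b[i] = static_cast<float>(B[i]);
  for (std::ptrdiff_t i = 0; i < M * N; ++i) c[i] = static_cast<float>(C[i]);

  // fp32 GEMM: the specialization that is already implemented on CPU.
  onnxruntime::math::GemmEx<float, ThreadPool>(
      CblasNoTrans, CblasNoTrans, M, N, K,
      alpha, a.data(), static_cast<int>(K),
      b.data(), static_cast<int>(N),
      beta, c.data(), static_cast<int>(N), pool);

  // Narrow the result back to half: one rounding per output element.
  for (std::ptrdiff_t i = 0; i < M * N; ++i) C[i] = static_cast<MLFloat16>(c[i]);
}
```

The same shape also covers the mixed-precision variants suggested above: for QxK one would simply skip the final narrowing and hand the fp32 C directly to the softmax.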
Description
TODO: call mlas_gemm() inside GemmEx<MLFloat16, ThreadPool>()
Motivation and Context