Fix MlasSgemmKernel: properly process more than 2 rows #22125
This change fixes multiple tests, such as QDQTransformerTests.MatMul_U8S8S8, on all architectures for which an architecture-specific optimized function is not available yet, such as s390x.
@AlekseiNikiforovIBM please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
The test failures can be reproduced on x86_64 by disabling the optimized functions with a test-only patch for x86_64.
Full list of fixed tests on s390x: fixed tests:
BElements01 = B[5];
BElements02 = B[6];
BElements03 = B[7];
BElements00 = B[16];
In the scalar implementation, the packing width is 4, not 16.
Why is this related to MatMul_U8S8S8, a quantization fusion?
Description
Matrix B is packed by 16 elements, so each new row starts 16 items later. Also, when moving to the next C element, the index should be incremented by only 1 for each increment of C.
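To make the row-stride claim concrete, here is a minimal sketch of a panel-packed B layout. It is not the MLAS code: `kPanelWidth`, `PackB`, and `PackedIndex` are hypothetical names, and the panel width of 16 is an assumption taken from the description above (a reviewer states that the scalar path packs by 4). The point is only that, inside a panel, consecutive rows are one panel width apart.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed panel width: 16 per the PR description (hypothetical constant,
// not taken from the MLAS sources).
constexpr std::size_t kPanelWidth = 16;

// Pack a row-major K x N matrix into column panels of kPanelWidth columns.
// Each panel stores K rows of kPanelWidth floats; the last panel is zero-padded.
std::vector<float> PackB(const std::vector<float>& b, std::size_t k, std::size_t n) {
    const std::size_t panels = (n + kPanelWidth - 1) / kPanelWidth;
    std::vector<float> packed(panels * k * kPanelWidth, 0.0f);
    for (std::size_t p = 0; p < panels; ++p) {
        for (std::size_t row = 0; row < k; ++row) {
            for (std::size_t col = 0; col < kPanelWidth; ++col) {
                const std::size_t src_col = p * kPanelWidth + col;
                if (src_col < n) {
                    packed[(p * k + row) * kPanelWidth + col] = b[row * n + src_col];
                }
            }
        }
    }
    return packed;
}

// Offset of logical element (row, col) inside the packed buffer.
// Within one panel, moving to the next row adds exactly kPanelWidth.
std::size_t PackedIndex(std::size_t row, std::size_t col, std::size_t k) {
    const std::size_t panel = col / kPanelWidth;
    return (panel * k + row) * kPanelWidth + (col % kPanelWidth);
}
```

Under this assumed layout, a kernel that has consumed row 0 of a panel finds row 1 at offset `kPanelWidth`, not at offset 4; with a panel width of 4 the same reasoning would give a stride of 4 instead.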
Motivation and Context
This change fixes the MLAS SGEMM fallback implementation for all architectures that do not have an architecture-specific implementation available, such as s390x.