Fix MlasSgemmKernel: properly process more than 2 rows #22125

AlekseiNikiforovIBM · 2024-09-18T11:49:53Z

This change fixes multiple tests, such as QDQTransformerTests.MatMul_U8S8S8, on all architectures where an architecture-specific
optimized function is not yet available, such as s390x.

Description

Matrix B is packed in groups of 16 elements, so each new row of B starts 16 elements later. Also, when advancing to the next element of C, the index must increase by only 1 per increment of C.
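
The indexing rule described above can be illustrated with a small standalone sketch. This is not the MLAS kernel: `kPanelWidth`, `PackB`, `GemmPacked`, and `GemmNaive` are hypothetical names, and the panel width of 16 is an assumption taken from this description only. The sketch packs B into 16-column panels, walks each panel with a stride of 16 floats per row of B, and compares the result against a naive product for a matrix with more than 2 rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed panel width, taken from the PR description ("packed by 16 elements").
constexpr std::size_t kPanelWidth = 16;

// Pack a row-major K x N matrix B into panels of kPanelWidth columns.
// Inside a panel, the kPanelWidth values that belong to one row k are
// contiguous, so row k + 1 starts kPanelWidth floats later.
// Columns past N in the last panel are zero-padded.
std::vector<float> PackB(const std::vector<float>& B, std::size_t K, std::size_t N) {
    assert(B.size() == K * N);
    const std::size_t panels = (N + kPanelWidth - 1) / kPanelWidth;
    std::vector<float> packed(panels * K * kPanelWidth, 0.0f);
    for (std::size_t p = 0; p < panels; ++p) {
        for (std::size_t k = 0; k < K; ++k) {
            for (std::size_t j = 0; j < kPanelWidth; ++j) {
                const std::size_t n = p * kPanelWidth + j;
                if (n < N) {
                    packed[(p * K + k) * kPanelWidth + j] = B[k * N + n];
                }
            }
        }
    }
    return packed;
}

// C = A * B with A (M x K) and C (M x N) row-major, reading B from the packed buffer.
std::vector<float> GemmPacked(const std::vector<float>& A, const std::vector<float>& packedB,
                              std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            const std::size_t p = n / kPanelWidth;  // which panel holds column n
            const std::size_t j = n % kPanelWidth;  // position inside the panel row
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                // Stepping k by one moves kPanelWidth floats forward in the panel.
                acc += A[m * K + k] * packedB[(p * K + k) * kPanelWidth + j];
            }
            C[m * N + n] = acc;  // neighbouring columns of C are 1 element apart
        }
    }
    return C;
}

// Naive reference product on the unpacked row-major B.
std::vector<float> GemmNaive(const std::vector<float>& A, const std::vector<float>& B,
                             std::size_t M, std::size_t K, std::size_t N) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = acc;
        }
    }
    return C;
}
```

Using a row stride smaller than the panel width in the inner loop would read values that belong to a neighbouring column rather than to the next row of B, which is the kind of mismatch the description refers to. Whether the scalar kernel's buffer really uses a width of 16 is a property of the MLAS packing routine and is not established by this sketch.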

Motivation and Context

This change fixes the MLAS SGEMM fallback implementation for all architectures that do not have an architecture-specific implementation available, such as s390x.

microsoft-github-policy-service · 2024-09-18T11:50:03Z

@AlekseiNikiforovIBM please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

(default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
(when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

AlekseiNikiforovIBM · 2024-09-18T11:57:10Z

The test failures can be reproduced on x86_64 by disabling the optimized functions like this:

test-only patch for x86_64

diff --git a/cmake/onnxruntime_mlas.cmake b/cmake/onnxruntime_mlas.cmake
index cf23416943..343250b448 100644
--- a/cmake/onnxruntime_mlas.cmake
+++ b/cmake/onnxruntime_mlas.cmake
@@ -671,11 +671,11 @@ endif()
           set(MLAS_SOURCE_IS_NOT_SET 0)
         endif()
     endif()
-    if(NOT ONNXRUNTIME_MLAS_MULTI_ARCH AND MLAS_SOURCE_IS_NOT_SET)
-        file(GLOB_RECURSE mlas_platform_srcs
+    if(true)
+        file(GLOB_RECURSE mlas_platform_srcs2
           "${MLAS_SRC_DIR}/scalar/*.cpp")
     endif()
-    target_sources(onnxruntime_mlas PRIVATE ${mlas_platform_srcs})
+    target_sources(onnxruntime_mlas PRIVATE ${mlas_platform_srcs} ${mlas_platform_srcs2})
 endif()
 
 foreach(mlas_target ${ONNXRUNTIME_MLAS_LIBS})
diff --git a/onnxruntime/core/mlas/lib/mlasi.h b/onnxruntime/core/mlas/lib/mlasi.h
index 6f5db766b7..b2eeeaa9c2 100644
--- a/onnxruntime/core/mlas/lib/mlasi.h
+++ b/onnxruntime/core/mlas/lib/mlasi.h
@@ -358,6 +358,20 @@ size_t
     bool ZeroMode
     );
 
+typedef
+size_t
+(MLASCALL MLAS_GEMM_FLOAT_KERNEL_COMMON)(
+    const float* A,
+    const float* B,
+    float* C,
+    size_t CountK,
+    size_t CountM,
+    size_t CountN,
+    size_t lda,
+    size_t ldc,
+    float alpha
+    );
+
 #else
 
 #if defined(__aarch64__) && defined(__linux__)
@@ -727,6 +741,8 @@ extern "C" {
 #if defined(MLAS_TARGET_AMD64_IX86)
     MLAS_GEMM_FLOAT_KERNEL MlasGemmFloatKernelSse;
     MLAS_GEMM_FLOAT_KERNEL MlasGemmFloatKernelAvx;
+    MLAS_GEMM_FLOAT_KERNEL_COMMON MlasSgemmKernelZero;
+    MLAS_GEMM_FLOAT_KERNEL_COMMON MlasSgemmKernelAdd;
 #if defined(MLAS_TARGET_AMD64)
     MLAS_GEMM_FLOAT_KERNEL MlasGemmFloatKernelFma3;
     MLAS_GEMM_FLOAT_KERNEL MlasGemmFloatKernelAvx512F;
diff --git a/onnxruntime/core/mlas/lib/platform.cpp b/onnxruntime/core/mlas/lib/platform.cpp
index 4cd7faaa9e..9352dd62ad 100644
--- a/onnxruntime/core/mlas/lib/platform.cpp
+++ b/onnxruntime/core/mlas/lib/platform.cpp
@@ -285,7 +285,7 @@ Return Value:
     this->QuantizeLinearS4Kernel = MlasQuantizeLinearS4Kernel;
     this->QuantizeLinearU4Kernel = MlasQuantizeLinearU4Kernel;
 #ifndef __APPLE__
-    this->CastF16ToF32Kernel = &MlasCastF16ToF32KernelSse;
+    this->CastF16ToF32Kernel = nullptr;
 #endif  // __APPLE__
 
     this->NchwcBlockSize = 8;
@@ -308,7 +308,7 @@ Return Value:
     // Check if the processor supports SSE 4.1 instructions.
     //
 
-    if ((Cpuid1[2] & 0x80000) != 0) {
+    if (false) {
         this->GemmU8S8Dispatch = &MlasGemmU8S8DispatchSse41;
     }
 
@@ -318,7 +318,7 @@ Return Value:
     // Check if the processor supports the AVX and OSXSAVE features.
     //
 
-    if ((Cpuid1[2] & 0x18000000) == 0x18000000) {
+    if (false) {
 
         //
         // Check if the operating system supports saving SSE and AVX states.
@@ -330,7 +330,7 @@ Return Value:
 
             this->GemmFloatKernel = MlasGemmFloatKernelAvx;
 
-#if defined(MLAS_TARGET_AMD64)
+#if 0
 
             this->KernelM1Routine = MlasSgemmKernelM1Avx;
             this->KernelM1TransposeBRoutine = MlasSgemmKernelM1TransposeBAvx;
@@ -416,7 +416,7 @@ Return Value:
                     this->SQNBitGemmDispatch = &MlasSQNBitGemmDispatchAvx2vnni;
                 }
 
-#if !defined(ORT_MINIMAL_BUILD)
+#if 0
 
                 //
                 // Check if the processor supports AVX512F features and the
@@ -478,7 +478,7 @@ Return Value:
                 // Check if the processor supports AVX NE CONVERT.
                 //
                 if ((Cpuid7_1[3] & (0b1 << 5)) != 0) {
-                    this->CastF16ToF32Kernel = &MlasCastF16ToF32KernelAvx;
+                    this->CastF16ToF32Kernel = nullptr;
                 }
 #endif  // (defined(_MSC_VER) && (_MSC_VER >= 1933)) || (defined(__GNUC__) && (__GNUC__ >= 13))
 
diff --git a/onnxruntime/core/mlas/lib/qgemm.h b/onnxruntime/core/mlas/lib/qgemm.h
index 127aea9029..7949e036d8 100644
--- a/onnxruntime/core/mlas/lib/qgemm.h
+++ b/onnxruntime/core/mlas/lib/qgemm.h
@@ -871,7 +871,7 @@ MlasGemmQuantGetDispatch(
         GemmQuantDispatch = &MlasGemmQuantDispatchDefault;
     }
 
-#if defined(MLAS_TARGET_AMD64_IX86) || defined(MLAS_TARGET_LARCH64)
+#if 0
     if (!AIsSigned) {
         if (BIsSigned) {
             GemmQuantDispatch = GetMlasPlatform().GemmU8S8Dispatch;
diff --git a/onnxruntime/core/mlas/lib/sgemm.cpp b/onnxruntime/core/mlas/lib/sgemm.cpp
index 4d7a1ceb4e..83b4b51066 100644
--- a/onnxruntime/core/mlas/lib/sgemm.cpp
+++ b/onnxruntime/core/mlas/lib/sgemm.cpp
@@ -1061,7 +1087,7 @@ Return Value:
 
         size_t RowsHandled;
 
-#if defined(MLAS_TARGET_AMD64_IX86) || defined(MLAS_TARGET_POWER) || defined(MLAS_TARGET_LARCH64)
+#if 0
         RowsHandled = GetMlasPlatform().GemmFloatKernel(A, B, C, CountK, CountM, CountN, lda, ldc, alpha, ZeroMode);
 #else
         if (ZeroMode) {
@@ -1158,7 +1184,7 @@ Return Value:
 
     if (M == 1 && TransA == CblasNoTrans && alpha == 1.0f && (beta == 0.0f || beta == 1.0f)) {
 
-#if defined(MLAS_TARGET_AMD64)
+#if 0
 
         MLAS_SGEMM_KERNEL_M1_ROUTINE* SgemmKernelM1Routine;
 
@@ -1193,7 +1219,7 @@ Return Value:
 
     if (N == 1 && ldb == 1 && ldc == 1 && alpha == 1.0f && (beta == 0.0f || beta == 1.0f)) {
 
-#if defined(MLAS_TARGET_AMD64)
+#if 0
 
         MLAS_SGEMM_KERNEL_M1_ROUTINE* SgemmKernelM1Routine;

Full list of fixed tests on s390x:

fixed tests:

CPU_U8S8_Precision_Tests.QAttention
GraphTransformationTests.FuseConvBnAddMulFloat16
QDQTransformerTests.DQMatMulNotConvertedToMatMulNBits_ShapeMismatch
QDQTransformerTests.DQMatMulNotConvertedToMatMulNBits_ShapeMismatch_Cuda
QDQTransformerTests.DQMatMulConvertedToMatMulNBits
QDQTransformerTests.DQMatMulConvertedToMatMulNBits_Cuda
QDQTransformerTests.Conv_U8X8U8_Bias_Not_i32
QDQTransformerTests.Conv_U8X8S8
QDQTransformerTests.Conv_S8X8U8
QDQTransformerTests.Conv_S8X8S8
QDQTransformerTests.MatMul_U8U8U8
QDQTransformerTests.MatMul_U8S8S8
QDQTransformerTests.MatMul_U8U8S8
QDQTransformerTests.MatMul_U8S8U8
QDQTransformerTests.MatMul_S8S8S8
QDQTransformerTests.MatMul_S8U8U8
QDQTransformerTests.MatMul_S8U8S8
QDQTransformerTests.MatMul_S8S8U8
QDQTransformerTests.Gemm_U8U8U8
QDQTransformerTests.Gemm_U8S8S8
QDQTransformerTests.Gemm_U8U8S8
QDQTransformerTests.Gemm_U8S8U8
QDQTransformerTests.Gemm_S8S8S8
QDQTransformerTests.Gemm_S8S8U8
QDQTransformerTests.QLinearMatMul
QDQTransformerTests.MatMul_No_Fusion
QDQTransformerTests.MatMulIntegerToFloat
QDQTransformerTests.ConvAveragePoolReshape_Int8_Fail
QDQTransformerTests.DQForward_MutilpleSteps
InferenceSessionTests.TestBindCpu
InferenceSessionTests.TestTruncatedSequence
OrtModelOnlyTests.ValidateOrtFormatModelDoesNotRunOptimizersInFullBuild
OrtModelOnlyTests.UpdateOrtModelVersion
OrtModelOnlyTests.SerializeToOrtFormatMLOps
OrtModelOnlyTests.LoadOrtFormatModelMLOps
OrtModelOnlyTests.LoadOrtFormatModelMLOpsFromBuffer
OrtModelOnlyTests.LoadOrtFormatModelMLOpsFromBufferNoCopy
AttnLSTMTest.ForwardLstmWithBahdanauAMZeroAttention
AttnLSTMTest.ForwardLstmWithBahdanauAM
AttnLSTMTest.ForwardLstmWithBahdanauAMShortenSeqLength
AttnLSTMTest.ReverseLstmWithBahdanauAMShortenSeqLength
AttnLSTMTest.BidirectionLstmWithBahdanauAMShortenSeqLength
AttnLSTMTest.BidirectionLstmWithBahdanauAM2BatchShortenSeqLen
AttentionTest.AttentionBatch1
AttentionTest.AttentionBatch1WithQKVAttr1
AttentionTest.AttentionBatch1WithQKVAttr2
AttentionTest.AttentionBatch1AttentionBias
AttentionTest.AttentionBatch2AttentionBias
AttentionTest.AttentionBatch2
AttentionTest.AttentionMaskPartialSequence
AttentionTest.AttentionMaskExceedSequence
AttentionTest.AttentionNoMaskIndex
AttentionTest.AttentionUnidirectional
AttentionTest.AttentionEmptyPastState
AttentionTest.AttentionPastStateBatch1
AttentionTest.AttentionPastStateBatch2
AttentionTest.AttentionPastStateBatch2WithPadding
AttentionTest.AttentionBatch2MaskIndex2
AttentionTest.AttentionRightPaddingMaskIndex2
AttentionTest.AttentionLeftPaddingMaskIndex2
AttentionTest.AttentionBatch2LeftPaddingMaskIndex2
AttentionTest.Attention3DMask
AttentionTest.AttentionBatch2AttentionMask
AttentionTest.AttentionUnidirectional3DMask
AttentionTest.AttentionUnidirectionalAttentionMask
AttentionTest.AttentionWithNormFactor
AttentionTest.AttentionMask1DEndNoWord
AttentionTest.AttentionMask1DNoWord
AttentionTest.AttentionMask2DNoWord
AttentionTest.AttentionMask3DNoWord
AttentionTest.AttentionDummyMask2D
AttentionTest.AttentionMaskIndexOutOfRange
AttentionTest.AttentionPrunedModel
AttentionTest.SharedPrepackedWeights
ContribOpTest.WordConvEmbedding
ContribOpTest.WordConvEmbedding_valid_attribute
MathOpTest.MatMulFloatType
MathOpTest.MatMulFloatTypeInitializer
MathOpTest.MatMulSharedPrepackedWeights
FusedConvTest.Conv2D_HardSigmoid
FusedConvTest.Conv2D_Relu
FusedConvTest.Conv2D_Bias_Relu
FusedConvTest.Cpu_Conv2D_Bias_Z_Relu
FusedMatMulOpTest.FloatTypeNoTranspose
FusedMatMulOpTest.FloatTypeTransposeA
FusedMatMulOpTest.FloatTypeTransposeB
FusedMatMulOpTest.FloatTypeTransposeAB
FusedMatMulOpTest.FloatTypeScale
FusedMatMulOpTest.FloatTypeTransposeBatch
MultiHeadAttentionTest.CrossAttention_Batch2_HeadSize16_8
MultiHeadAttentionTest.CrossAttention_Batch1_HeadSize16
MultiHeadAttentionTest.CrossAttention_Batch1_HeadSize8
MultiHeadAttentionTest.CrossAttentionWithPast
MultiHeadAttentionTest.CrossAttention_DiffSequenceLengths
MultiHeadAttentionTest.SelfAttention_WithPastAndPresent_NoMask_NoAttnBias
QAttentionTest.QAttentionBatch1
QAttentionTest.QAttentionBatch2
QAttentionTest.QAttentionMaskExceedSequence
QAttentionTest.QAttentionNoMaskIndex
QAttentionTest.QAttentionUnidirectional_U8U8
QAttentionTest.QAttentionUnidirectional_U8S8
QAttentionTest.QAttentionPastState_u8u8
QAttentionTest.QAttentionPastState_u8s8
QAttentionTest.QAttentionPrunedModel
QAttentionTest.SharedPrepackedWeights
MLOpTest.LinearClassifierMulticlass
MLOpTest.LinearClassifierMulticlassProb
MLOpTest.LinearClassifierMulticlassProbSigmoid
MLOpTest.LinearClassifierBinary
MLOpTest.LinearClassifierBinaryWithLabels
MLOpTest.LinearClassifierMulticlassInt64Input
MLOpTest.LinearClassifierMulticlassInt32Input
MLOpTest.LinearClassifierMulticlassDoubleInput
MLOpTest.SVMClassifierMulticlassLinearSVC
MLOpTest.SVMClassifierLinear
MLOpTest.SVMRegressorSVC
MLOpTest.SVMRegressorNuSVC
MLOpTest.SVMRegressorNuSVCPolyKernel
MLOpTest.SVMRegressorLinear
Einsum.ExplicitEinsumAsMatmul
Einsum.ExplicitEinsumAsMatmulNhcw
Einsum.ExplicitEinsumAsMatmulNhcwTransposeA
Einsum.ExplicitEinsumAsMatmulNhcwTransposeB
Einsum.ExplicitEinsumAsMatmul_Multi_Input
Einsum.ExplicitEinsumAsBatchedMatmul
Einsum.ExplicitEinsumAsBatchedMatmulWithBroadcasting_0
Einsum.ExplicitEinsumAsBatchedMatmulWithBroadcasting_1
Einsum.ExplicitEinsumAsMatmul_OutputTransposed
Einsum.ImplicitEinsumAsMatmul
Einsum.ImplicitEinsumAsMatmul_Multi_Input
Einsum.ImplicitEinsumAsBatchedMatmul
Einsum.ImplicitEinsumAsBatchedMatmulWithBroadcasting_0
Einsum.DiagonalWithMatmul
Einsum.ExplicitEinsumAsTensorContraction
Einsum.ExplicitEinsumAsTensorContractionReshapeFinal
Einsum.ExplicitEinsumAsTensorContractionReshapeLeft
Einsum.ExplicitEinsumAsTensorContractionSameInput
Einsum.ImplicitEinsumAsTensorContraction
Einsum.EinsumTransposeMatMulTwoInputsTestSuite
GemmOpTest.SharedPrepackedWeights
GemmOpTypedTests/0.TestGemmScalarBroadcast
GemmOpTypedTests/0.TestGemm2DBroadcast_2
GemmOpTypedTests/0.TestGemmFalseBroadcast
GemmOpTypedTests/0.TestGemmBroadcast
GemmOpTypedTests/0.TestGemmTrans
GemmOpTypedTests/0.TestGemmTransB
GemmOpTypedTests/0.TestGemmTransB_1
GemmOpTypedTests/0.TestGemmAlpha
GemmOpTypedTests/0.TestGemmBeta
GemmOpTypedTests/0.TestGemmNaN
GemmOpTypedTests/0.TestGemmAlphaBeta
GemmOpTypedTests/0.TestGemm2DBroadcast_1
GemmOpTypedTests/0.TestGemmNoTrans
GemmOpTypedTests/0.MissingBias
GemmOpTypedTests/0.TestGemmWithAlphaOpset11
ConvTest.Conv1D_2
ConvTest.Conv1D_Bias
ConvTest.Conv2D_1
ConvTest.Conv2D_Bias_1
ConvTest.Conv2D_Bias_2
ConvTest.Depthwise2D_Bias_Group15
ConvTest.Conv1D_asymmetric_padding
ConvTest.Conv_AutoPad_with_non_default_strides
ConvTransposeTest.ConvTranspose_1D
ConvTransposeTest.ConvTranspose_2D_C2
ConvTransposeTest.ConvTranspose_2D_OutputShape_1
ConvTransposeTest.ConvTranspose_1D_OutputShape_1_group_2_for_transpose_path
ConvTransposeTest.ConvTranspose_2D_OutputShape_1_group_2_for_transpose_path
ConvTransposeTest.ConvTranspose_onnx2
ConvTransposeTest.ConvTranspose_onnx_group
ConvTransposeTest.ConvTranspose_DefaultStridesAndDilations
ConvTransposeTest.SharedPrepackedWeights
ReductionOpTest.ReduceSum_KRK_parallel
GRUTest.ForwardDefaultActivationsSimpleWeightsNoBiasTwoRows
GRUTest.ReverseDefaultActivationsSimpleWeightsNoBiasTwoRows
GRUTest.BidirectionalDefaultActivationsSimpleWeightsNoBias
GRUTest.BidirectionalDefaultActivationsSimpleWeightsNoBiasLinearBeforeReset
GRUTest.ForwardDefaultActivationsSimpleWeightsWithBiasBatchParallel
GRUTest.ForwardDefaultActivationsSimpleWeightsWithBiasBatchParallelLinearBeforeReset
GRUTest.ReverseDefaultActivationsSimpleWeightsWithBiasBatchParallelLinearBeforeReset
GRUTest.ForwardDefaultActivationsSimpleWeightsWithBiasLinearBeforeReset
GRUTest.ReverseDefaultActivationsSimpleWeightsWithBiasLinearBeforeReset
GRUTest.ONNXRuntime_TestGRUOpForwardBasic
GRUTest.ONNXRuntime_TestGRUOpBackwardBasic
GRUTest.ONNXRuntime_TestGRUOpBidirectionalBasic
GRUTest.ONNXRuntime_TestGRUOpForwardActivation
GRUTest.ONNXRuntime_TestGRUOpForwardInitialHiddenState
GRUTest.ONNXRuntime_TestGRUOpForwardBatch
GRUTest.ONNXRuntime_TestGRUOpForwardBatchLinearBeforeReset
GRUTest.ONNXRuntime_TestGRUOpGrowBatchSequenceLength
GRUTest.ONNXRuntime_TestGRUOpGrowBatchSequenceLengthLinearBeforeReset
GRUTest.ONNXRuntime_TestGRUOpSequenceLengthWithBidirectionalLinearBeforeResetB1
GRUTest.ONNXRuntime_TestGRUOpSequenceLengthWithBidirectionalLinearBeforeResetB2
GRUTest.ONNXRuntime_TestGRUOpSequenceLengthWithBidirectionalLinearBeforeReset
GRUTest.ONNXRuntime_TestGRUOpShorterSeqInMiddle
GRUTest.ONNXRuntime_TestGRUOpZeroSeqInMiddle
GRUTest.ONNXRuntime_TestGRUOpSequenceLengthWithPartialZero
GRUTest.ONNXRuntime_TestGRUOpSequenceLengthShorterThanInputSequenceLength
GRUTest.ONNXRuntime_TestGRUPositiveActivationClipping
LSTMTest.ReverseSimpleWeightsNoBiasTwoRows
LSTMTest.BidirectionalSimpleWeightsNoBiasTwoRows
LSTMTest.MixedSequenceLengthsReverse
LSTMTest.BatchParallelFalseSeqLengthGreaterThanOne
LSTMTest.LargeBatchWithClip
LSTMTest.ONNXRuntime_TestLSTMForwardPeepHole
LSTMTest.ONNXRuntime_TestLSTMBidirectionalBasic
LSTMTest.ONNXRuntime_TestLSTMForwardNoBiasUsePeepholes
LSTMTest.ONNXRuntime_TestLSTMForwardInputForget
LSTMTest.ONNXRuntime_TestLSTMForwardClip
LSTMTest.ONNXRuntime_TestLSTMBackward
LSTMTest.ONNXRuntime_TestLSTMBackward_gpu
LSTMTest.ONNXRuntime_TestLSTMForwardHiddenState
LSTMTest.ONNXRuntime_TestLSTMForwardCellState
LSTMTest.ONNXRuntime_TestLSTMActivation
LSTMTest.ONNXRuntime_TestLSTMBatchReallocation
LSTMTest.ONNXRuntime_TestLSTMOutputWrite
LSTMTest.ONNXRuntime_TestLSTMSequenceLengthPartialZeros
LSTMTest.ONNXRuntime_TestLSTMSequenceLengthShorterThanInputSequenceLength
LSTMTest.ONNXRuntime_TestLSTMSequenceLengthShorterThanInputSequenceLengthNoP
LSTMTest.ONNXRuntime_TestLSTMShorterSeqInMiddle
LSTMTest.ONNXRuntime_TestLSTMZeroSeqInMiddle
RNNTest.RNN_bidirectional_bias_initial_zigged_batch
RNNTest.RNN_bidirectional_zigged_batch
RNNTest.RNN_reverse_direction_zigged_batch
RNNTest.RNN_forward_direction_zigged_batch
RNNTest.RNN_bidirectional_0
RNNTest.RNN_bidirectional_1
RNNTest.RNN_reverse_direction
RNNTest.RNN_bidirectional_with_sequence_lens
InternalTestingEP.TestSaveAndLoadOrtModel
InternalTestingEP.TestLoadOrtModel
InternalTestingEP.TestLoadOrtModelWithReducedOpCoverage
MathGemmTests/MathGemmTest.GemmNoTransNoTrans
MathGemmTests/MathGemmTest.GemmNoTransTrans
EinsumTransposeMatMulThreeInputsTests/EinsumTransposeMatMulThreeInputsTest.EinsumTransposeMatMulThreeInputsTestSuite

yufenglee · 2024-09-19T05:39:59Z

onnxruntime/core/mlas/lib/scalar/SgemmKernelScalar.cpp

-            BElements01 = B[5];
-            BElements02 = B[6];
-            BElements03 = B[7];
+            BElements00 = b[16];


In the scalar kernel, the packing width is 4, not 16.

yufenglee · 2024-09-19T05:41:08Z

This change fixes multiple tests like QDQTransformerTests.MatMul_U8S8S8, for all architectures where architecture-specific optimized function is not available yet, like s390x.

Description

Matrix B is packed by 16 elements, thus new row starts 16 items later. Also, for next C increment index only by 1 for each increment of C.

Motivation and Context

This change fixes mlas sgemm fallback implementation for all architectures which don't have architecture-specific implementations available, like s390x.

Why is it related to MatMul_U8S8S8, a quantization fusion?

Fix MlasSgemmKernel: properly process more than 2 rows

b9ee2c7

This change fixes multiple tests like QDQTransformerTests.MatMul_U8S8S8, for all architectures where architecture-specific optimized function is not available yet, like s390x.

AlekseiNikiforovIBM requested a review from a team as a code owner September 18, 2024 11:49

yufenglee reviewed Sep 19, 2024
