Exploring SME performance of Apple M4

Introduction

The 2024 iPad Pro, featuring the Apple M4 chip, is the first device available to the public that supports ARM's scalable matrix extension (SME). Although Apple has included a matrix accelerator in its devices since 2019, it used a proprietary instruction set inaccessible to developers, who officially could only use Apple-provided numerical libraries. The M4 chip with SME changes this by allowing programmers to write low-level code that directly targets the matrix hardware, bringing potential performance improvements to scientific and machine learning algorithms.

This investigation aims to explore the performance of the M4 SME hardware. I am particularly interested in how it can be used to accelerate vector operations. The initial experiments only study the peak compute rate, without memory operations. I expect to extend this project as new aspects of the hardware are tested. Pull requests and commentary are welcome!

In the current version I am only investigating the SME unit in the P-core cluster, using a single CPU thread. While the E-core cluster has its own (slower) AMX unit, I will not discuss its performance or properties.

Running the tests

Build the Xcode project and run it on your M4 iPad. Note that you will need to set up your team identifier to correctly sign the executable. The microbenchmark code is in sme_tests.c. This file is generated by the Python script in tools.

Entering streaming mode with smstart sets all the state to zero. The registers are not initialized with known data prior to the test. I experimented with loading random data into the SVE registers, and it made no difference in performance. I have therefore omitted initialization for simplicity.
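For illustration, below is a minimal sketch (my own, not the repository's generated code) of how a measured kernel can be wrapped. It assumes a toolchain whose assembler accepts SME mnemonics (e.g., clang with -march=armv9-a+sme2).

```c
// Hedged sketch: smstart enters streaming mode and enables ZA (zeroing the
// state, as noted above), the instruction sequence under test runs, and
// smstop returns the core to its normal (Neon) mode.
static inline void run_sme_kernel(void)
{
    __asm__ volatile(
        "smstart                \n\t"   // enter streaming mode, enable and zero ZA
        // ... benchmarked FMOPA/FMLA sequence goes here ...
        "smstop                 \n\t"   // leave streaming mode
        ::: "memory");
}
```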

Brief overview of SME

SME builds on ARM's scalable vector extension (SVE) and extends its concepts to two-dimensional tiles. SVE features 32 scalable SIMD registers (Z0-Z31) of VL (vector length) bits each, and SME adds a two-dimensional tile storage called ZA of VL/8 x VL/8 bytes. One aspect of the scalable architecture is that VL is implementation-specific and can only be known at run time, which leads to all kinds of challenges for the programming model.

Another essential feature of SME is that the SME vector length can differ from the "regular" SVE SIMD vector length. The idea here is to allow the matrix hardware to be implemented as a separate unit dedicated to processing large amounts of data. ARM solves this by introducing a separate "SVE streaming mode." The implementation "switches" to the SME unit in streaming mode, which can feature a different vector length and new capabilities. When in streaming mode, only a subset of SVE instructions is supported (e.g., some advanced data swizzling instructions are excluded, as they could be expensive to implement on hardware dedicated to matrix processing). In contrast, the "regular" SVE mode uses the CPU SIMD registers and execution units, which are more flexible but might have lower total compute capability and no advanced matrix support. Apple M4 implements streaming-mode SVE/SME but does not support regular SVE (for CPU SIMD instructions, Apple still uses 128-bit Neon).
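Since VL is only known at run time, code typically queries it before setting up its loops. Below is a small sketch (my own, not from this repository) using the RDSVL instruction, which reports the streaming vector length even outside of streaming mode; on M4 it should return 64 bytes.

```c
#include <stdint.h>
#include <stdio.h>

// Hedged sketch: read the streaming vector length (SVL) in bytes.
// RDSVL Xd, #imm returns SVL * imm and does not require streaming mode.
static uint64_t streaming_vl_bytes(void)
{
    uint64_t vl;
    __asm__ volatile("rdsvl %0, #1" : "=r"(vl));
    return vl;
}

int main(void)
{
    printf("SVL = %llu bytes\n", (unsigned long long)streaming_vl_bytes());
    return 0;
}
```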

An important concept of SME is the tile storage ZA. The primary motivation for the rectangular tiles is matrix multiplication. In SME, matrix multiplication relies on outer products, where every element of one vector is multiplied with every element of another vector, producing a matrix of products. To implement matrix multiplication, we can compute the outer product of each column of the first matrix with the corresponding row of the second matrix and accumulate the results. An illustration of how the outer product works can be found in an Apple AMX patent.
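To make the decomposition concrete, here is a plain scalar C reference (my own illustration, not code from this repository) that computes a matrix product as a sum of outer products; an SME implementation performs the same accumulation with outer product instructions on VL-sized chunks.

```c
// C[M][N] += sum over k of (column k of A) x (row k of B)
void matmul_outer_products(int M, int N, int K,
                           const float *A,  /* M x K, row-major */
                           const float *B,  /* K x N, row-major */
                           float *C)        /* M x N, row-major, pre-zeroed */
{
    for (int k = 0; k < K; k++)          // one outer product per k
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```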

It is implied that the hardware itself is organized similarly, with configurable multiply-accumulate units and data paths that can combine vector inputs and produce a grid of results. SME supports a wide range of data types and combinations, including widening instructions such as accumulating products of 8-bit integers into 32-bit tiles.

Tile storage can be partitioned and accessed in different ways, such as vector-like slices, groups of slices, and rectangular sub-tiles. SME instructions can impose different data layouts, which must be considered when designing algorithms. The ARM Architecture Reference Manual sections B1.4.9 - B1.4.12 describe different layouts and partitions.

SME in Apple M4

Apple's matrix accelerator is a dedicated hardware unit rather than a part of the CPU core. There is one AMX/SME block per CPU cluster, shared by all CPU cores in that cluster. This has a number of interesting consequences. First, the matrix accelerator has access to much higher bandwidth than the individual CPU cores, since it is fed directly from the cluster L2. Second, the latency of executing SME instructions is high, as data communication needs to happen via the L2 cache (there is presumably a fast control bus to share the execution state). Third, one does not need to resort to parallel programming to harvest the performance benefits of SME: initial experiments suggest that a single CPU thread can already achieve the peak processing rate of the SME unit. Finally, those seeking the highest possible performance can use on-CPU SIMD (Neon) and SME simultaneously for an additional boost.

It is important to note that the Apple CPU does not support SVE SIMD instructions. A limited subset of SVE is supported by the SME block, which must be in streaming SVE mode to execute these instructions. The scalable vector length (VL) on M4 is 512 bits, meaning that each register is 64 bytes wide and the ZA storage is 64x64 = 4096 bytes. The SME unit can sustain 2000 GFLOPS of FP32 multiply-accumulate, which, assuming a 16x16 arrangement of compute units, gives an estimated operating frequency of 3.9 GHz (very similar to the multi-core frequency of the P-cores).
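For reference, the frequency estimate follows directly from the measured rate and the assumed 16x16 grid of MAC units:

$$
f \approx \frac{2000\ \mathrm{GFLOP/s}}{16 \times 16\ \mathrm{MACs} \times 2\ \mathrm{FLOP/MAC}} \approx 3.9\ \mathrm{GHz}
$$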

Apple M4 MACs can work with a wide range of data types, including 8-bit, 16-bit, and 32-bit integers, as well as 16-bit, 32-bit, and 64-bit floating point and 16-bit brain floating point. Not all data type combinations are supported. In particular, f16 is only supported when accumulating to f32, and i16 can only be accumulated into the wider i32 or i64 types.

As we will see, the SVE/SME abstraction is leaky. In theory, the attraction of a scalable vector/matrix instruction set is designing an algorithm once and running it with optimal performance on all kinds of future hardware. In practice, hardware can behave very differently. One notable example is using the Apple SME unit to accelerate vector operations. The most straightforward way is to use the FMLA instruction in streaming SVE mode, which performs vector multiplication with accumulation into a vector destination. However, as shown by the team at Uni Jena, this only reaches a disappointing 31 GFLOPS for the f32 data format, considerably less than what the Neon SIMD units of an M4 P-core are capable of. Does this mean that M4 SME is useless for vector operations? Not at all! As the execution units appear to be closely associated with the tile storage, we can achieve much better performance by using a variant of FMLA that accumulates into ZA storage instead (introduced in SME version 2). Using an FMLA that works on four pairs of SVE vectors simultaneously and accumulates into four ZA storage slices, we get a much more impressive 250 GFLOPS. These are the pitfalls that await low-level programmers trying to utilize these new features.

Results

SME features

The following SME features are reported for Apple M4:

  • FEAT_SME
  • FEAT_SME2
  • SME_F32F32
  • SME_BI32I32
  • SME_B16F32
  • SME_F16F32
  • SME_I8I32
  • SME_I16I32
  • FEAT_SME_F64F64
  • FEAT_SME_I16I64

Notably missing are 8-bit floating point support and operations on half-precision (16-bit) floating point other than accumulation into single precision (32-bit). Brain-float 16-bit floating point, on the other hand, is fully supported.

I do not know what SME_BI32I32 refers to. Possibly it is a typo in the feature string and is supposed to be I32I32, i.e., operations on 32-bit integers; it may also correspond to the BI32I32 field of ID_AA64SMFR0_EL1, which indicates support for the binary (1-bit) outer product instructions BMOPA and BMOPS.

SME matrix multiplication performance

SME matrix multiplication is done with outer products. A single outer product multiplies every element of one vector with every element of another and accumulates the results into a ZA tile. There are also widening forms of outer product instructions, e.g., multiplying two 32-wide fp16 vectors and accumulating into a 16x16 fp32 tile. The mismatch in data size is handled by treating the input vectors as matrices and accumulating the result of matrix multiplication. In the above case, the instruction will multiply a 16x2 fp16 matrix with a 2x16 fp16 matrix and accumulate the resulting 16x16 matrix into the tile.
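As an illustration of the widening semantics, here is a scalar C model for a 512-bit VL (my reading of the description above, not verified against hardware; it assumes the two fp16 sub-elements of each 32-bit container are paired with the corresponding tile row and column).

```c
// Model of a widening f16 -> f32 outer product with VL = 512 bits:
// the 32-element inputs are treated as a 16x2 and a 2x16 fp16 matrix,
// and their product is accumulated into a 16x16 fp32 tile.
void fmopa_f16f32_model(const _Float16 zn[32], const _Float16 zm[32],
                        float za[16][16])
{
    for (int row = 0; row < 16; row++)
        for (int col = 0; col < 16; col++)
            for (int k = 0; k < 2; k++)   // two sub-elements per 32-bit container
                za[row][col] += (float)zn[2 * row + k] * (float)zm[2 * col + k];
}
```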

For optimal use of the SME unit, it is crucial to understand that outer product instructions are pipelined. This means that to achieve the maximum possible compute rate, we must execute sequences of multiple independent instructions. A strategy to consider is accumulating into different ZA tiles (this is also pointed out by the Jena team). For instance, when accumulating to fp32, there are four tiles, ZA0-ZA3.
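For illustration, the loop body of such a benchmark might look roughly like the sketch below (a hedged reconstruction, not the repository's generated code): four independent FMOPA instructions, one per fp32 tile, with the predicates p0/p1 assumed to be all-true and the code assumed to already run in streaming mode with ZA enabled.

```c
// Four independent fp32 outer products keep the pipelined MAC array busy.
static inline void fmopa_f32_four_tiles(void)
{
    __asm__ volatile(
        "fmopa za0.s, p0/m, p1/m, z0.s, z1.s \n\t"   // accumulate into tile ZA0
        "fmopa za1.s, p0/m, p1/m, z2.s, z3.s \n\t"   // independent tile ZA1
        "fmopa za2.s, p0/m, p1/m, z4.s, z5.s \n\t"   // independent tile ZA2
        "fmopa za3.s, p0/m, p1/m, z6.s, z7.s \n\t"   // independent tile ZA3
        ::: "memory");
}
```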

The table below shows the results of executing the MOPA (outer product and accumulate) instruction for various data types and with different numbers of ZA tiles used for accumulation. The column "type" is the data type (two types are listed for widening operations). The column "ZA tiles" is the number of different tiles used for accumulation ('full' means that the entire ZA storage is used). Finally, "GFLOPS" is the measured compute rate, where a single MAC counts as two operations (multiplication + addition). In the case of integer data, the more correct term would be GOPS.

| type   | ZA tiles | GFLOPS   |
|--------|----------|----------|
| f32    | 4 (full) | 2005.3   |
| f32    | 3        | 1503.02  |
| f32    | 2        | 1003.15  |
| f32    | 1        | 500.63   |
| f64    | 8 (full) | 501.73   |
| f64    | 7        | 500.8    |
| f64    | 6        | 501.1    |
| f64    | 5        | 501.38   |
| f64    | 4        | 500.39   |
| f64    | 3        | 375.73   |
| f64    | 2        | 250.96   |
| f64    | 1        | 125.43   |
| f16f32 | 4 (full) | 4016.73  |
| f16f32 | 3        | 3008.5   |
| f16f32 | 2        | 2007.93  |
| f16f32 | 1        | 1003.95  |
| i16i32 | 4 (full) | 4014.67  |
| i16i32 | 3        | 4015.98  |
| i16i32 | 2        | 4015.35  |
| i16i32 | 1        | 2001.18  |
| i8i32  | 4 (full) | 16047.31 |
| i8i32  | 3        | 16066.17 |
| i8i32  | 2        | 16059.66 |
| i8i32  | 1        | 8035.97  |

The maximum matrix multiplication throughput achievable on M4 SME is 16 TOPS, reached when accumulating i8 values into an i32 tile. The maximum floating-point throughput is 4 TFLOPS when working with f16. The SME unit can also sustain 0.5 TFLOPS of double-precision matrix multiplication. I have yet to test the 16-bit brain floating point format (BFLOAT16 or BF16).

The most important result of these tests is that we need at least four instructions accumulating into different ZA tiles in order to reach the SME unit's peak performance. The only exception is the widening integer instructions, which can achieve peak performance with only two instructions. It is also notable that integer multiplication and floating-point multiplication run at the same rate.

SME vector FMA performance

Working with vectors on M4 SME is not straightforward, as we need to utilize the ZA storage for the best performance. SME vector instructions use two or four register pairs (vector groups or VGs) as operands and accumulate the multiplication results into multiple strided slices of the ZA storage. One particular challenge is specifying the output slices. SME uses a tricky addressing mode: the initial slice is computed as <base-reg> + offset, where <base-reg> is a 32-bit CPU register and offset is a value encoded in the instruction. Probably due to limited instruction encoding space, the range of usable registers and offsets is limited. Most ALU instructions can only use four base registers (w8-w11) and a three-bit offset (0-7). There are also widening instructions, such as FMLAL, where a lower-precision input is widened and written to twice as many output slices. One example is multiplying four pairs of 32-wide fp16 registers and accumulating them into eight 16-wide fp32 ZA slices. These instruction variants use only two bits for the offset, which then selects a pair of consecutive slices. There are a lot of architectural details that require the programmer's attention.
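As a concrete (and hedged) example of this addressing, the sketch below issues two VGx4 FMLA instructions whose destination slices are selected by w8 plus the encoded offsets 0 and 4. This is my reconstruction, not the generated benchmark code; it assumes streaming mode with ZA enabled and, as in the benchmark, leaves the Z register contents uninitialized.

```c
// Two SME2 multi-vector FMLA instructions, each multiplying four register
// pairs and accumulating into four strided ZA slices.
static inline void fmla_f32_vgx4(void)
{
    __asm__ volatile(
        "mov w8, wzr                                                  \n\t"  // base slice index = 0
        "fmla za.s[w8, 0, vgx4], { z0.s - z3.s },  { z4.s - z7.s }   \n\t"  // slices selected by w8 + 0
        "fmla za.s[w8, 4, vgx4], { z8.s - z11.s }, { z12.s - z15.s } \n\t"  // slices selected by w8 + 4
        ::: "x8", "memory");
}
```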

The table below shows the results of executing the MLA (multiply-accumulate) instruction for various data types with different numbers of SVE Z registers and ZA accumulator slices. The column "type" is the data type (two types are listed for widening operations); results for integer data will be added later. The column "VG" is the vector group size (two or four register pairs). The column "Z registers" shows the number of registers used per iteration of the microbenchmark loop for each of the two inputs. For example, 16 instructions with VG=4 mean that 64 registers are used for each input. These registers do not have to be unique (in fact, the benchmark reuses the same registers repeatedly). The column "ZA slices" is the number of unique accumulator slices written per loop iteration ('full' means that the entire ZA storage is used). Finally, "GFLOPS" is the measured compute rate, where a single MAC counts as two operations (multiplication + addition).

| type   | VG | Z registers | ZA slices | GFLOPS |
|--------|----|-------------|-----------|--------|
| f32    | 4  | 64, 64      | 64 (full) | 251.29 |
| f32    | 4  | 60, 60      | 60        | 251.24 |
| f32    | 4  | 56, 56      | 56        | 251.39 |
| f32    | 4  | 52, 52      | 52        | 251.27 |
| f32    | 4  | 48, 48      | 48        | 250.55 |
| f32    | 4  | 44, 44      | 44        | 251.39 |
| f32    | 4  | 40, 40      | 40        | 251.21 |
| f32    | 4  | 36, 36      | 36        | 251.44 |
| f32    | 4  | 32, 32      | 32        | 251.18 |
| f32    | 4  | 28, 28      | 28        | 251.3  |
| f32    | 4  | 24, 24      | 24        | 251.39 |
| f32    | 4  | 20, 20      | 20        | 251.11 |
| f32    | 4  | 16, 16      | 16        | 251.39 |
| f32    | 4  | 12, 12      | 12        | 251.43 |
| f32    | 4  | 8, 8        | 8         | 251.56 |
| f32    | 4  | 4, 4        | 4         | 125.59 |
| f32    | 2  | 64, 64      | 64 (full) | 251.21 |
| f32    | 2  | 62, 62      | 62        | 251.38 |
| f32    | 2  | 60, 60      | 60        | 251.42 |
| f32    | 2  | 58, 58      | 58        | 251.28 |
| f32    | 2  | 56, 56      | 56        | 251.34 |
| f32    | 2  | 54, 54      | 54        | 251.36 |
| f32    | 2  | 52, 52      | 52        | 251.36 |
| f32    | 2  | 50, 50      | 50        | 251.34 |
| f32    | 2  | 48, 48      | 48        | 251.41 |
| f32    | 2  | 46, 46      | 46        | 251.27 |
| f32    | 2  | 44, 44      | 44        | 251.36 |
| f32    | 2  | 42, 42      | 42        | 251.43 |
| f32    | 2  | 40, 40      | 40        | 251.21 |
| f32    | 2  | 38, 38      | 38        | 251.38 |
| f32    | 2  | 36, 36      | 36        | 251.32 |
| f32    | 2  | 34, 34      | 34        | 251.44 |
| f32    | 2  | 32, 32      | 32        | 251.47 |
| f32    | 2  | 30, 30      | 30        | 251.16 |
| f32    | 2  | 28, 28      | 28        | 251.37 |
| f32    | 2  | 26, 26      | 26        | 251.4  |
| f32    | 2  | 24, 24      | 24        | 251.23 |
| f32    | 2  | 22, 22      | 22        | 251.4  |
| f32    | 2  | 20, 20      | 20        | 251.42 |
| f32    | 2  | 18, 18      | 18        | 251.18 |
| f32    | 2  | 16, 16      | 16        | 251.21 |
| f32    | 2  | 14, 14      | 14        | 251.41 |
| f32    | 2  | 12, 12      | 12        | 251.37 |
| f32    | 2  | 10, 10      | 10        | 251.45 |
| f32    | 2  | 8, 8        | 8         | 251.29 |
| f32    | 2  | 6, 6        | 6         | 188.54 |
| f32    | 2  | 4, 4        | 4         | 125.7  |
| f32    | 2  | 2, 2        | 2         | 62.72  |
| f64    | 4  | 64, 64      | 64 (full) | 125.72 |
| f64    | 4  | 60, 60      | 60        | 125.64 |
| f64    | 4  | 56, 56      | 56        | 125.69 |
| f64    | 4  | 52, 52      | 52        | 125.33 |
| f64    | 4  | 48, 48      | 48        | 125.68 |
| f64    | 4  | 44, 44      | 44        | 125.67 |
| f64    | 4  | 40, 40      | 40        | 125.72 |
| f64    | 4  | 36, 36      | 36        | 125.61 |
| f64    | 4  | 32, 32      | 32        | 125.72 |
| f64    | 4  | 28, 28      | 28        | 125.73 |
| f64    | 4  | 24, 24      | 24        | 125.5  |
| f64    | 4  | 20, 20      | 20        | 125.72 |
| f64    | 4  | 16, 16      | 16        | 125.7  |
| f64    | 4  | 12, 12      | 12        | 125.69 |
| f64    | 4  | 8, 8        | 8         | 125.71 |
| f64    | 4  | 4, 4        | 4         | 62.57  |
| f64    | 2  | 64, 64      | 64 (full) | 125.67 |
| f64    | 2  | 62, 62      | 62        | 125.74 |
| f64    | 2  | 60, 60      | 60        | 125.71 |
| f64    | 2  | 58, 58      | 58        | 125.72 |
| f64    | 2  | 56, 56      | 56        | 125.76 |
| f64    | 2  | 54, 54      | 54        | 125.71 |
| f64    | 2  | 52, 52      | 52        | 125.72 |
| f64    | 2  | 50, 50      | 50        | 125.76 |
| f64    | 2  | 48, 48      | 48        | 125.8  |
| f64    | 2  | 46, 46      | 46        | 125.73 |
| f64    | 2  | 44, 44      | 44        | 125.78 |
| f64    | 2  | 42, 42      | 42        | 125.63 |
| f64    | 2  | 40, 40      | 40        | 125.76 |
| f64    | 2  | 38, 38      | 38        | 125.71 |
| f64    | 2  | 36, 36      | 36        | 125.78 |
| f64    | 2  | 34, 34      | 34        | 125.77 |
| f64    | 2  | 32, 32      | 32        | 125.64 |
| f64    | 2  | 30, 30      | 30        | 125.75 |
| f64    | 2  | 28, 28      | 28        | 125.7  |
| f64    | 2  | 26, 26      | 26        | 125.8  |
| f64    | 2  | 24, 24      | 24        | 125.78 |
| f64    | 2  | 22, 22      | 22        | 125.6  |
| f64    | 2  | 20, 20      | 20        | 125.78 |
| f64    | 2  | 18, 18      | 18        | 125.8  |
| f64    | 2  | 16, 16      | 16        | 125.82 |
| f64    | 2  | 14, 14      | 14        | 125.78 |
| f64    | 2  | 12, 12      | 12        | 125.51 |
| f64    | 2  | 10, 10      | 10        | 125.8  |
| f64    | 2  | 8, 8        | 8         | 125.76 |
| f64    | 2  | 6, 6        | 6         | 94.29  |
| f64    | 2  | 4, 4        | 4         | 62.89  |
| f64    | 2  | 2, 2        | 2         | 31.43  |
| f16f32 | 4  | 32, 32      | 64 (full) | 502.92 |
| f16f32 | 4  | 28, 28      | 56        | 502.76 |
| f16f32 | 4  | 24, 24      | 48        | 503.06 |
| f16f32 | 4  | 20, 20      | 40        | 503.24 |
| f16f32 | 4  | 16, 16      | 32        | 502.52 |
| f16f32 | 4  | 12, 12      | 24        | 503.14 |
| f16f32 | 4  | 8, 8        | 16        | 503.04 |
| f16f32 | 4  | 4, 4        | 8         | 251.29 |
| f16f32 | 2  | 32, 32      | 64 (full) | 501.42 |
| f16f32 | 2  | 30, 30      | 60        | 503.23 |
| f16f32 | 2  | 28, 28      | 56        | 503.08 |
| f16f32 | 2  | 26, 26      | 52        | 503.16 |
| f16f32 | 2  | 24, 24      | 48        | 502.66 |
| f16f32 | 2  | 22, 22      | 44        | 503.13 |
| f16f32 | 2  | 20, 20      | 40        | 503.03 |
| f16f32 | 2  | 18, 18      | 36        | 502.72 |
| f16f32 | 2  | 16, 16      | 32        | 502.45 |
| f16f32 | 2  | 14, 14      | 28        | 440.24 |
| f16f32 | 2  | 12, 12      | 24        | 377.34 |
| f16f32 | 2  | 10, 10      | 20        | 314.42 |
| f16f32 | 2  | 8, 8        | 16        | 251.5  |
| f16f32 | 2  | 6, 6        | 12        | 251.09 |
| f16f32 | 2  | 4, 4        | 8         | 251.6  |
| f16f32 | 2  | 2, 2        | 4         | 125.78 |

Vector operations are considerably slower than outer products, reaching only about 1/8 of the throughput. Since we know that the hardware is capable of delivering 2000 GFLOPS of single-precision MAC, the limitation must be in data movement. When computing an outer product, just two 16-wide registers can feed the entire 256-wide array of execution units. We would need 16x as much register file bandwidth to achieve the same hardware utilization for vector FMA, which is not achievable on current hardware. The peak compute rate is already reached with two VG=4 or four VG=2 FMLA instructions, meaning that the data bus is fully saturated by eight pairs of registers. Using fewer instructions results in proportionately lower rates. The notable exception is the widening fp16-to-fp32 FMLA variant with a vector group size of two, where we need eight instructions (or 16 input register pairs) to achieve the maximal throughput. It is unclear whether this is a bug in my code or idiosyncratic hardware behavior; the VG=4 variant of the same instruction already achieves the maximal rate with eight input register pairs.
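A back-of-the-envelope comparison of the operand bandwidth (under the assumed 16x16 MAC grid) makes the gap explicit:

$$
\frac{2 \times 256\ \text{inputs per 256 MACs (vector FMA)}}{2 \times 16\ \text{inputs per 256 MACs (outer product)}} = 16
$$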

On M4, using VG=4 instruction variants seems best for maximizing performance. The current hardware can reach the peak processing rate with only two such instructions, which could change on future hardware. SME looks like a complicated instruction set to me, and I can imagine that working with the ZA storage layout can be challenging.

References

  1. https://github.com/corsix/amx - description of Apple proprietary AMX extensions
  2. https://scalable.uni-jena.de/opt/sme/index.html - first look at M4 SME performance (by a team at Uni Jena)
  3. https://developer.arm.com/documentation/ddi0487/latest/ - ARM architecture manual, with description of SVE and SME
