marcellofuschi/metal-matmul-kernel-optimization

Optimizing a Metal Matmul Kernel

siboehm's post explains how to iteratively improve the performance of a CUDA kernel for matrix multiplication.

This repo contains a reimplementation of those kernels (not all of them yet) in Metal, Apple's GPU compute API.
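For orientation, here is a minimal sketch of the computation that every kernel in the series performs, written in plain Python rather than Metal. It is not code from this repo: the function name `naive_matmul` and the list-of-lists layout are assumptions made for illustration. In a naive GPU kernel, each (row, col) pair of the two outer loops would typically be handled by its own thread.

```python
def naive_matmul(a, b):
    """Reference product C = A @ B for row-major lists of lists.

    Hypothetical helper for illustration only; not part of this repo.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions must match")
    n = len(b[0])

    c = [[0.0] * n for _ in range(m)]
    for row in range(m):          # on a GPU: one thread per (row, col)
        for col in range(n):
            acc = 0.0
            for i in range(k):    # dot product of A's row and B's column
                acc += a[row][i] * b[i][col]
            c[row][col] = acc
    return c


if __name__ == "__main__":
    print(naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

A slow reference like this is mainly useful for checking the output of an optimized kernel on small matrices.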

Running a kernel

./src/run.py

Performance

Performance on M1 Pro:

| Kernel                   | GFLOP/s |
| ------------------------ | ------- |
| 1: Naive                 | 20      |
| 2: GMEM Coalescing       | 280     |
| 3: SMEM Caching          | -       |
| 4: 1D Blocktiling        | -       |
| 5: 2D Blocktiling        | -       |
| 6: Vectorized Mem Access | -       |
| 9: Autotuning            | -       |
| 10: Warptiling           | -       |
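Throughput figures like these are conventionally derived from the operation count of a matrix product: multiplying an M x K matrix by a K x N matrix takes about 2 * M * N * K floating-point operations (one multiply and one add per inner-loop step). The sketch below shows that conversion; the function name `matmul_gflops` and the example sizes and timing are assumptions for illustration, not values or code taken from this repo.

```python
def matmul_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in GFLOP/s for an (m x k) @ (k x n) product.

    Hypothetical helper: assumes the usual 2*m*n*k operation count.
    """
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    flops = 2 * m * n * k          # one multiply + one add per inner step
    return flops / seconds / 1e9   # convert FLOP/s to GFLOP/s


if __name__ == "__main__":
    # Illustrative numbers only: a 4096^3 product timed at 0.5 s.
    print(f"{matmul_gflops(4096, 4096, 4096, 0.5):.1f} GFLOP/s")
```

When timing a GPU kernel, it is common to average over several runs after a warm-up run so that one-time setup costs do not skew the result.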
