The project's home page: https://github.com/marknelsonengineer/BenchMark (hosted by GitHub)
The source code is documented at: https://www2.hawaii.edu/~marknels/BenchMark (hosted by The University of Hawaiʻi at Mānoa)
BenchMark is a toolset for analysing the relative performance of various things that I'm interested in.
First, I'm shocked and thrilled at the performance of the Release-mode `memcpy()` and `memset()` functions from glibc. Look at these results:
Results of `memcpy()` in CPU ticks using different compilers & options
n | Release | MinSizeRel | Clang |
---|---|---|---|
8 | 1 | 13 | 11 |
16 | 2 | 13 | 9 |
32 | 2 | 13 | 12 |
64 | 6 | 13 | 12 |
128 | 4 | 19 | 10 |
256 | 4 | 22 | 15 |
512 | 12 | 27 | 28 |
1024 | 20 | 38 | 31 |
2048 | 38 | 67 | 56 |
4096 | 86 | 141 | 94 |
8192 | 172 | 485 | 179 |
16384 | 744 | 719 | 840 |
32768 | 1836 | 1957 | 1698 |
65536 | 3264 | 4240 | 3386 |
- Notice how, within each build, the timings stay about the same when `n` < 256.
- Notice how much more efficient the Release build is than Clang or MinSizeRel, especially at low `n`.
- The results of `memcpy()` are so good that I doubt whether I can improve the speed with hand-coded assembly.
- Results for `memset()` are very similar to `memcpy()` (not shown).
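For readers who want to try a similar measurement, here is a minimal sketch of timing `memcpy()` in CPU ticks. It is not the code BenchMark itself uses. The helper names `read_ticks()` and `time_memcpy()` are hypothetical, the tick source is assumed to be the x86 time-stamp counter read through `__rdtsc()` (with a nanosecond clock as a stand-in on other architectures), and the compiler barrier assumes GCC or Clang. Taking the minimum over many runs is one common way to reduce noise from interrupts and cache misses; other tools may average instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the CPU time-stamp counter (x86 only). */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Assumption: on non-x86 targets, monotonic nanoseconds stand in for ticks. */
static inline uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical helper: smallest tick count seen over `runs` copies of n bytes.
 * Returns UINT64_MAX if allocation fails or the copy does not match. */
static uint64_t time_memcpy(size_t n, int runs) {
    assert(n > 0 && runs > 0);

    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return UINT64_MAX;
    }
    memset(src, 0xA5, n);  /* touch both buffers so pages are mapped */
    memset(dst, 0x00, n);

    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = read_ticks();
        memcpy(dst, src, n);
        /* Compiler barrier (GCC/Clang syntax): keeps the copy from being
         * optimized away or moved past the second timestamp. */
        __asm__ volatile("" : : "r"(dst) : "memory");
        uint64_t elapsed = read_ticks() - start;
        if (elapsed < best) {
            best = elapsed;
        }
    }

    /* Sanity check: the destination must now equal the source. */
    if (memcmp(dst, src, n) != 0) {
        best = UINT64_MAX;
    }
    free(src);
    free(dst);
    return best;
}

/* Print one row per size, doubling from 8 to 65536 bytes as in the table. */
static void print_memcpy_table(int runs) {
    for (size_t n = 8; n <= 65536; n *= 2) {
        printf("%zu | %llu |\n", n, (unsigned long long)time_memcpy(n, runs));
    }
}
```

Calling `print_memcpy_table(1000)` from `main()` prints one row per size. The numbers include the overhead of reading the counter twice, so they will not match the table above exactly, and they will vary with the compiler flags listed in the build table.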
Build | Compiler | Options |
---|---|---|
Release | Statically linked gcc executable | -lstdc++ -fuse-ld=gold -march=native -mtune=native -Ofast -funroll-loops -static -mfma |
MinSizeRel | Dynamically linked gcc executable | -lstdc++ -fuse-ld=gold -march=native -mtune=native -Oz -mfma |
Clang | Dynamically linked Clang executable | -march=native |
The tests were performed on a MacBook Pro:
- Architecture: Coffee Lake – 9th Generation Intel Core
- Cache line size: 64 bytes