
gc: findHead scan optimization #3899

Draft · wants to merge 5 commits into base: dev from gc_find_head_optimization
Conversation

@HattoriHanzo031 commented Sep 8, 2023

Draft PR to continue the discussion with @dgryski and @aykevl about optimizing the findHead algorithm for large object allocations.
The idea is to scan the block state one byte at a time. The state byte is first XOR-ed with an all-tails byte, which turns every tail in the state byte into 0, and then bits.LeadingZeros8 gives the number of leading tails in the byte. If the XOR-ed byte is 0, the whole byte is tails and we can skip to the next one. For the first byte we first shift out the bits that are not part of the current object.
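A minimal, self-contained sketch of this byte-at-a-time idea (the 2-bit state encoding, the `0b10` tail value, and all names here are assumptions for illustration, not the PR's actual code):

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	stateBits     = 2             // assumed: two state bits per heap block
	blocksPerByte = 8 / stateBits // four block states packed in one byte
	allTails      = 0b10101010    // assumed tail state (0b10) in every slot
)

// leadingTails reports how many consecutive tail blocks occupy the
// high-order end of one packed state byte.
func leadingTails(stateByte uint8) int {
	x := stateByte ^ allTails // every tail 2-bit group becomes 00
	if x == 0 {
		return blocksPerByte // whole byte is tails: the scan can skip it
	}
	// Any non-tail state (free/head/mark) has at least one bit set after
	// the XOR, so leading zero bits / 2 == number of leading tail blocks.
	return bits.LeadingZeros8(x) / stateBits
}

func main() {
	fmt.Println(leadingTails(0b10101001)) // head in the low slot, 3 tails above → 3
	fmt.Println(leadingTails(0b10101010)) // all four slots are tails → 4
}
```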

@HattoriHanzo031 force-pushed the gc_find_head_optimization branch 2 times, most recently from e282884 to 565a24d on September 8, 2023
@dgryski (Member) commented Sep 8, 2023

I wonder if we want to leave an old copy of the code in place, disabled unless the runtime is built with runtime_asserts or something, just until we're confident this is always returning the same result. Also, I know you also wrote a 32-bit version -- is there a way we could have both of these available and decide at compile time?

Also, there is a similar forward scanning loop in findNext(). I wonder if that should also be tweaked.

@aykevl (Member) commented Sep 9, 2023

I need to take a closer look at the code still, but I like how it's kept small.
It's unfortunate that there's a ~100 byte increase in binary size for baremetal chips. Normally that would be a blocker for me but considering the performance benefits I'm not so sure.

Looking at the code, there appears to be a bit of duplication though: for example there are two comparisons against zero. I wonder if the code can be shortened (and thus also reduced in binary size) if the outer if is removed and it's all a single loop? It might also make the code a bit more readable.

> I wonder if we want to leave an old copy of the code disabled unless the runtime is built with runtime_asserts or something, just until we're confident this is always returning the same result.

I'd rather not have an old version lying around that isn't getting much testing. Such an old version is likely to bitrot. Instead, I'd prefer we do a ton of testing (corpus, etc), merge it, and if there is a bug we'll know it soon enough (it's easy to verify a potential bug by simply reverting the commit and seeing whether the bug goes away).

> Also, I know you wrote also a 32-bit version -- Is there a way we could have both of these available and decided at compile time?

Perhaps this is something for a future PR? Just to keep this PR focused: it's already providing a really nice performance improvement.
(Deciding at compile time is relatively easy: for example if you want to only do this on wasm then you can simply use if GOARCH == "wasm" - this check will be optimized away entirely).
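The compile-time dispatch described above could look roughly like the sketch below (the function names are made up; the point is that because `runtime.GOARCH` is a constant string, the comparison is folded at build time and the dead branch is eliminated entirely):

```go
package main

import (
	"fmt"
	"runtime"
)

// Hypothetical stand-ins for the two findHead implementations.
func findHeadByteScan() string  { return "byte-at-a-time" }
func findHeadBlockScan() string { return "block-at-a-time" }

// findHead picks a scan strategy from a compile-time constant.
// The compiler evaluates the condition at build time and drops the
// unused branch, so there is no runtime cost for the check.
func findHead() string {
	if runtime.GOARCH == "wasm" || runtime.GOARCH == "amd64" {
		return findHeadByteScan()
	}
	return findHeadBlockScan()
}

func main() {
	fmt.Println(findHead())
}
```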

@HattoriHanzo031 (Author) commented:

> Also, I know you wrote also a 32-bit version -- Is there a way we could have both of these available and decided at compile time?

A 32-bit version would require many more considerations around edge cases and platform differences, so I suspect it would grow much more complex than the 8-bit version, but I agree we can try to implement it later in another PR given the significant performance gain it could provide.

> Looking at the code, there appears to be a bit of duplication though: for example there are two comparisons against zero. I wonder if the code can be shortened (and thus also reduced in binary size) if the outer if is removed and it's all a single loop? It might also make the code a bit more readable.

It also looked to me like the first if could be avoided, and I tried to eliminate it, but the problem is that the first byte we read is a special case: it can also contain bits from a different object, so it must be handled differently. There still might be a way to do it that I didn't find.

Regarding the code size, I suspect it is due to the use of LeadingZeros8, which is implemented as a lookup table. I just built the compiler against a self-built LLVM so I can run the tests and see what I can do to reduce the size. I first have to confirm that this is the reason: if we are already using this lookup table somewhere else, it is probably reused (since it is defined in the library) and does not add to binary size, and besides, the lookup table is 256 bytes while the increase is only around 100. One idea I have is to check just 6 bits instead of 8: once we have found a byte that is not all tails, it must contain at least one block that is not 00 after the XOR. This would reduce the lookup table to 64 bytes.
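One way the 64-byte-table idea could work, as a hedged illustrative sketch (all names are hypothetical): since the caller has already ruled out the all-tails case, the XOR result is non-zero, so when the top 6 bits are all zero the answer is forced, and only the top 6 bits need a table lookup.

```go
package main

import (
	"fmt"
	"math/bits"
)

// tailTable maps the top 6 bits of a non-zero XOR-ed state byte to the
// number of leading tail blocks. 64 entries instead of 256.
var tailTable [64]uint8

func init() {
	for i := 1; i < 64; i++ {
		// i occupies the low 6 bits here; shifting into the 8-bit frame
		// subtracts 2 from the leading-zero count before halving.
		tailTable[i] = uint8((bits.LeadingZeros8(uint8(i)) - 2) / 2)
	}
}

// leadingTails6 counts leading tail blocks in x = stateByte ^ allTails,
// assuming the caller already ruled out x == 0 (the all-tails case).
func leadingTails6(x uint8) int {
	top := x >> 2
	if top == 0 {
		// x is non-zero, so only the lowest 2-bit group is set:
		// the three high groups are all tails.
		return 3
	}
	return int(tailTable[top])
}

func main() {
	fmt.Println(leadingTails6(0b00000011)) // head in the low group → 3
	fmt.Println(leadingTails6(0b10000000)) // non-tail in the top group → 0
}
```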

@HattoriHanzo031 (Author) commented:

@aykevl you were right, it can be simplified; I just needed to subtract the number of blocks from other objects at the end :)
Unfortunately, it only saves 4 bytes, but that's better than nothing, and the code is much clearer.

@aykevl (Member) commented Sep 10, 2023

> Regarding the code size, I suspect it is due to use of LeadingZeros8 which is implemented as a lookup table.

It is not. That is the software fallback, but in fact most of the math/bits functions are implemented as special compiler intrinsics. See: #3620

On armv7m, LLVM emits a single instruction: https://godbolt.org/z/5aeG8bPfc
On armv6m (the microbit, for example), LLVM emits a library call, which is probably the main reason for the big increase in binary size: https://godbolt.org/z/YjrqE56Yz. Perhaps it would be a good idea to fall back to the old code on armv6m, AVR, and possibly other architectures that don't have clz?

@HattoriHanzo031 (Author) commented:

> It is not. That is the software fallback, but in fact most of the math/bits functions are implemented as special compiler intrinsics. See: #3620

Thanks for the info, it is good to know.

> Perhaps it would be a good idea to fall back to the old code on armv6m, AVR, and possibly other architectures that don't have clz?

It probably makes sense; on those targets there will probably not be a large number of heap allocations anyway.

@dgryski (Member) commented Sep 19, 2024

Any outstanding comments here? Is this ready to review/merge? Does it need another pass or implementation attempt? @HattoriHanzo031 If you don't have time, I can probably pick this up and get it across the finish line.

@HattoriHanzo031 (Author) commented:

@dgryski sorry for the delay. I rebased the branch and tried to run some tests and benchmarks on an embedded target. I built a test program that creates some heap variables in a loop and ran it on an rp2040 board. I also added toggling of a pin at the start and end of each findHead call and captured it on a logic analyser, so I have the exact duration of the call. I will use this over the weekend to look at the results and maybe tweak the algorithm.

@dgryski (Member) commented Sep 20, 2024

Here are some numbers from https://github.com/dgryski/trifles/tree/master/tinyjsonbench

   4.0  +
        |
        |
        |                   *
        |
   3.8  +                                        *
        |
        |
        |
   3.6  +                   |                    |
        |         +-------------------+          |
        |         |         *         |          |
        |         +-------------------++-------------------+
   3.4  +                   |          |         *         |
        |                              +-------------------+
        |                              +-------------------+
        |                                        |
   3.2  +--------------------------------------------------------------
                          dev.s                opt.s

File   N   Mean  Stddev
dev.s  20  3.51  0.11    (control)
opt.s  20  3.39  0.12    (3.39 < 3.51 ± 0.07, p = .003)

So we can see an improvement of about 10ms or about 3%. (Data analysis via https://github.com/codahale/tinystat)

@HattoriHanzo031 (Author) commented Sep 23, 2024

@dgryski Here are some results from benchmarking. findHead is on average around 2 times faster. It is only slower for very short searches, and the longer the search, the bigger the improvement. Here are comparisons; one channel is the duration of findHead and the second channel is the duration of runGC.
seeed XIAO (SAMD21) board:
dev branch: [screenshot: xiao_dev]
PR branch: [screenshot: xiao_pr]
nice!nano board:
dev branch: [screenshot: nano_dev]
PR branch: [screenshot: nano_pr]

I also ran both algorithms side by side (also comparing their outputs): [screenshot: nano_combined]

One strange thing I noticed is that on the RP2040 board runGC is slower with the new algorithm, although the findHead invocations took significantly less time:
dev branch: [screenshot: zero_dev_2]
PR branch: [screenshot: zero_pr_2]

I still haven't figured out the cause, since the only change is in the findHead function and it runs (on average) much faster and returns the same results. Maybe @aykevl has an explanation.
