
gc: findHead scan optimization #3899

Draft · wants to merge 5 commits into base: dev from gc_find_head_optimization
Conversation

@HattoriHanzo031 commented Sep 8, 2023

Draft PR to continue the discussion with @dgryski and @aykevl about optimizing the findHead algorithm for large object allocations.
The idea is to scan the block state one byte at a time. The state byte is first XOR-ed with an all-tails byte, which turns every tail in the state byte into 0, and then bits.LeadingZeros8 gives the number of leading tails in the byte. If the XOR-ed byte is 0, the whole byte is tails and we can skip to the next one. For the first byte we first shift out the bits that are not part of the current object.
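A minimal, self-contained sketch of this byte-at-a-time idea (the 2-bit state encoding, the `0b10` tail value, and all names here are assumptions for illustration, not the PR's actual code):

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	stateBits     = 2             // assumed: two state bits per heap block
	blocksPerByte = 8 / stateBits // four block states packed in one byte
	allTails      = 0b10101010    // assumed tail state (0b10) in every slot
)

// leadingTails reports how many consecutive tail blocks occupy the
// high-order end of one packed state byte.
func leadingTails(stateByte uint8) int {
	x := stateByte ^ allTails // every tail 2-bit group becomes 00
	if x == 0 {
		return blocksPerByte // whole byte is tails: the scan can skip it
	}
	// Any non-tail state (free/head/mark) has at least one bit set after
	// the XOR, so leading zero bits / 2 == number of leading tail blocks.
	return bits.LeadingZeros8(x) / stateBits
}

func main() {
	fmt.Println(leadingTails(0b10101001)) // head in the low slot, 3 tails above → 3
	fmt.Println(leadingTails(0b10101010)) // all four slots are tails → 4
}
```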

@HattoriHanzo031 force-pushed the gc_find_head_optimization branch 2 times, most recently from e282884 to 565a24d on September 8, 2023
@dgryski (Member) commented Sep 8, 2023

I wonder if we want to leave an old copy of the code in place, disabled unless the runtime is built with runtime_asserts or something, just until we're confident this is always returning the same result. Also, I know you also wrote a 32-bit version -- is there a way we could have both of these available and decide at compile time?

Also, there is a similar forward scanning loop in findNext(). I wonder if that should also be tweaked.

@aykevl (Member) commented Sep 9, 2023

I need to take a closer look at the code still, but I like how it's kept small.
It's unfortunate that there's a ~100 byte increase in binary size for baremetal chips. Normally that would be a blocker for me but considering the performance benefits I'm not so sure.

Looking at the code, there appears to be a bit of duplication though: for example there are two comparisons against zero. I wonder if the code can be shortened (and thus also reduced in binary size) if the outer if is removed and it's all a single loop? It might also make the code a bit more readable.

> I wonder if we want to leave an old copy of the code disabled unless the runtime is built with runtime_asserts or something, just until we're confident this is always returning the same result.

I'd rather not have an old version lying around that isn't getting much testing. Such an old version is likely to bitrot. Instead, I'd prefer we do a ton of testing (corpus, etc), merge it, and if there is a bug we'll know it soon enough (it's easy to verify a potential bug by simply reverting the commit and seeing whether the bug goes away).

> Also, I know you wrote also a 32-bit version -- Is there a way we could have both of these available and decided at compile time?

Perhaps this is something for a future PR? Just to keep this PR focused: it's already providing a really nice performance improvement.
(Deciding at compile time is relatively easy: for example if you want to only do this on wasm then you can simply use if GOARCH == "wasm" - this check will be optimized away entirely).
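The compile-time dispatch described above could look roughly like the sketch below (the function names are made up; the point is that because `runtime.GOARCH` is a constant string, the comparison is folded at build time and the dead branch is eliminated entirely):

```go
package main

import (
	"fmt"
	"runtime"
)

// Hypothetical stand-ins for the two findHead implementations.
func findHeadByteScan() string  { return "byte-at-a-time" }
func findHeadBlockScan() string { return "block-at-a-time" }

// findHead picks a scan strategy from a compile-time constant.
// The compiler evaluates the condition at build time and drops the
// unused branch, so there is no runtime cost for the check.
func findHead() string {
	if runtime.GOARCH == "wasm" || runtime.GOARCH == "amd64" {
		return findHeadByteScan()
	}
	return findHeadBlockScan()
}

func main() {
	fmt.Println(findHead())
}
```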

@HattoriHanzo031 (Author) commented:

> Also, I know you wrote also a 32-bit version -- Is there a way we could have both of these available and decided at compile time?

A 32-bit version would require many more considerations around edge cases and platform differences, so I suspect it would grow much more complex than the 8-bit version, but I agree we can try to implement it later in another PR given the significant performance gain it could provide.

> Looking at the code, there appears to be a bit of duplication though: for example there are two comparisons against zero. I wonder if the code can be shortened (and thus also reduced in binary size) if the outer if is removed and it's all a single loop? It might also make the code a bit more readable.

It also looked to me like the first if could be avoided, and I tried to eliminate it, but the problem is that the first byte we read is a special case: it can also contain bits from a different object, so it must be handled differently. There still might be a way to do it that I didn't find.

Regarding the code size, I suspect it is due to the use of LeadingZeros8, which is implemented as a lookup table. I just built the compiler against a self-built LLVM so I can run the tests and see what I can do to reduce the size. I first have to confirm that this is the reason: if we are already using this lookup table somewhere else, it is probably reused (since it is defined in the library) and does not add to binary size, and besides, the lookup table is 256 bytes while the increase is only around 100. One idea I have is to check just 6 bits instead of 8: once we have found a byte that is not all tails, it must contain at least one block that is not 00 after the XOR. This would reduce the lookup table to 64 bytes.
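One way the 64-byte-table idea could work, as a hedged illustrative sketch (all names are hypothetical): since the caller has already ruled out the all-tails case, the XOR result is non-zero, so when the top 6 bits are all zero the answer is forced, and only the top 6 bits need a table lookup.

```go
package main

import (
	"fmt"
	"math/bits"
)

// tailTable maps the top 6 bits of a non-zero XOR-ed state byte to the
// number of leading tail blocks. 64 entries instead of 256.
var tailTable [64]uint8

func init() {
	for i := 1; i < 64; i++ {
		// i occupies the low 6 bits here; shifting into the 8-bit frame
		// subtracts 2 from the leading-zero count before halving.
		tailTable[i] = uint8((bits.LeadingZeros8(uint8(i)) - 2) / 2)
	}
}

// leadingTails6 counts leading tail blocks in x = stateByte ^ allTails,
// assuming the caller already ruled out x == 0 (the all-tails case).
func leadingTails6(x uint8) int {
	top := x >> 2
	if top == 0 {
		// x is non-zero, so only the lowest 2-bit group is set:
		// the three high groups are all tails.
		return 3
	}
	return int(tailTable[top])
}

func main() {
	fmt.Println(leadingTails6(0b00000011)) // head in the low group → 3
	fmt.Println(leadingTails6(0b10000000)) // non-tail in the top group → 0
}
```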

@HattoriHanzo031 (Author) commented:

@aykevl you were right, it can be simplified; I just needed to subtract the number of blocks from other objects at the end :)
Unfortunately, it only saves 4 bytes, but that's better than nothing, and the code is much clearer.

@aykevl (Member) commented Sep 10, 2023

> Regarding the code size, I suspect it is due to use of LeadingZeros8 which is implemented as a lookup table.

It is not. That is the software fallback, but in fact most of the math/bits functions are implemented as special compiler intrinsics. See: #3620

On armv7m, LLVM emits a single instruction: https://godbolt.org/z/5aeG8bPfc
On armv6m (the microbit, for example), LLVM emits a library call, which is probably the main reason for the big increase in binary size: https://godbolt.org/z/YjrqE56Yz. Perhaps it would be a good idea to fall back to the old code on armv6m, AVR, and possibly other architectures that don't have clz?

@HattoriHanzo031 (Author) commented:

> It is not. That is the software fallback, but in fact most of the math/bits functions are implemented as special compiler intrinsics. See: #3620

Thanks for the info, it is good to know.

> Perhaps it would be a good idea to fall back to the old code on armv6m, AVR, and possibly other architectures that don't have clz?

It probably makes sense; on those targets there will probably not be a large number of heap allocations anyway.

@dgryski (Member) commented Sep 19, 2024

Any outstanding comments here? Is this ready to review/merge? Does it need another pass or implementation attempt? @HattoriHanzo031 If you don't have time, I can probably pick this up and get it across the finish line.

@HattoriHanzo031 (Author) commented:

@dgryski sorry for the delay. I rebased the branch and tried to run some tests and benchmarks on an embedded target. I built a test program that creates some heap variables in a loop and ran it on an rp2040 board. I also added toggling of a pin at the start and end of each findHead call and captured it on a logic analyser, so I have the exact duration of the call. I will use this over the weekend to look at the results and maybe tweak the algorithm.

@dgryski (Member) commented Sep 20, 2024

Here are some numbers from https://github.com/dgryski/trifles/tree/master/tinyjsonbench

   4.0  +
        |
        |
        |                   *
        |
   3.8  +                                        *
        |
        |
        |
   3.6  +                   |                    |
        |         +-------------------+          |
        |         |         *         |          |
        |         +-------------------++-------------------+
   3.4  +                   |          |         *         |
        |                              +-------------------+
        |                              +-------------------+
        |                                        |
   3.2  +--------------------------------------------------------------
                          dev.s                opt.s

File   N   Mean  Stddev
dev.s  20  3.51  0.11    (control)
opt.s  20  3.39  0.12    (3.39 < 3.51 ± 0.07, p = .003)

So we can see an improvement of about 10ms or about 3%. (Data analysis via https://github.com/codahale/tinystat)

@HattoriHanzo031 (Author) commented Sep 23, 2024

@dgryski Here are some results from benchmarking. findHead is on average around 2 times faster. It is only slower for very short searches, and the longer the search, the bigger the improvement. Here are comparisons; one channel is the duration of findHead and the second channel is the duration of runGC.
seeed XIAO (SAMD21) board:
dev branch: [screenshot: xiao_dev]
PR branch: [screenshot: xiao_pr]
nice!nano board:
dev branch: [screenshot: nano_dev]
PR branch: [screenshot: nano_pr]

I also ran both algorithms side by side (also comparing their outputs): [screenshot: nano_combined]

One strange thing I noticed is that on the RP2040 board runGC is slower with the new algorithm, although the findHead invocations took significantly less time:
dev branch: [screenshot: zero_dev_2]
PR branch: [screenshot: zero_pr_2]

I still haven't figured out the cause, since the only change is in the findHead function and it runs (on average) much faster and returns the same results. Maybe @aykevl has an explanation.
