-
Notifications
You must be signed in to change notification settings - Fork 900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gc: findHead scan optimization #3899
base: dev
Are you sure you want to change the base?
gc: findHead scan optimization #3899
Conversation
e282884
to
565a24d
Compare
I wonder if we want to leave an old copy of the code disabled unless the runtime is built with Also, there is a similar forward scanning loop in |
9a315e8
to
565a24d
Compare
I need to take a closer look at the code still, but I like how it's kept small. Looking at the code, there appears to be a bit of duplication though: for example there are two comparisons against zero. I wonder if the code can be shortened (and thus also reduced in binary size) if the outer
I'd rather not have an old version lying around that isn't getting much testing. Such an old version is likely to bitrot. Instead, I'd prefer we do a ton of testing (corpus, etc), merge it, and if there is a bug we'll know it soon enough (it's easy to verify a potential bug by simply reverting the commit and seeing whether the bug goes away).
Perhaps this is something for a future PR? Just to keep this PR focused: it's already providing a really nice performance improvement. |
32bit version would require much more considerations regarding edge cases and platform differences so I suspect it would grow much more complex than 8bit version, but I agree we can try to implement it later in another PR because of significant performance gain it could provide.
It also looked to me that first Regarding the code size, I suspect it is due to use of |
@aykevl you were right, it can be simplified, I just needed to subtract number of blocks from other objects at the end :) |
It is not. That is the software fallback, but in fact most of the math/bits functions are implemented as special compiler intrinsics. See: #3620 On armv7m LLVM emits a single instruction: https://godbolt.org/z/5aeG8bPfc |
Thanks for the info, it is good to know.
It probably makes sense, on those targets there will probably not be big amount of heap allocations anyway. |
Any outstanding comments here? Is this ready to review/merge? Does it need another pass or implementation attempt? @HattoriHanzo031 If you don't have time, I can probably pick this up and get it across the finish line. |
16f759a
to
47d0cb1
Compare
@dgryski sorry for the delay. I rebased the branch and tried to run some tests and benchmarks on embedded target. I built test program that creates some heap variables in the loop and ran it on rp2040 board. I also added toggling of one pin at the start and the end of the |
Here are some numbers from https://github.com/dgryski/trifles/tree/master/tinyjsonbench
So we can see an improvement of about 10ms or about 3%. ( Data analysis via https://github.com/codahale/tinystat ) |
@dgryski Here are some results from benchmarking. I also ran both algorithms side by side (also comparing the output): One strange thing I noticed is that on RP-2040 board Still didn't figure out what is the cause for this since only change is in the |
Draft PR to continue discussion with @dgryski and @aykevl about optimizing findHead algorithm in case of large object allocation.
The idea is to scan the block state byte at the time. State byte is first XOR-ed with all tails byte which turns all tails in state byte to 0 and then bits.LeadingZeros8 is used to get how many tails are in the byte. Also, if XOR-ed byte is 0 that means it is all tails so we can skip to the next. For the first byte we first shift out the bits that are not part of current object.