Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Big performance regression between 0.31 and 0.32 #914

Closed
tomaka opened this issue Jan 29, 2024 · 7 comments
Closed

Big performance regression between 0.31 and 0.32 #914

tomaka opened this issue Jan 29, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@tomaka
Copy link
Contributor

tomaka commented Jan 29, 2024

Calling the Metadata_metadata runtime function of the latest Westend runtime with wasmi 0.31 takes around 750ms.

Calling the same runtime function with wasmi 0.32 takes around 4.5 seconds in CompilationMode::Eager. With CompilationMode::Lazy, it's around 5 to 7 seconds.

@tomaka
Copy link
Contributor Author

tomaka commented Jan 29, 2024

Take this with a grain of salt, but I wouldn't be surprised if this was the "runtime inefficiencies" in the "rare case" that you mention in #860

The way Metadata_metadata is implemented in Rust is that there's a trait implementation on tons of structs of the runtime code, and Metadata_metadata calls a function on this trait implementation of every single struct.
For this reason, I suspect that Metadata_metadata calls tons of small functions that are spread out everywhere within the Wasm.

@Robbepop
Copy link
Member

Robbepop commented Jan 29, 2024

@tomaka Thanks for reporting!

Calling many very small functions is the worst case situation for the Wasmi (register) implementation. Wasmi (stack) actually performs better for many small function calls. Aggressive inlining could maybe help if this indeed is the problem so it is worth a try if possible. The linked relaxed br+cmp encoding shouldn't play a big role here since it isn't used for calls and only plays a role for ultra large Wasm functions that have jumps within the same function over very long distances (>32k Wasmi instruction words). It would be great if you could get the Wasm input into a minimize example that has similar performance behaviors but I suppose this is going to be very tricky to extract.

I agree that a 6x performance regression (750ms vs 4.5s) is not reasonable.

@Robbepop Robbepop added the bug Something isn't working label Jan 29, 2024
@Robbepop
Copy link
Member

@tomaka The exact Wasm blob that was used and ideally a trace of executed instructions might be helpful if you have them.

@tomaka
Copy link
Contributor Author

tomaka commented Jan 30, 2024

runtime.hex.txt

Here's the runtime in hexadecimal.

@Robbepop
Copy link
Member

Robbepop commented Jan 31, 2024

@tomaka

I just noticed that the given westend.wasm binary was apparently not properly optimized using wasm-opt (version 116). I was under the assumption that Polkadot runtimes are always optimized with this tool since it really improves performance and especially binary size quite a bit.

7.5M Jan 31 13:14 westend.wasm
5.9M Jan 31 13:18 westend.opt.o3.wasm  ;; compiled with -O3
5.5M Jan 31 13:23 westend.opt.oz.wasm  ;; compiled with -Oz

It could be worth a try to see if this westend.opt.oX.wasm does not have this performance issue with Wasmi. The reason for this is that Wasmi relies on proper optimizations to run at full speed since (unlike for example Wasmtime) it cannot do its own proper optimizations due to its linear time compilation promise.

Besides that I found a new opportunity to improve performance: #920
However, I have absolutely no clue how much this will really affect the performance of westend.wasm or execution in general and I am pretty sure that this is not the main offender with respect to performance here.

@Robbepop
Copy link
Member

Robbepop commented Feb 2, 2024

While inspecting the westend.wat file I saw this:

(func $memcpy (;8628;) (type 5) (param i32 i32 i32) (result i32)
  local.get 0
  local.get 1
  local.get 2
  call $_ZN17compiler_builtins3mem6memcpy17h7c9cbc967e51fdc6E
)
(func $memset (;8629;) (type 5) (param i32 i32 i32) (result i32)
  local.get 0
  local.get 1
  local.get 2
  call $_ZN17compiler_builtins3mem6memset17h7a0db38d43a2ca78E
)

Which indicates to me that the Westend runtime is compiled without the Wasm bulk-memory proposal although it has been evaluated in the past that this may yield serious performance improvements: paritytech/polkadot-sdk#36

Especially for Wasmi being an interpreter the speed-ups gained could be even way more massive since the built-in memcpy and memset instructions in bulk-memory are not themselves interpreted but run natively via Wasmi. So we should test if enabling bulk-memory Wasm proposal fixes the performance issues reported in this issue.

cc @tomaka

@tomaka
Copy link
Contributor Author

tomaka commented Feb 6, 2024

After a relatively long investigation and a lot of DMs, this turned out to be caused by debug-assertions = true in the Cargo.toml.
Disabling debug assertions is critical in order to get decent performances.

@tomaka tomaka closed this as not planned Won't fix, can't repro, duplicate, stale Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants