perf: keep the heap pointer in a dedicated Wasm global (#4064)
We now optimise the bump-pointer allocator by keeping the heap pointer (`HP`) in a Wasm global, so that the Motoko side can access it directly. For fixed-size allocations this lets us detect a possible page crossing by examining a few of the lower bits of the bumped pointer (comparing them to zero). If a crossing may have occurred, we only need to "commit" the new page by calling `alloc_words 0`. Accessor functions for `HP` are provided for the Rust side.
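
A minimal Rust sketch of the idea, as a toy model rather than the actual codegen output. `Heap`, `alloc_bytes` and `alloc_fixed` are hypothetical names, the mask is taken as given, pointer skewing is left out, and "committing" a page is only simulated with a counter:

```rust
const PAGE_SIZE: u32 = 1 << 16; // Wasm page size: 64 KiB

/// Toy model of the allocator state (all names are hypothetical).
struct Heap {
    hp: u32,              // stands in for the Wasm global holding the heap pointer
    committed_pages: u32, // number of pages currently backed by memory
    slow_calls: u32,      // how often the slow path ran
}

impl Heap {
    /// Slow path, modelled on `alloc_words`: bump HP and grow memory if needed.
    /// With `bytes == 0` it only commits the pages HP already reaches into.
    fn alloc_bytes(&mut self, bytes: u32) -> u32 {
        self.slow_calls += 1;
        let old_hp = self.hp;
        self.hp += bytes;
        let pages_needed = (self.hp + PAGE_SIZE - 1) / PAGE_SIZE;
        if pages_needed > self.committed_pages {
            self.committed_pages = pages_needed;
        }
        old_hp
    }

    /// Fast path for a fixed `size` in bytes (below half a page) and its
    /// precomputed `mask`: bump HP, then test the masked bits.
    fn alloc_fixed(&mut self, size: u32, mask: u32) -> u32 {
        let object = self.hp;
        self.hp += size;
        if self.hp & mask == 0 {
            // A page boundary may have been crossed: commit the new page.
            self.alloc_bytes(0);
        }
        object
    }
}
```

In this model, allocating 16-byte objects with mask `0xFFE0` takes the slow path only a couple of times per 64 KiB page; every other allocation is an add, a mask and a branch.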

### How it works

We construct a mask at compile time to detect page-boundary crossings of the bumped pointer:
- take a (byte) increment of the form `0b1xxxxx` (where the `xxxxx` part is arbitrary)
- that is `< 0x8000` (so that bit 15 is clear, counting the least significant bit as bit 0)
- treat it as a 16-bit unsigned integer
- count its leading zeros; call this `N` (it will be `>= 1` and `< 16`)
- the 32-bit HP mask will then be `0b000000000000000011..100..0 : u32`, that is 16 upper zeros, then `N` ones, followed by `16-N` zeros
- the bumped HP is masked with this number; if the result is zero, the addition that bumped the HP may have caused a rippling carry, meaning a page boundary may have been crossed (the check is conservative: a non-zero result guarantees that no crossing happened)
- in this case we have to make sure that the new page is physically backed, so we call `Heap.alloc 0` (and discard the result).
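
The steps above can be sketched in Rust as follows. This follows the description rather than the OCaml in `compile.ml` line by line; `overflow_mask` borrows the name used in the diff, while `crossed_page` is a hypothetical helper that exists only to state the property being relied on:

```rust
/// Mask for a byte increment `n` with `0 < n < 0x8000`.
fn overflow_mask(n: u32) -> u32 {
    assert!(n > 0 && n < 0x8000);
    // N: leading zeros of `n` viewed as a 16-bit integer (between 1 and 15).
    let leading_zeros = (n as u16).leading_zeros();
    // Low 16 bits select the offset within a 64 KiB page.
    let page_mask: u32 = 0xFFFF;
    // Keep the top N bits of the page offset, clear the lower 16-N bits.
    page_mask & (u32::MAX << (16 - leading_zeros))
}

/// Hypothetical reference check: did bumping `old_hp` by `n` move HP
/// into a different page?
fn crossed_page(old_hp: u32, n: u32) -> bool {
    (old_hp >> 16) != ((old_hp + n) >> 16)
}

/// The property the fast path relies on: whenever the bump changes pages,
/// the masked bits of the new HP are all zero.
fn mask_detects_crossing(old_hp: u32, n: u32) -> bool {
    !crossed_page(old_hp, n) || (old_hp + n) & overflow_mask(n) == 0
}
```

For example, a 16-byte increment gives `overflow_mask(16) == 0xFFE0`. The converse does not hold: a zero result can occur without a crossing, which only costs a redundant `alloc 0` call.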

Some benchmarks give cycle savings of 10% 😵‍💫💪!

_Note_: Currently the change is only active for _classical GC_; the incremental part will be tackled in #4078.

Further benefits:
- the accessor functions for HP (that now live on the `codegen` side) can be inlined by `wasm-opt`
- looks like `wasm-opt` (`-O3`) also specialises `call alloc_words 0; drop` to a very minimal kernel, speeding up the cold path too.

### Potential future optimisations
- keep HP as always _skewed_, then no more per-alloc `sub 1` is needed
- extend the mask to the right by one bit — https://github.com/dfinity/motoko/pull/4064/files/4eb3dfeb7fdc10737605c631edff33b3f0976eff#r1250623654
ggreif authored Jul 3, 2023
1 parent b3d1d21 commit f37f9a9
Showing 10 changed files with 108 additions and 38 deletions.
4 changes: 4 additions & 0 deletions Changelog.md
@@ -11,6 +11,10 @@
}
```

* Performance improvement: improved cycle consumption allocating fixed-size objects (#4064).
Benchmarks indicate up to 10% less cycles burned for allocation-heavy code,
and 2.5% savings in realistic applications.

## 0.9.4 (2023-07-01)

* motoko (`moc`)
4 changes: 2 additions & 2 deletions rts/motoko-rts/src/gc.rs
@@ -12,7 +12,7 @@ use motoko_rts_macros::*;
#[cfg(feature = "ic")]
#[non_incremental_gc]
unsafe fn should_do_gc(max_live: crate::types::Bytes<u64>) -> bool {
use crate::memory::ic::linear_memory::{HP, LAST_HP};
use crate::memory::ic::linear_memory::{getHP, LAST_HP};

// A factor of last heap size. We allow at most this much allocation before doing GC.
const HEAP_GROWTH_FACTOR: f64 = 1.5;
@@ -22,5 +22,5 @@ unsafe fn should_do_gc(max_live: crate::types::Bytes<u64>) -> bool {
(u64::from(LAST_HP) + max_live.0) / 2,
);

u64::from(HP) >= heap_limit
u64::from(getHP()) >= heap_limit
}
6 changes: 3 additions & 3 deletions rts/motoko-rts/src/gc/copying.rs
@@ -31,9 +31,9 @@ unsafe fn copying_gc<M: Memory>(mem: &mut M) {
mem,
ic::get_aligned_heap_base(),
// get_hp
|| linear_memory::HP as usize,
|| linear_memory::getHP() as usize,
// set_hp
|hp| linear_memory::HP = hp,
|hp| linear_memory::setHP(hp),
ic::get_static_roots(),
crate::continuation_table::continuation_table_loc(),
// note_live_size
@@ -42,7 +42,7 @@ unsafe fn copying_gc<M: Memory>(mem: &mut M) {
|reclaimed| linear_memory::RECLAIMED += Bytes(u64::from(reclaimed.as_u32())),
);

linear_memory::LAST_HP = linear_memory::HP;
linear_memory::LAST_HP = linear_memory::getHP();
}

pub unsafe fn copying_gc_internal<
4 changes: 2 additions & 2 deletions rts/motoko-rts/src/gc/generational.rs
@@ -85,14 +85,14 @@ unsafe fn get_limits() -> Limits {
Limits {
base: ic::get_aligned_heap_base() as usize,
last_free: linear_memory::LAST_HP as usize,
free: linear_memory::HP as usize,
free: linear_memory::getHP() as usize,
}
}

#[cfg(feature = "ic")]
unsafe fn set_limits(limits: &Limits) {
use crate::memory::ic::linear_memory;
linear_memory::HP = limits.free as u32;
linear_memory::setHP(limits.free as u32);
linear_memory::LAST_HP = limits.free as u32;
}

6 changes: 3 additions & 3 deletions rts/motoko-rts/src/gc/mark_compact.rs
@@ -46,9 +46,9 @@ unsafe fn compacting_gc<M: Memory>(mem: &mut M) {
mem,
ic::get_aligned_heap_base(),
// get_hp
|| linear_memory::HP as usize,
|| linear_memory::getHP() as usize,
// set_hp
|hp| linear_memory::HP = hp,
|hp| linear_memory::setHP(hp),
ic::get_static_roots(),
crate::continuation_table::continuation_table_loc(),
// note_live_size
@@ -57,7 +57,7 @@ unsafe fn compacting_gc<M: Memory>(mem: &mut M) {
|reclaimed| linear_memory::RECLAIMED += Bytes(u64::from(reclaimed.as_u32())),
);

linear_memory::LAST_HP = linear_memory::HP;
linear_memory::LAST_HP = linear_memory::getHP();
}

pub unsafe fn compacting_gc_internal<
17 changes: 10 additions & 7 deletions rts/motoko-rts/src/memory/ic/linear_memory.rs
@@ -6,15 +6,18 @@ use crate::types::*;
/// Amount of garbage collected so far.
pub(crate) static mut RECLAIMED: Bytes<u64> = Bytes(0);

/// Heap pointer
pub(crate) static mut HP: u32 = 0;
// Heap pointer
extern "C" {
pub(crate) fn setHP(new_hp: u32);
pub(crate) fn getHP() -> u32;
}

/// Heap pointer after last GC
pub(crate) static mut LAST_HP: u32 = 0;

pub(crate) unsafe fn initialize() {
HP = get_aligned_heap_base();
LAST_HP = HP;
setHP(get_aligned_heap_base());
LAST_HP = getHP();
}

#[no_mangle]
@@ -29,7 +32,7 @@ pub unsafe extern "C" fn get_total_allocations() -> Bytes<u64> {

#[no_mangle]
pub unsafe extern "C" fn get_heap_size() -> Bytes<u32> {
Bytes(HP - get_aligned_heap_base())
Bytes(getHP() - get_aligned_heap_base())
}

impl Memory for IcMemory {
@@ -39,7 +42,7 @@ impl Memory for IcMemory {
let delta = u64::from(bytes.as_u32());

// Update heap pointer
let old_hp = u64::from(HP);
let old_hp = u64::from(getHP());
let new_hp = old_hp + delta;

// Grow memory if needed
@@ -48,7 +51,7 @@ impl Memory for IcMemory {
}

debug_assert!(new_hp <= u64::from(core::u32::MAX));
HP = new_hp as u32;
setHP(new_hp as u32);

Value::from_ptr(old_hp as usize)
}
83 changes: 73 additions & 10 deletions src/codegen/compile.ml
@@ -1052,8 +1052,10 @@ module GC = struct
E.call_import env "ic0" "performance_counter"

let register_globals env =
(E.add_global64 env "__mutator_instructions" Mutable 0L;
E.add_global64 env "__collector_instructions" Mutable 0L)
E.add_global64 env "__mutator_instructions" Mutable 0L;
E.add_global64 env "__collector_instructions" Mutable 0L;
if !Flags.gc_strategy <> Flags.Incremental then
E.add_global32 env "_HP" Mutable 0l

let get_mutator_instructions env =
G.i (GlobalGet (nr (E.get_global env "__mutator_instructions")))
@@ -1065,6 +1067,17 @@ module GC = struct
let set_collector_instructions env =
G.i (GlobalSet (nr (E.get_global env "__collector_instructions")))

let get_heap_pointer env =
if !Flags.gc_strategy <> Flags.Incremental then
G.i (GlobalGet (nr (E.get_global env "_HP")))
else
assert false
let set_heap_pointer env =
if !Flags.gc_strategy <> Flags.Incremental then
G.i (GlobalSet (nr (E.get_global env "_HP")))
else
assert false

let record_mutator_instructions env =
match E.mode env with
| Flags.(ICMode | RefMode) ->
@@ -1116,8 +1129,11 @@ module Heap = struct
(* Static allocation (always words)
(uses dynamic allocation for smaller and more readable code) *)
let alloc env (n : int32) : G.t =
compile_unboxed_const n ^^
compile_unboxed_const n ^^
E.call_import env "rts" "alloc_words"

let ensure_allocated env =
alloc env 0l ^^ G.i Drop (* dummy allocation, ensures that the page HP points into is backed *)

(* Heap objects *)

@@ -1602,12 +1618,34 @@ module Tagged = struct
assert (!Flags.gc_strategy = Flags.Incremental);
1l

(* Note: Post allocation barrier must be applied after initialization *)
(* Note: post-allocation barrier must be applied after initialization *)
let alloc env size tag =
assert (size > 1l);
let name = Printf.sprintf "alloc_size<%d>_tag<%d>" (Int32.to_int size) (Int32.to_int (int_of_tag tag)) in
(* Computes a (conservative) mask for the bumped HP, so that the existence of non-zero bits under it
guarantees that a page boundary crossing didn't happen (i.e. no ripple-carry). *)
let overflow_mask n =
let n = Int32.to_int n in
let page_mask = Int32.sub page_size 1l in
Int32.(logand page_mask (shift_left minus_one (16 - Numerics.Nat16.(to_int (clz (of_int n)))))) in

Func.share_code0 env name [I32Type] (fun env ->
let (set_object, get_object) = new_local env "new_object" in
Heap.alloc env size ^^
let set_object, get_object = new_local env "new_object" in
let size_in_bytes = Int32.(mul size Heap.word_size) in
let half_page_size = Int32.div page_size 2l in
(if !Flags.gc_strategy <> Flags.Incremental && size_in_bytes < half_page_size then
GC.get_heap_pointer env ^^
compile_add_const ptr_skew ^^
GC.get_heap_pointer env ^^
compile_add_const size_in_bytes ^^
GC.set_heap_pointer env ^^
GC.get_heap_pointer env ^^
compile_bitand_const (overflow_mask size_in_bytes) ^^
G.if0
G.nop (* no page crossing *)
(Heap.ensure_allocated env) (* ensure that HP's page is allocated *)
else
Heap.alloc env size) ^^
set_object ^^ get_object ^^
compile_unboxed_const (int_of_tag tag) ^^
Heap.store_field tag_field ^^
@@ -5202,6 +5240,31 @@ module RTS_Exports = struct
edesc = nr (FuncExport (nr rts_trap_fi))
});

if !Flags.gc_strategy <> Flags.Incremental then
begin
let set_hp_fi =
E.add_fun env "__set_hp" (
Func.of_body env ["new_hp", I32Type] [] (fun env ->
G.i (LocalGet (nr 0l)) ^^
GC.set_heap_pointer env
)
) in
E.add_export env (nr {
name = Lib.Utf8.decode "setHP";
edesc = nr (FuncExport (nr set_hp_fi))
});

let get_hp_fi = E.add_fun env "__get_hp" (
Func.of_body env [] [I32Type] (fun env ->
GC.get_heap_pointer env
)
) in
E.add_export env (nr {
name = Lib.Utf8.decode "getHP";
edesc = nr (FuncExport (nr get_hp_fi))
})
end;

let stable64_write_moc_fi =
if E.mode env = Flags.WASIMode then
E.add_fun env "stable64_write_moc" (
@@ -5404,12 +5467,12 @@ module MakeSerialization (Strm : Stream) = struct

module Registers = struct
let register_globals env =
(E.add_global32 env "@@rel_buf_opt" Mutable 0l;
E.add_global32 env "@@rel_buf_opt" Mutable 0l;
E.add_global32 env "@@data_buf" Mutable 0l;
E.add_global32 env "@@ref_buf" Mutable 0l;
E.add_global32 env "@@typtbl" Mutable 0l;
E.add_global32 env "@@typtbl_end" Mutable 0l;
E.add_global32 env "@@typtbl_size" Mutable 0l)
E.add_global32 env "@@typtbl_size" Mutable 0l

let get_rel_buf_opt env =
G.i (GlobalGet (nr (E.get_global env "@@rel_buf_opt")))
@@ -6535,7 +6598,7 @@ module MakeSerialization (Strm : Stream) = struct
) ^^
get_x ^^
Tagged.allocation_barrier env ^^
set_x (* discard result *)
G.i Drop
)
| Array t ->
let (set_len, get_len) = new_local env "len" in
@@ -8021,7 +8084,7 @@ module FuncDec = struct

get_clos ^^
Tagged.allocation_barrier env ^^
set_clos (* discard the result *)
G.i Drop
in

if is_local
6 changes: 3 additions & 3 deletions test/bench/ok/alloc.drun-run.ok
@@ -1,8 +1,8 @@
ingress Completed: Reply: 0x4449444c016c01b3c4b1f204680100010a00000000000000000101
ingress Completed: Reply: 0x4449444c0000
debug.print: (+268_435_456, 2_818_724_162)
debug.print: (+268_435_456, 2_533_732_672)
ingress Completed: Reply: 0x4449444c0000
debug.print: (+268_435_456, 2_818_572_610)
debug.print: (+268_435_456, 2_533_581_120)
ingress Completed: Reply: 0x4449444c0000
debug.print: (+268_435_456, 2_818_572_610)
debug.print: (+268_435_456, 2_533_581_120)
ingress Completed: Reply: 0x4449444c0000
4 changes: 2 additions & 2 deletions test/bench/ok/heap-32.drun-run.ok
@@ -1,5 +1,5 @@
ingress Completed: Reply: 0x4449444c016c01b3c4b1f204680100010a00000000000000000101
ingress Completed: Reply: 0x4449444c0000
debug.print: (50_227, +91_377_860, 1_435_538_669)
debug.print: (50_070, +102_586_000, 1_507_623_284)
debug.print: (50_227, +91_377_860, 1_433_012_379)
debug.print: (50_070, +102_586_000, 1_504_942_528)
ingress Completed: Reply: 0x4449444c0000
12 changes: 6 additions & 6 deletions test/bench/ok/palindrome.drun-run.ok
@@ -1,9 +1,9 @@
ingress Completed: Reply: 0x4449444c016c01b3c4b1f204680100010a00000000000000000101
ingress Completed: Reply: 0x4449444c0000
debug.print: (true, +1_188, 13_165)
debug.print: (false, +1_188, 12_450)
debug.print: (false, +1_188, 13_155)
debug.print: (true, +868, 12_841)
debug.print: (false, +868, 11_660)
debug.print: (false, +868, 12_800)
debug.print: (true, +1_188, 11_988)
debug.print: (false, +1_188, 11_273)
debug.print: (false, +1_188, 11_978)
debug.print: (true, +868, 11_936)
debug.print: (false, +868, 10_755)
debug.print: (false, +868, 11_895)
ingress Completed: Reply: 0x4449444c0000
