Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Anatomy of loop compilation involving a compiled function #1409

Open
alexandergall opened this issue Jan 21, 2019 · 2 comments
Open

Anatomy of loop compilation involving a compiled function #1409

alexandergall opened this issue Jan 21, 2019 · 2 comments

Comments

@alexandergall
Copy link
Contributor

This issue discusses in detail how the compiler treats a loop containing a compiled function which is sensitive to the ctype of its argument.

Consider the program https://github.com/alexandergall/snabbswitch/blob/funcf-loop-analysis/src/loops.lua (note that this is on a branch that disables JIT in program.snsh to make the results deterministic). The objects d1 and d2 will have different ctype IDs (1417 vs 1419).
We are interested in the compilation of the last loop with the condition that foo() has already been compiled when it is reached. Satisfying this condition is an interesting exercise by itself, so let's start with that.

At this point, it is necessary to understand how the compiler manages the hot counters for loops and functions. The system uses a single array of counters, which is initialized at https://github.com/alexandergall/snabbswitch/blob/771b55c829f42a1a788002c2924c6d7047cd1568/lib/luajit/src/lj_dispatch.c#L83

  int32_t hotloop = G2J(g)->param[JIT_P_hotloop];
  HotCount start = (HotCount)(hotloop*HOTCOUNT_LOOP - 1);
  HotCount *hotcount = G2GG(g)->hotcount;
  uint32_t i;
  for (i = 0; i < HOTCOUNT_SIZE; i++)
    hotcount[i] = start;

The parameters are defined in https://github.com/alexandergall/snabbswitch/blob/771b55c829f42a1a788002c2924c6d7047cd1568/lib/luajit/src/lj_dispatch.h#L69

/* Type of hot counter. Must match the code in the assembler VM. */
/* 16 bits are sufficient. Only 0.0015% overhead with maximum slot penalty. */
typedef uint16_t HotCount;

/* Number of hot counter hash table entries (must be a power of two). */
#define HOTCOUNT_SIZE		64
#define HOTCOUNT_PCMASK		((HOTCOUNT_SIZE-1)*sizeof(HotCount))

/* Hotcount decrements. */
#define HOTCOUNT_LOOP		2
#define HOTCOUNT_CALL		1

JIT_P_hotloop is the value of the LuaJIT runtime parameter hotloop (http://luajit.org/running.html) which is 56 by default. Therefore, all 64 hot counters are initialized to 56*2-1 = 111.

The code that decrements the counters can be found in the implementation of the bytecode instructions ITERN, FORL, ITERL, LOOP and FUNCF. In this example, we are only interested in FORL and FUNCF and we consider the 32-bit VM from https://github.com/alexandergall/snabbswitch/blob/master/lib/luajit/src/vm_x86.dasc, i.e.

|// Decrement hashed hotcount and trigger trace recorder if zero.
|.macro hotloop, reg
|  mov reg, PC
|  shr reg, 1
|  and reg, HOTCOUNT_PCMASK
|  sub word [DISPATCH+reg+GG_DISP2HOT], HOTCOUNT_LOOP
|  jb ->vm_hotloop
|.endmacro
|
|.macro hotcall, reg
|  mov reg, PC
|  shr reg, 1
|  and reg, HOTCOUNT_PCMASK
|  sub word [DISPATCH+reg+GG_DISP2HOT], HOTCOUNT_CALL
|  jb ->vm_hotcall
|.endmacro
  case BC_FORL:
    |.if JIT
    |  hotloop RB
    |.endif
 case BC_FUNCF:
    |.if JIT
    |  hotcall RB
    |.endif

We can see here that the PC of the bytecode instruction is used as an index into the hotcount table by right-shifting once and masking by 0x3F (i.e. byte codes don't have individual hot counters). We also see that the counters are decremented by 2 for a loop but only 1 for a call, which means that it actually takes 112 passes through FUNCF before the call becomes hot.

How can we achieve that the function foo() in our example program becomes hot? Naively, we might try

for i = 1, 112 do
  foo(d1, i)
end

Let's see what happens. For the first 55 iterations, the interpreter is executing all instructions and the hot counters are 1 and 56 for the loop and the function, respectively. With the next iteration, the counter for foo() is decremented once more (the loop body is executed before the FORL instruction) but the loop becomes hot and a root trace is starting to get recorded. Detection of hot loops and calls is disabled during recording and, of course, also during the execution of a compiled trace, which means that the hot counter for foo() remains at 55 not matter how many more iterations there are.

So, to achieve our goal, we have to add another loop with at least 56 iterations. Note that we have to chose at least 58 iterations to avoid the ultimate iteration abort syndrome #1393.

Let's run our program with

sudo ./snabb snsh -jdump=+rs,/tmp/dump.txt loops.lua

and look at the dump.txt

The first trace that got recorded starts at the loop on line 13

---- TRACE 3 start loops.lua:13
0024  MOV      8   3
0025  MOV      9   1
0026  MOV     10   7
0027  CALL     8   1   3
0000  . FUNCF    2          ; loops.lua:7
0001  . TSETS    1   0   0  ; "c"
0000  . . . FUNCC               ; ffi.meta.__newindex
0002  . RET0     0   1
0028  FORL     4 => 0024
---- TRACE 3 IR
....              SNAP   #0   [ ---- ]
0001 rbp      int SLOAD  #5    CI
0002       >  fun SLOAD  #4    T
0003 rax   >  cdt SLOAD  #2    T
0004       >  fun EQ     0002  loops.lua:7
0005 rcx      u16 FLOAD  0003  cdata.ctypeid
0006       >  int EQ     0005  +1417
0007          p64 ADD    0003  +8  
0008          int XSTORE 0007  0001
0009 rbp    + int ADD    0001  +1  
....              SNAP   #1   [ ---- ---- ---- ---- ---- ]
0010       >  int LE     0009  +100
....              SNAP   #2   [ ---- ---- ---- ---- ---- 0009 ---- ---- 0009 ]
0011 ------------ LOOP ------------
0012          int XSTORE 0007  0009
0013 rbp    + int ADD    0009  +1  
....              SNAP   #3   [ ---- ---- ---- ---- ---- ]
0014       >  int LE     0013  +100
0015 rbp      int PHI    0009  0013

Note that the bytecode for the function call is FUNCF and the ctype ID of the function argument is 1417.

The second loop on line 16 is interpreted for the first 56 iterations. At that point, just before the loop itself gets hot, the function`s hot counter goes to -1 and the function is compiled in trace #4

---- TRACE 4 start loops.lua:7
0001  TSETS    1   0   0  ; "c"
0000  . . FUNCC               ; ffi.meta.__newindex
0002  RET0     0   1
---- TRACE 4 IR
....              SNAP   #0   [ ---- ---- ---- ]
0001       >  num SLOAD  #2    T
0002 rbx   >  cdt SLOAD  #1    T
0003 rbp      u16 FLOAD  0002  cdata.ctypeid
0004       >  int EQ     0003  +1417
0005          p64 ADD    0002  +8  
0006 rbp      int CONV   0001  int.num
0007          int XSTORE 0005  0006
....              SNAP   #1   [ ---- ]

Again, note the ctype id 1417. Right after that, the loop itself is also compiled

---- TRACE 5 start loops.lua:16
0033  MOV      8   3
0034  MOV      9   1
0035  MOV     10   7
0036  CALL     8   1   3
0000  . JFUNCF   2   4         ; loops.lua:7
0001  . TSETS    1   0   0  ; "c"
0000  . . . FUNCC               ; ffi.meta.__newindex
0002  . RET0     0   1
0037  FORL     4 => 0033
---- TRACE 5 IR
....              SNAP   #0   [ ---- ]
0001 rbp      int SLOAD  #5    CI
0002       >  fun SLOAD  #4    T
0003 rax   >  cdt SLOAD  #2    T
0004       >  fun EQ     0002  loops.lua:7
0005 rcx      u16 FLOAD  0003  cdata.ctypeid
0006       >  int EQ     0005  +1417
0007          p64 ADD    0003  +8  
0008          int XSTORE 0007  0001
0009 rbp    + int ADD    0001  +1  
....              SNAP   #1   [ ---- ---- ---- ---- ---- ]
0010       >  int LE     0009  +100
....              SNAP   #2   [ ---- ---- ---- ---- ---- 0009 ---- ---- 0009 ]
0011 ------------ LOOP ------------
0012          int XSTORE 0007  0009
0013 rbp    + int ADD    0009  +1  
....              SNAP   #3   [ ---- ---- ---- ---- ---- ]
0014       >  int LE     0013  +100
0015 rbp      int PHI    0009  0013

Note how the bytecode of the function call has changed to JFUNFC due to the fact the the function is now compiled. However, the trace recorder ignores this fact and inlines the function call just like in the first loop.

Finally we get to the case when a new loop is encountered that calls the compiled function with a different argument. Already in the first iteration, the interpreter hits the JFUNCF bytecode and executes the compiled function. However, the argument d2 has a different ctype ID (1419) than what was used when the function was compiled. Therefore, trace #4 is immediately left via exit #0 and execution is resumed by the interpreter. This happens 10 times according to the runtime option hotexit before this side exit becomes hot and a new trace is started to be recorded.

Being a side trace, it is not allowed to contain a loop and the compiler is starting to unroll the loop up to the maximum given by the option loopunroll which is 15 by default. After that, the recorder aborts with

---- TRACE 6 abort loops.lua:21 -- loop unroll limit reached

This happens again 3 more times according to the option tryside. At this point, the loop has been executed 70 times (10 times to reach hotexit and 4*15 times for loopunroll). The compiler "balcklists" exit #0 of trace #4 by a fixed fallback to the interpreter

---- TRACE 6 start 4/0 loops.lua:8
---- TRACE 6 IR
....              SNAP   #0   [ ---- ---- ---- ]
....              SNAP   #1   [ ---- ---- ---- ]
---- TRACE 6 mcode 29
2a4feb29  mov dword [0x41e9e410], 0x6
2a4feb34  xor eax, eax
2a4feb36  mov ebx, 0x41de7d34
2a4feb3b  mov r14d, 0x41e9efe0
2a4feb41  jmp 0x00419179
---- TRACE 6 stop -> interpreter

From this point on, trace #4 will exit to the interpreter immediately, just like before the exit became hot.

What happened to the hot counter of the loop? It is decremented in the usual fashion whenever an iteration is interpreted, i.e. during the first 10 iterations and after the blacklisting but not during the failed recordings of the side trace. So we need 60 additional iterations to see the loop finally getting compiled. That would be 116 iterations, but then we would hit #1393 again, so we add two more. The loop is compiled in trace #7

---- TRACE 7 start loops.lua:20
0042  MOV      8   3
0043  MOV      9   2
0044  MOV     10   7
0045  CALL     8   1   3
0000  . JFUNCF   2   4         ; loops.lua:7
0001  . TSETS    1   0   0  ; "c"
0000  . . . FUNCC               ; ffi.meta.__newindex
0002  . RET0     0   1
0046  FORL     4 => 0042
---- TRACE 7 IR
....              SNAP   #0   [ ---- ]
0001 rbp      int SLOAD  #5    CI
0002       >  fun SLOAD  #4    T
0003 rax   >  cdt SLOAD  #3    T
0004       >  fun EQ     0002  loops.lua:7
0005 rcx      u16 FLOAD  0003  cdata.ctypeid
0006       >  int EQ     0005  +1419
0007          p64 ADD    0003  +8  
0008          int XSTORE 0007  0001
0009 rbp    + int ADD    0001  +1  
....              SNAP   #1   [ ---- ---- ---- ---- ---- ]
0010       >  int LE     0009  +118
....              SNAP   #2   [ ---- ---- ---- ---- ---- 0009 ---- ---- 0009 ]
0011 ------------ LOOP ------------
0012          int XSTORE 0007  0009
0013 rbp    + int ADD    0009  +1  
....              SNAP   #3   [ ---- ---- ---- ---- ---- ]
0014       >  int LE     0013  +118
0015 rbp      int PHI    0009  0013

Note again that the JFUNCF bytecode is ignored and the function is inlined. Due to that, the compiler is now specializing the code for ctype ID 1419, as it should.

At this point the loop is finally compiled and optimized. In this simple example, compilation of the function has simply delayed compilation of the last loop and nothing bad has happened. However, the systems has spent quite a bit more time in the interpreter than it would have if foo() hadn't been compiled for a different ctype ID. I suspect that this behavior can make a difference in a more complex application, where additional time spent in the interpreter can lead to suboptimal traces in other parts of the system.

@lukego
Copy link
Member

lukego commented Jan 21, 2019

Great case study and write-up!

@lukego
Copy link
Member

lukego commented Jan 21, 2019

Just fwiw, the simplified way that I read examples like this is to say:

  • The JIT will compile each inner loop separately.
  • Called functions are effectively copy-pasted (inlined) into the loop and then freshly compiled there.

So in this case

  • There are three for ... end loops.
  • These all qualify as inner loops because they don't call any function that contains another loop.
  • Each loop will be specialized separately: the first two for d1 and the third for d2.

So I'd expect to have three looping traces each specialized for one FFI ctype and then perhaps some "background noise" traces that compile individual functions or connections between loops.

I like this simplified view as a way to quickly sketch how the most important parts of a program - its inner loops - are likely to be compiled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants