Playing with ruby's new JIT: MJIT

Just-in-time is an illusion. Albert Einstein

This week, Takashi Kokubun (@k0kubun) merged the first implementation of a JIT compiler in MRI ruby.

k0kubun @k0kubun
I've just committed the initial JIT compiler for Ruby. It's not still so fast yet (especially it's performing badly with Rails for now), but we have much time to improve it until Ruby 2.6 (or 3.0) release. github.com/ruby/ruby/commit/ed935a…
3:27 AM - 4 Feb 2018

As the commit explains, this is still early days for JIT in MRI ruby. It’s not yet ready to make Rails faster, and it’s slower right now than some of the earlier prototypes, but it’s here.

I’m really excited about this. Ruby has an (only partly deserved) reputation for being slow. A JIT has the potential to solve this.

Despite its experimental state, I couldn’t wait to give it a go.

I cloned MRI trunk, built it, and installed it into my ~/.rubies directory,

$ git clone https://github.com/ruby/ruby
$ cd ruby
$ autoconf
$ ./configure --prefix=$HOME/.rubies/ruby-2.6.0-dev
$ make
$ make install

then switched to the new version.

$ chruby 2.6
$ ruby -v
ruby 2.6.0dev (2018-02-05 trunk 62211) [x86_64-linux]

I tested it out with my Advent of Code 2017 day 15 solution (spoilers below).

def calculate(a, b, n = 40_000_000)
  n.times.count do
    a = a * 16807 % 2147483647
    b = b * 48271 % 2147483647

    (a & 0xffff) == (b & 0xffff)
  end
end

raise unless calculate(65, 8921) == 588
p result: calculate(699, 124)

This should be an ideal candidate for ruby’s JIT. It’s not calling out to other expensive methods. It’s just doing math and should be mostly bound by interpreter speed.

$ time ruby --disable-gems 15a.rb
{:result=>600}
  8.30s user
  0.00s system
   100% cpu
  8.306 total

$ time ruby --disable-gems --jit 15a.rb
{:result=>600}
  6.42s user
  0.03s system
   105% cpu
  6.132 total

It works! We went from 8.3 seconds to 6.1 seconds just by enabling JIT.
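The shell's time includes interpreter startup. As a rough cross-check, the same method can be timed from inside the process with the standard library's Benchmark module. This is only a sketch: the iteration count is cut down to 100_000 (my choice, not the original 40 million) so that it finishes quickly.

```ruby
require "benchmark"

# Same method as above; only the iteration count passed in below differs.
def calculate(a, b, n = 40_000_000)
  n.times.count do
    a = a * 16807 % 2147483647
    b = b * 48271 % 2147483647

    (a & 0xffff) == (b & 0xffff)
  end
end

result = nil
# Benchmark.realtime returns the wall-clock seconds spent inside the block.
elapsed = Benchmark.realtime { result = calculate(65, 8921, 100_000) }
puts format("result=%d elapsed=%.3fs", result, elapsed)
```

Running it once with and once without --jit gives a comparison that leaves out boot time, though a run this short may finish before the JIT has compiled anything.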

MJIT internals

I’m not sure, but I believe MJIT’s approach is somewhat unconventional. From the comments in mjit.c:

We utilize widely used C compilers (GCC and LLVM Clang) to implement MJIT. We feed them a C code generated from ISEQ. The industrial C compilers are slower than regular JIT engines. Generated code performance of the used C compilers has a higher priority over the compilation speed.

MJIT takes a block of ruby’s YARV bytecode and converts it into what is basically an inlined version of the C code it would have run when interpreting it.
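To see the bytecode MJIT starts from, MRI can disassemble a snippet with RubyVM::InstructionSequence. A minimal sketch, using a cut-down version of the loop body rather than the whole script (the exact listing varies between ruby versions):

```ruby
# One line of the loop body, compiled to YARV bytecode.
source = <<~RUBY
  a = 65
  a = a * 16807 % 2147483647
RUBY

iseq = RubyVM::InstructionSequence.compile(source)
disasm = iseq.disasm

# The listing should contain instructions such as putobject, opt_mult and opt_mod.
puts disasm
```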

In some ways, this is the same as what other JITs do: they compile bytecode into machine code at runtime. I don’t know of another JIT which so directly shells out to an off-the-shelf C compiler.

I think I like it.

It works, for one thing. It feels nice and UNIX-y. Best of all, it makes the JIT very inspectable.

Let’s see exactly what it’s doing:

$ ruby --disable-gems --jit --jit-verbose=2 --jit-save-temps 15a.rb
Starting process: gcc -O2 [...] /tmp/_ruby_mjitp18966u0.c -o /tmp/_ruby_mjitp18966u0.so
JIT success (133.1ms): block in calculate@15a.rb:2 -> /tmp/_ruby_mjitp18966u0.c
{:result=>600}

$ ls -l /tmp/_ruby_mjitp18966u0.*
-rw-r--r-- 1 jhawthorn users  8765 Feb  5 22:18 /tmp/_ruby_mjitp18966u0.c
-rwxr-xr-x 1 jhawthorn users 64384 Feb  5 22:18 /tmp/_ruby_mjitp18966u0.so

MJIT has taken our “hot” block, the inner loop of our calculate function, and

Converted it to C
Written that C to /tmp/_ruby_mjitp18966u0.c
Used GCC (or clang) to compile that to /tmp/_ruby_mjitp18966u0.so
Dynamically loaded that shared library to run it

Side note: The JIT run took 105% CPU, even though the ruby code is single threaded. A second thread runs the JIT compiler. That thread, and the GCC process it spawned, are responsible for that extra 5% which would have run on a different CPU core. Neat!

Let’s take a look at the generated C code (full file as gist):

#include "/tmp/_mjit_hp18966u0.h"
/* block in calculate@15a.rb:2 */

VALUE _mjit0(rb_execution_context_t *ec, rb_control_frame_t *reg_cfp) {
  VALUE *stack = reg_cfp->sp;
  if (reg_cfp->pc != 0x561e21fc1110) {
    return Qundef;
  }

label_0: /* nop */
{
    reg_cfp->pc = (VALUE *)0x561e21fc1110;
    reg_cfp->sp = reg_cfp->bp + 1;
    {
        /* none */
    }
}

label_1: /* getlocal_WC_1 */
{
    MAYBE_UNUSED(VALUE) val;
    MAYBE_UNUSED(lindex_t) idx;
    MAYBE_UNUSED(rb_num_t) level;
    level = 1;
    idx = (lindex_t)0x4;
    reg_cfp->pc = (VALUE *)0x561e21fc1118;
    reg_cfp->sp = reg_cfp->bp + 1;
    {
        val = *(vm_get_ep(GET_EP(), level) - idx);
        RB_DEBUG_COUNTER_INC(lvar_get);
        (void)RB_DEBUG_COUNTER_INC_IF(lvar_get_dynamic, level > 0);
    }
    stack[0] = val;
}

label_3: /* putobject */
{
    MAYBE_UNUSED(VALUE) val;
    val = (VALUE)0x834f;
    reg_cfp->pc = (VALUE *)0x561e21fc1128;
    reg_cfp->sp = reg_cfp->bp + 2;
    {
        /* */
    }
    stack[1] = val;
}

label_5: /* opt_mult */
{
    MAYBE_UNUSED(CALL_CACHE) cc;
    MAYBE_UNUSED(CALL_INFO) ci;
    MAYBE_UNUSED(VALUE) obj, recv, val;
    ci = (CALL_INFO)0x561e21fc13a0;
    cc = (CALL_CACHE)0x561e21fc1420;
    recv = stack[0];
    obj = stack[1];
    reg_cfp->pc = (VALUE *)0x561e21fc1138;
    reg_cfp->sp = reg_cfp->bp + 3;
    {
        val = vm_opt_mult(recv, obj);

        if (val == Qundef) {
            return Qundef; /* cancel JIT */
        }
    }
    stack[0] = val;
}

/* 300 more lines... */

The comments correspond to the YARV instructions being compiled. Each instruction manipulates the stack, the program counter, and the rest of ruby’s VM in exactly the way the interpreter’s vm_exec_core would have.

The section I’ve included here gets all the way through the first multiplication:

Get the local variable a, which is pushed on the stack
Push our multiplicand, 16807, on the stack (represented by its object_id 0x834f)
Call vm_opt_mult with the 2 values on the stack.
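The 0x834f in the generated code is not arbitrary. On MRI, small integers are stored directly in the VALUE word as (n << 1) | 1. A quick sketch to check that; fixnum_value_bits is a helper name I made up, not something from ruby's internals:

```ruby
# How MRI tags a small Integer inside a VALUE word:
# shift left by one and set the low bit.
def fixnum_value_bits(n)
  (n << 1) | 1
end

puts format("0x%x", fixnum_value_bits(16807)) # prints 0x834f
puts format("0x%x", fixnum_value_bits(48271)) # the multiplier for b, encoded the same way

# On MRI the object_id of a small Integer is the same tagged value.
puts 16807.object_id == fixnum_value_bits(16807)
```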

One great thing about this is just how good C compilers are. The compiler will do its best to inline the functions from ruby’s internals (like vm_get_ep and vm_opt_mult) being called here. It will avoid assignments to the stack and other memory locations if it can infer that they aren’t needed, or if it can just assign the final written value. So it should do a reasonable job even with this simple implementation.

Changing the ruby code to JIT better?

It’s way too early to change any ruby code to be more JIT friendly. MJIT could look totally different in a few months.

But as long as it’s just for fun, I think I will indulge.

With our original ruby code, because #times and #count are written in C, and are separate method calls, they aren’t being optimized together with our JIT’d internal block. By writing less idiomatic ruby, we can JIT the outside of the loop as well.

def calculate(a, b, n = 40_000_000)
  i = 0
  c = 0
  while i < n
    a = a * 16807 % 2147483647
    b = b * 48271 % 2147483647

    c += 1 if (a & 0xffff) == (b & 0xffff)
    i += 1
  end
  c
end
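Before timing the rewrite, it's worth checking that both versions agree. A small sketch with the two definitions side by side, renamed calculate_block and calculate_while (names I made up for this comparison) and run for only 5 iterations:

```ruby
# Original version: the loop lives in #times and #count, both written in C.
def calculate_block(a, b, n)
  n.times.count do
    a = a * 16807 % 2147483647
    b = b * 48271 % 2147483647

    (a & 0xffff) == (b & 0xffff)
  end
end

# Rewritten version: the whole loop is plain ruby inside a single method.
def calculate_while(a, b, n)
  i = 0
  c = 0
  while i < n
    a = a * 16807 % 2147483647
    b = b * 48271 % 2147483647

    c += 1 if (a & 0xffff) == (b & 0xffff)
    i += 1
  end
  c
end

# With the example inputs, only the third of the first five pairs matches.
p block: calculate_block(65, 8921, 5), while: calculate_while(65, 8921, 5)
```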

Without JIT:

$ time ruby --disable-gems 15a.rb
{:result=>600}
ruby --disable-gems 15a.rb  6.80s user 0.00s system 99% cpu 6.803 total

With JIT:

$ time ruby --jit --jit-verbose=1 15a.rb
{:result=>600}
  6.85s user
  0.02s system
   103% cpu
  6.862 total

What happened?

Not only is this slightly slower than the non-JIT version, it’s also slower than our original code was with JIT enabled, even though the while loop is faster when not using JIT.

MJIT didn’t know to optimize this method. It’s using a pretty naive heuristic to determine what methods to optimize (which could change in the future).

$ ruby --help
...
MJIT options (experimental):
  --jit-min-calls=num
                  Number of calls to trigger JIT (for testing, default: 5)

MJIT optimizes methods and blocks once they have been called at least 5 times, but we’re only calling calculate twice. It knew to optimize the inner block before because that block was being called millions of times.

It would eventually figure this out on a long running process like a web server. I suspect even 5 times is quite low and should eventually be raised.
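To make the heuristic concrete, here is a toy model of a call-count trigger. It is only an illustration of the idea under my own assumptions: CallCounter and its methods are invented for this sketch and are not MJIT's actual data structures.

```ruby
# Toy model: count calls per method and queue a method for compilation
# the moment its count reaches the threshold.
class CallCounter
  attr_reader :queued

  def initialize(min_calls: 5)
    @min_calls = min_calls
    @calls = Hash.new(0)
    @queued = []
  end

  # Record one call; queue the method exactly once, when it hits min_calls.
  def record(name)
    @calls[name] += 1
    @queued << name if @calls[name] == @min_calls
  end
end

counter = CallCounter.new
2.times { counter.record(:calculate) }              # called twice: never queued
40.times { counter.record(:"block in calculate") }  # hot block: queued on call 5
p counter.queued
```

In this model calculate never reaches the threshold, which matches what the timings above suggest happened to the while-loop version.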

As long as this is just for fun, we can cheat and telegraph to MJIT that we would like it very much if it compiled our method.

# HACK THE PLANET!
# Call calculate 5 times (with reduced iterations)
# so that MJIT thinks it's hot, and will optimize it.
5.times { calculate(0,0,1) }

raise unless calculate(65, 8921) == 588
p calculate(699, 124)

The next problem is that MJIT is asynchronous and runs in another thread. It will start just-in-time compiling after running calculate the first 5 times, but won’t be finished before the first real call to calculate.

It’s not just-in-time enough! It’s just-too-late!

There’s a command line option --jit-wait to work around that. It will make MJIT synchronous and finish that compilation before we move on. In most cases it will hurt performance instead of helping, but for this simple script it does exactly what we need.
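The difference between the two modes can be sketched with a plain ruby thread. Again this is a toy: ToyJit, request_compile, and the 50ms sleep standing in for gcc are all made up for illustration and say nothing about how MJIT is really implemented.

```ruby
# Toy model of asynchronous vs. waiting compilation.
class ToyJit
  def initialize(wait:)
    @wait = wait
    @compiled = false
    @worker = nil
  end

  # Start "compiling" on a background thread. With wait: true, block until
  # the worker is done, roughly what --jit-wait asks for.
  def request_compile
    @worker = Thread.new do
      sleep 0.05 # stand-in for the time spent in the C compiler
      @compiled = true
    end
    @worker.join if @wait
  end

  # Block until any in-flight compile has finished.
  def finish
    @worker&.join
  end

  # Which path a call would take right now.
  def call
    @compiled ? :jit : :interpreter
  end
end

async = ToyJit.new(wait: false)
async.request_compile
p async.call # most likely :interpreter, since the worker is still busy

sync = ToyJit.new(wait: true)
sync.request_compile
p sync.call  # the compile has already finished by the time we get here
```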

$ time ruby --disable-gems --jit --jit-wait --jit-verbose=1 15a.rb
JIT success (59.8ms): block in <main>@15a.rb:14 -> /tmp/_ruby_mjitp12605u0.c
JIT success (202.6ms): calculate@15a.rb:1 -> /tmp/_ruby_mjitp12605u1.c
{:result=>600}
  3.93s user
  0.03s system
   100% cpu
  3.957 total

It’s way too soon to do this for real code, but I hope it’s a first glimpse of what performance in ruby is going to look like.

I’m really excited to see where MJIT goes in the future.
