John Hawthorn

CRuby Experiment: Branch Free SPECIAL_CONST_P

In Ruby, everything is an object, even Integers.

CRuby uses a tagged pointer scheme, a memory and performance optimization which represents some common objects within its pointer itself without a separate memory allocation. This works by repurposing the low bits, which are otherwise always 0 due to alignment.

For example, if an object is an Integer (below a certain size), its VALUE will have its least significant bit set to 1 and the remaining 63 bits represent. Internally this is called a “Fixnum” and before Ruby 2.4, where Fixnum and Bignum were unified into Integer, that class was visible to userland.

A similar technique is used to encode some Symbols, Floats, and the values of Qtrue (Ruby’s true), Qfalse (Ruby’s false), Qnil (Ruby’s nil), and Qundef (a value that’s not visible to Ruby users).

You can see the specifics in special_consts.h

This is great! It means we can do basic integer math without allocating an object and without needing to access memory elsewhere.

SPECIAL_CONST_P

The cost we pay for the benefits of these tagged non-heap values is that we can’t assume that our VALUEs point to memory locations. Any time we want to check the object’s flags, class, or GC attributes, we must first determine whether or not the value is a “special constant”. The function to check this is SPECIAL_CONST_P.

As of Ruby 3.2 SPECIAL_CONST_P is defined as (slightly modified for readability)

inline bool SPECIAL_CONST_P(VALUE obj) { // return RB_IMMEDIATE_P(obj) || obj == RUBY_Qfalse; // return (obj & RUBY_IMMEDIATE_MASK) || obj == RUBY_Qfalse; return (obj & 7) || obj == 0; }

So objects with any of the 3 lowest bits set (Ruby’s internals call these “immediate values”) are special constants and Qfalse is a special constant. Again Qfalse is Ruby’s false, but it’s defined as 0 so it’s also C’s false.

This definition is new to Ruby 3.2, previously Qnil was also a special case. It was simplified in ruby/ruby#6759. Which made me think that maybe it could be done using bitwide operations and without branches.

Bitwise Tricks

There are two resources I really love for bitwise hacks: Sean Eron Anderson’s “Bit Twiddling Hacks” webpage and “Hacker’s Delight” by Henry S. Warren.

Hacker’s Delight is comprehensive. It starts by presenting basic techniques, identities and techniques which can be composed together. The author also made an easy to use superoptimizer, “Aha!” (A Hackers Assistant).

A superoptimizer is a program that exhaustively searches combinations of machine instructions to find the optimal sequence (by some criteria) which gives the desired output.

I used this fork of Aha! which targets ARM code. I gave it a version of the SPECIAL_CONST_P method, for reference, ran it, and it quickly output a solution:

$ ./aha 4
Searching for programs with 4 operations.

Found a 4-operation program:
   neg   r1,rx
   and   r2,r1,rx
   sub   r3,r2,#5
   shr   r4,r3,#31
   Expr: (((-(x) & x) - 5) >>u 31)

It seems to work! But what’s going on?

(x - 5) >> 31 is doing a subtraction, and moving the sign bit from the most significant position to the least significant. (Aha! runs the simulation with 32 bits). So that’s basically equivalent to (x < 5) or (x <= 4).

“Hacker’s Delight” gives us a technique to create a word with 1’s in all the positions of trailing 0’s. The following are equivalent:

(x - 1) & ~x
(x & -x) - 1
~(x | -x)

So in the case of a tagged immediate, we will get a result of either 0 (for a fixnum, with the 1 bit set, an odd number), 1 (bit 2 is set, an even number), or 3 (bit 3 is set, a multiple of 4). When we get a 0 the value we get has all bits set, which is -1 when interpreted as a signed integer (the only possible negative value from this operation). Using this we can perform a signed integer comparison on the output and anything <= 3 is an special constant!

(x & -x) - 1 <= 3

That’s not far from our expression from the superoptimizer, and we can just add 1 to both sides of the inequality to clean it up further. This also removes the need for a signed integer!

(x & -x) <= 4

We can test this expression in compiler explorer and see that it essentially replicates the assembly from the superoptimizer as well.

On x86_64 gcc -O3:

SPECIAL_CONST_P(long):
        mov     rax, rdi
        neg     rax
        and     rax, rdi
        cmp     rax, 4
        setle   al
        ret

On arm64 gcc -O3:

SPECIAL_CONST_P(long):
        neg     x1, x0
        and     x1, x1, x0
        cmp     x1, 4
        cset    w0, le
        ret

The compiler knows how to make this branch-free using cmp/setle or cmp/cset, which performs the equivalent of our previous >> 31 hack.

I’ve tested this inside CRuby and it passes all tests. It works!

https://github.com/jhawthorn/ruby/commit/f4d8686c54dd5fb727723c8cef50a0f634abf9e0

Branch free code

Now we have a branch free version of SPECIAL_CONST_P.

This in itself isn’t necessarily better, and we probably shouldn’t replace SPECIAL_CONST_P everywhere with this new version. Though it should be about the same speed we don’t normally use SPECIAL_CONST_P just to get a boolean 0 or 1 value, and compiler optimizations understand the old version better and can apply more optimizations when using it. Also branches aren’t bad, as long as they’re predictable (what is and isn’t “predictable” I find extremely difficult to have an intuition for as modern branch predictors are surprisingly sophisticated. Agner Fog’s page is a good place to read more).

Where I think there could be merit to this trick is composing it into a larger sequence of useful branch free code, especially in scenarios where the previous branches are unpredictable.

For example, a common operation in Ruby is to get the class of an object. Previously we needed branching to do this because the way we get a class on a heap object would crash if we tried it on a special constant.

Here’s the existing code

static inline VALUE rb_class_of(VALUE obj) { if (! RB_SPECIAL_CONST_P(obj)) { return RBASIC_CLASS(obj); // fetch from memory } else if (obj == RUBY_Qfalse) { return rb_cFalseClass; } else if (obj == RUBY_Qnil) { return rb_cNilClass; } else if (obj == RUBY_Qtrue) { return rb_cTrueClass; } else if (RB_FIXNUM_P(obj)) { return rb_cInteger; } else if (RB_STATIC_SYM_P(obj)) { return rb_cSymbol; } else if (RB_FLONUM_P(obj)) { return rb_cFloat; } }

The disassembly shows that it’s full of branches.

   test   dil,0x7
   jne    0x3f0b0 <rb_class_of+23>
   mov    rax,QWORD PTR [rip+0x4df422]        # 0x51e4c8 <rb_cFalseClass>
   test   rdi,rdi
   je     0x3f0ee <rb_class_of+85>
   mov    rax,QWORD PTR [rdi+0x8]
   ret
   cmp    rdi,0x4
   jne    0x3f0be <rb_class_of+37>
   mov    rax,QWORD PTR [rip+0x4df41b]        # 0x51e4d8 <rb_cNilClass>
   ret
   cmp    rdi,0x14
   jne    0x3f0cc <rb_class_of+51>
   mov    rax,QWORD PTR [rip+0x4df405]        # 0x51e4d0 <rb_cTrueClass>
   ret
   test   dil,0x1
   je     0x3f0da <rb_class_of+65>
   mov    rax,QWORD PTR [rip+0x4df337]        # 0x51e410 <rb_cInteger>
   ret
   mov    rax,QWORD PTR [rip+0x4df337]        # 0x51e418 <rb_cFloat>
   cmp    dil,0xc
   jne    0x3f0ee <rb_class_of+85>
   mov    rax,QWORD PTR [rip+0x4e008a]        # 0x51f178 <rb_cSymbol>
   ret

Instead what we can do is make a lookup table, mapping the last 3 bits (the tag bits) of a special constant value to its exact class.

static inline VALUE rb_class_of(VALUE obj) { if (! RB_SPECIAL_CONST_P(obj)) { return RBASIC_CLASS(obj); } else { VALUE klass = special_const_class_lookup[obj & 0x1f]; return klass; } }

Which looks like it has a branch but the compiler (at least clang, I found this unpredictable in GCC) emits this as a conditional mov.

   mov    rax,rdi
   neg    rax
   and    rax,rdi
   mov    ecx,edi
   and    ecx,0x1f
   shl    rcx,0x3
   add    rcx,QWORD PTR [rip+0x23dc0f]        # 0x4b3d78
   add    rdi,0x8
   cmp    rax,0x5
   cmovl  rdi,rcx
   mov    rax,QWORD PTR [rdi]
   ret

Shorter overall and no branches, but execution may take fewer instructions in the original version, depending on the path taken.

https://github.com/jhawthorn/ruby/commit/23155ce588dbb9eaaf3abb6e0391ea2f9f959167

Not too surprisingly, this version didn’t end up being an improvement when swapped out completely in CRuby. Once again I think the old version being friendlier to compiler optimizations, avoiding memory access, and being shorter in terms of execution wins out.

What use is this?

Right now nothing that I’ve seen 😅. It’s a nifty trick and I’m glad to have found it.

I think it’s still possible there’s a scenario where we have a very unpredictable branch where this would be more beneficial than the branching code. Perhaps something like GC marking, where we only care whether or not an object is on the heap.

It’s something I’ll keep in mind, but for now it’s just for fun.