8-bit number to binary string (2013)

vardump · on May 25, 2022

Those constants are somewhat obfuscated.

3472328296227680304 == 0x3030303030303030

9241421688590303745 == 0x8040201008040201

72340172838076673 == 0x0101010101010101

... and suddenly things are a lot less mysterious.

Similar tricks were used for poor man's SIMD in one 32-bit register before MMX, Altivec etc. You could for example do just one multiply RGB 5-5-5 bit alpha blending instead of 3! ("Cooked" / premultiplied alpha of course.)

Twisol · on May 25, 2022

Hah. From the article:

> Of course, the last thing I did was changing the constants to decimal - they look way more scary this way ;>

glouwbug · on May 25, 2022

Isn't that poor man SIMD an Abrash Black Book trick?

vardump · on May 25, 2022

Perhaps, although a lot of people invented these tricks independently. I think some of the first ones have been known since seventies.

AdamH12113 · on May 25, 2022

Thank you, that makes much more sense.

dragontamer · on May 25, 2022

To truly understand this stuff... you have to study SIMD.

But wait, what SIMD instructions? No no no. The 64-bit register here is being used as a 64-way 1-bit SIMD.

You can see that addition and the "&" instructions are just simple SIMD. The multiplication instruction can be broken up into complex 64x bitshifts + 64x add instructions all into a single statement.

In general, this is called SWAR (SIMD within a register).

---------

The "popcount" trick is also SIMD/SWAR (bitshift and add the & 0xcccccccc / 0x33333333 / etc. etc.), and is in fact a "reduction" / "prefix sum" applied over 1-bit values in a 32-bit or 64-bit register applied in a 1-bit SIMD as well.

------------

This achieves very fast results because its innately parallel. A 64-bit processor after all, can be seen as a 64-way parallel 1-bit processor. Learning the parallelization techniques, and applying them in this fashion leads to this kind of super-fast code.

Furthermore: PEXT / PDEP from Intel can be seen as "bitwise 64x way gather" and "bitwise 64x way scatter" bit-instructions. The 64-bit register as a 64-way parallel computer is truly a marvelous abstraction.

stouset · on May 26, 2022

Incidentally, I wish PEXT existed for other architectures. It's extremely useful for a chess engine technique called Magic Bitboards.

mmphosis · on May 25, 2022

x86_64 no loop needed:

    movabsq $3472328296227680304, %rdx
    movzbl %dil, %eax
    movabsq $-9205322385119247871, %rdi
    imulq %rdi, %rax
    movabsq $72340172838076673, %rdi
    shrq $7, %rax
    andq %rdi, %rax
    addq %rdx, %rax
    movq %rax, (%rsi)

6502 with a loop:

    LDX #$08
  loop
    LSR $EB
    LDA #$30
    ADC #$00
    DEX
    STA result,X
    BNE loop

vanderZwan · on May 25, 2022

In the previous discussion someone lamented they wished they could do this in JavaScript[0].

Nowadays we have TextDecoder and BigInt64Array so I'm pretty sure we actually could do it[1][2].

I doubt it would be faster than calling <number>.toString(2), but in principle we could port this.

[0] https://news.ycombinator.com/item?id=10517332

[1] https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder

[2] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

vanderZwan · on May 27, 2022

Here we go:

    const toBinary = (() => {
      const u64 = new BigUint64Array(1);
      const u8 = new Uint8Array(u64.buffer)
      u8[0] = 255;
      const decoder = new TextDecoder();
    
      return (number) => {
        u64[0] = 0n;
        u8[0] = number;  // We're assuming little-endian architectures
        u64[0] = (((u64[0] * 0x8040201008040201n) >> 7n) & 0x0101010101010101n) + 0x3030303030303030n
        return decoder.decode(u8)
      };
    })()

Mr_Modulo · on May 25, 2022

<number>.toString(2) Wouldn't be quite right because the output needs to be padded out with zeros.

vanderZwan · on May 25, 2022

Oh, right, good catch! So <number>.toString(2).padStart(8, "0") then. I still expect it to be faster than this clever hack though.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

sfblah · on May 25, 2022

It's a clever solution. For converting unsigned chars to strings, though, a lookup table is going to be the fastest solution, since there are only 256 possibilities.

dragontamer · on May 25, 2022

Lookup tables hit the load/store units, IIRC only 2 loads per clock tick on modern Intel or AMD-Zen chips (though if anyone remembers, correct me if I'm wrong).

Modern bitshifts/add instructions can instead perform 3x per clock tick on Intel, and 4x per clock tick on AMD Zen.

Multiplication is 5-cycles of latency, but at least 1-per-clock tick, maybe 2-per-clock-tick depending on architecture. The /128 is just a "bitshift-right 7".

-------

EDIT: So you need 8x lookups (at least 4 clock ticks worth of resources, assuming the CPU can perform 2-lookups per clock tick) for your lookup table approach.

But this "*(unsigned long long*)out = 3472328296227680304ULL + (((c * 9241421688590303745ULL) / 128) & 72340172838076673ULL)" operation only has 3x bit operations + 1x multiply, using 2-clock ticks worth of resources.

The multiply and lookups have multiple clock-ticks of latency, but throughput tends to be the more important metric rather than latency in high-speed CPU code in my experience. (CPUs are very smart about rearranging code out-of-order to become throughput bound rather than latency bound)

userbinator · on May 25, 2022

Possibly fastest in a microbenchmark, but that's 2k of cache (at least) which the table will consume.

mortenlarsen · on May 25, 2022

Are you sure? A cache miss might be very expensive.

leni536 · on May 25, 2022

The distribution of bits can also be achieved with a PDEP instruction, I wonder if that would make this somewhat practical on a modern CPU.

pitaj · on May 25, 2022

Wow that's really clever. I never considered that the 8 unicode bytes corresponding to the 8 bits of u8 would fit in a u64.