Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
8-bit number to binary string (2013) (gynvael.coldwind.pl)
63 points by modinfo on May 25, 2022 | hide | past | favorite | 18 comments


Those constants are somewhat obfuscated.

3472328296227680304 == 0x3030303030303030

9241421688590303745 == 0x8040201008040201

72340172838076673 == 0x0101010101010101

... and suddenly things are a lot less mysterious.

Similar tricks were used for poor man's SIMD in one 32-bit register before MMX, Altivec etc. You could for example do just one multiply RGB 5-5-5 bit alpha blending instead of 3! ("Cooked" / premultiplied alpha of course.)


Hah. From the article:

> Of course, the last thing I did was changing the constants to decimal - they look way more scary this way ;>


Isn't that poor man SIMD an Abrash Black Book trick?


Perhaps, although a lot of people invented these tricks independently. I think some of the first ones have been known since seventies.


Thank you, that makes much more sense.


To truly understand this stuff... you have to study SIMD.

But wait, what SIMD instructions? No no no. The 64-bit register here is being used as a 64-way 1-bit SIMD.

You can see that addition and the "&" instructions are just simple SIMD. The multiplication instruction can be broken up into complex 64x bitshifts + 64x add instructions all into a single statement.

In general, this is called SWAR (SIMD within a register).

---------

The "popcount" trick is also SIMD/SWAR (bitshift and add the & 0xcccccccc / 0x33333333 / etc. etc.), and is in fact a "reduction" / "prefix sum" applied over 1-bit values in a 32-bit or 64-bit register applied in a 1-bit SIMD as well.

------------

This achieves very fast results because its innately parallel. A 64-bit processor after all, can be seen as a 64-way parallel 1-bit processor. Learning the parallelization techniques, and applying them in this fashion leads to this kind of super-fast code.

Furthermore: PEXT / PDEP from Intel can be seen as "bitwise 64x way gather" and "bitwise 64x way scatter" bit-instructions. The 64-bit register as a 64-way parallel computer is truly a marvelous abstraction.


Incidentally, I wish PEXT existed for other architectures. It's extremely useful for a chess engine technique called Magic Bitboards.


x86_64 no loop needed:

    movabsq $3472328296227680304, %rdx
    movzbl %dil, %eax
    movabsq $-9205322385119247871, %rdi
    imulq %rdi, %rax
    movabsq $72340172838076673, %rdi
    shrq $7, %rax
    andq %rdi, %rax
    addq %rdx, %rax
    movq %rax, (%rsi)
6502 with a loop:

    LDX #$08
  loop
    LSR $EB
    LDA #$30
    ADC #$00
    DEX
    STA result,X
    BNE loop


In the previous discussion someone lamented they wished they could do this in JavaScript[0].

Nowadays we have TextDecoder and BigInt64Array so I'm pretty sure we actually could do it[1][2].

I doubt it would be faster than calling <number>.toString(2), but in principle we could port this.

[0] https://news.ycombinator.com/item?id=10517332

[1] https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder

[2] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


Here we go:

    const toBinary = (() => {
      const u64 = new BigUint64Array(1);
      const u8 = new Uint8Array(u64.buffer)
      u8[0] = 255;
      const decoder = new TextDecoder();
    
      return (number) => {
        u64[0] = 0n;
        u8[0] = number;  // We're assuming little-endian architectures
        u64[0] = (((u64[0] * 0x8040201008040201n) >> 7n) & 0x0101010101010101n) + 0x3030303030303030n
        return decoder.decode(u8)
      };
    })()


<number>.toString(2) Wouldn't be quite right because the output needs to be padded out with zeros.


Oh, right, good catch! So <number>.toString(2).padStart(8, "0") then. I still expect it to be faster than this clever hack though.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


It's a clever solution. For converting unsigned chars to strings, though, a lookup table is going to be the fastest solution, since there are only 256 possibilities.


Lookup tables hit the load/store units, IIRC only 2 loads per clock tick on modern Intel or AMD-Zen chips (though if anyone remembers, correct me if I'm wrong).

Modern bitshifts/add instructions can instead perform 3x per clock tick on Intel, and 4x per clock tick on AMD Zen.

Multiplication is 5-cycles of latency, but at least 1-per-clock tick, maybe 2-per-clock-tick depending on architecture. The /128 is just a "bitshift-right 7".

-------

EDIT: So you need 8x lookups (at least 4 clock ticks worth of resources, assuming the CPU can perform 2-lookups per clock tick) for your lookup table approach.

But this "*(unsigned long long*)out = 3472328296227680304ULL + (((c * 9241421688590303745ULL) / 128) & 72340172838076673ULL)" operation only has 3x bit operations + 1x multiply, using 2-clock ticks worth of resources.

The multiply and lookups have multiple clock-ticks of latency, but throughput tends to be the more important metric rather than latency in high-speed CPU code in my experience. (CPUs are very smart about rearranging code out-of-order to become throughput bound rather than latency bound)


Possibly fastest in a microbenchmark, but that's 2k of cache (at least) which the table will consume.


Are you sure? A cache miss might be very expensive.


The distribution of bits can also be achieved with a PDEP instruction, I wonder if that would make this somewhat practical on a modern CPU.


Wow that's really clever. I never considered that the 8 unicode bytes corresponding to the 8 bits of u8 would fit in a u64.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: