... and suddenly things are a lot less mysterious.
Similar tricks were used for poor man's SIMD in one 32-bit register before MMX, Altivec etc. You could for example do just one multiply RGB 5-5-5 bit alpha blending instead of 3! ("Cooked" / premultiplied alpha of course.)
To truly understand this stuff... you have to study SIMD.
But wait, what SIMD instructions? No no no. The 64-bit register here is being used as a 64-way 1-bit SIMD.
You can see that addition and the "&" instructions are just simple SIMD. The multiplication instruction can be broken up into complex 64x bitshifts + 64x add instructions all into a single statement.
In general, this is called SWAR (SIMD within a register).
---------
The "popcount" trick is also SIMD/SWAR (bitshift and add the & 0xcccccccc / 0x33333333 / etc. etc.), and is in fact a "reduction" / "prefix sum" applied over 1-bit values in a 32-bit or 64-bit register applied in a 1-bit SIMD as well.
------------
This achieves very fast results because its innately parallel. A 64-bit processor after all, can be seen as a 64-way parallel 1-bit processor. Learning the parallelization techniques, and applying them in this fashion leads to this kind of super-fast code.
Furthermore: PEXT / PDEP from Intel can be seen as "bitwise 64x way gather" and "bitwise 64x way scatter" bit-instructions. The 64-bit register as a 64-way parallel computer is truly a marvelous abstraction.
It's a clever solution. For converting unsigned chars to strings, though, a lookup table is going to be the fastest solution, since there are only 256 possibilities.
Lookup tables hit the load/store units, IIRC only 2 loads per clock tick on modern Intel or AMD-Zen chips (though if anyone remembers, correct me if I'm wrong).
Modern bitshifts/add instructions can instead perform 3x per clock tick on Intel, and 4x per clock tick on AMD Zen.
Multiplication is 5-cycles of latency, but at least 1-per-clock tick, maybe 2-per-clock-tick depending on architecture. The /128 is just a "bitshift-right 7".
-------
EDIT: So you need 8x lookups (at least 4 clock ticks worth of resources, assuming the CPU can perform 2-lookups per clock tick) for your lookup table approach.
But this "*(unsigned long long*)out = 3472328296227680304ULL +
(((c * 9241421688590303745ULL) / 128) & 72340172838076673ULL)" operation only has 3x bit operations + 1x multiply, using 2-clock ticks worth of resources.
The multiply and lookups have multiple clock-ticks of latency, but throughput tends to be the more important metric rather than latency in high-speed CPU code in my experience. (CPUs are very smart about rearranging code out-of-order to become throughput bound rather than latency bound)
3472328296227680304 == 0x3030303030303030
9241421688590303745 == 0x8040201008040201
72340172838076673 == 0x0101010101010101
... and suddenly things are a lot less mysterious.
Similar tricks were used for poor man's SIMD in one 32-bit register before MMX, Altivec etc. You could for example do just one multiply RGB 5-5-5 bit alpha blending instead of 3! ("Cooked" / premultiplied alpha of course.)