> But Google doesn't use big endian CPUs in prod, so you've proven the point yourself.

The whole point is you just memcmp the whole key byte array instead of having to parse out the fields. If the timestamp is in little-endian order, then memcmp won't give you chronological ordering on the timestamp suffix.
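To make that concrete, here's a rough sketch. The key layout (a 4-byte id followed by an 8-byte timestamp) and every function name are made up for illustration and not taken from any real system. It builds two keys with the same prefix and timestamps 255 and 256, then checks which way memcmp orders them under each byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KEY_LEN 12  /* hypothetical layout: 4-byte id + 8-byte timestamp */

/* Store v most-significant byte first, regardless of host byte order. */
static void put_be64(unsigned char *out, uint64_t v) {
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Store v least-significant byte first, for contrast. */
static void put_le64(unsigned char *out, uint64_t v) {
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (8 * i));
}

/* Build id || timestamp, with the timestamp in the requested byte order. */
static void make_key(unsigned char key[KEY_LEN], const char id[4],
                     uint64_t ts, int big_endian) {
    assert(key != NULL && id != NULL);
    memcpy(key, id, 4);
    if (big_endian)
        put_be64(key + 4, ts);
    else
        put_le64(key + 4, ts);
}

/* Sign (-1, 0, 1) of memcmp over two whole keys sharing the same id. */
static int key_order(const char id[4], uint64_t ts_a, uint64_t ts_b,
                     int big_endian) {
    unsigned char a[KEY_LEN], b[KEY_LEN];
    make_key(a, id, ts_a, big_endian);
    make_key(b, id, ts_b, big_endian);
    int c = memcmp(a, b, KEY_LEN);
    return (c > 0) - (c < 0);
}
```

With the big-endian suffix, key_order("user", 255, 256, 1) is negative, which matches chronological order. With the little-endian suffix the first timestamp byte is 0xFF vs. 0x00, so memcmp puts 256 before 255.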

In the case of pcmpeqb / pmovmskb / bsf, I was explicitly talking about memcmp, not finding the first same or differing byte. I don't want to get too bogged down in specific architectures vs. the more general pros/cons of byte orders themselves... but... there's pcmpeq and pcmpgt, but no pcmpne. That's fine, so to implement memcmp, you'd just xor the mask with -1 before bsf to find the index of the first differing byte and then explicitly compare the two bytes at that index.

So far so good, except that Intel explicitly designed AVX to scale up to 1024-bit vectors, at which point pmovmskb can't pack 128 mask bits into a 64-bit GP register. Fine, so use pcmpeqw with a word-granularity equivalent of pmovmskb (call it pmovmskw) instead of pcmpeqb/pmovmskb, but then once you have the offset of the first differing uint16_t, you can't just do a native-endian comparison on the two uint16_ts, because on a little-endian machine the byte at the lower address is the less significant one. If x86 were big-endian, then bsf with 64-bit GP registers would scale all the way up to 4096-bit vector registers (64 mask bits, one per 64-bit lane) without requiring extra bit extractions or bswap instructions. It's fundamentally an advantage that a native-endian 64-bit unsigned subtraction performs an 8-byte lexicographical comparison on big-endian machines.
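A scalar sketch of that last point, with no SIMD: memcmp_words and load_be64 are names I made up, and this isn't how any particular libc does it. The comparison runs 8 bytes at a time on integers whose most significant byte is the one at the lowest address. On a big-endian machine load_be64 would just be a plain 64-bit load; on little-endian it's the equivalent of a load plus bswap, which is exactly the extra step in question. I've written the comparison as an unsigned `<` rather than a literal subtraction, since what matters is the borrow, not the sign of the difference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assemble 8 bytes into an integer whose most significant byte is p[0].
   On a big-endian host this is equivalent to a plain 64-bit load. */
static uint64_t load_be64(const unsigned char *p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* memcmp-style comparison, 8 bytes at a time. Returns -1, 0, or 1. */
static int memcmp_words(const void *a, const void *b, size_t n) {
    const unsigned char *pa = a, *pb = b;
    assert(n == 0 || (pa != NULL && pb != NULL));

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t x = load_be64(pa + i);
        uint64_t y = load_be64(pb + i);
        /* One unsigned integer comparison orders all 8 bytes
           lexicographically, because byte 0 is most significant. */
        if (x != y)
            return x < y ? -1 : 1;
    }
    /* Fewer than 8 bytes left: fall back to byte-at-a-time. */
    for (; i < n; i++)
        if (pa[i] != pb[i])
            return pa[i] < pb[i] ? -1 : 1;
    return 0;
}
```

If you swapped load_be64 for a native little-endian load, the `x < y` test would be decided by the last differing byte in the word instead of the first, which is the uint16_t problem above at a larger width.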

Granted, the advantage is tiny, but this whole rabbit hole discussion was in reply to "I presume the __only__ advantage to BE today is compatibility then..." (emphasis mine) in [https://news.ycombinator.com/item?id=22660000]