
One thing directly connected to this history is why the x86 is little-endian. As the article explains, the 8008 was designed for the Datapoint 2200 terminal. The 8008 was intended as a compatible replacement for the existing Datapoint processor, which was built from simple TTL chips.

To reduce the chip count, the Datapoint 2200 used a serial processor, which processed one bit at a time, so you had a 1-bit ALU among other things. One consequence is that you have to start with the low-order bit when doing addition, so you can handle carries. And for 16-bit values, you also have to start with the low-order byte. This forces you into a little-endian architecture.

Thus, to be compatible with the Datapoint 2200, the 8008 was also made little-endian. Unfortunately, Intel was very slow creating the 8008, so Datapoint had moved on to a parallel 74181-based architecture and didn't want the 8008. Intel decided to sell the 8008 as a stand-alone product, essentially creating the microprocessor industry. As the article explains, the x86 grew out of the 8008, so x86 also inherited the little-endian architecture.



I don't think that really forced any particular endianness.

Endianness is only really a thing if there's addressable memory, which is not the case for a bit stream. Whether a system is big/little/mixed-endian is only a question of how the address lines are wired for multi-byte accesses.


I thought that too at first, but then I found this Stack Overflow answer that fairly convincingly confirms the OP's claim with quotes from its designers: https://stackoverflow.com/a/36263273 although it's still not explicitly clear why.

> Shustek: So, for example, storing numbers least significant byte first, came from the fact that this was _serial_ and you needed to process the low bits first.

> Poor: You had to do it that way. You had no choice.

I'm wondering if it was the bit order that mattered for compatibility with the 1-bit serial processor of the Datapoint 2200, and whether that determined the byte ordering of its 16-bit successor... I don't know much about 1-bit computing, but if you think about it, to be useful it must be capable of doing operations on arbitrary-length words (within some reasonable limit), i.e. not limited to 8 bits like the 8008. So maybe this put it in a funny position: while byte endianness makes no sense and bit endianness seems to make no difference for the 8008 since it was 8-bit, it did make a difference to the Datapoint, which was serial 1-bit and could operate across the 8-bit word boundaries.

To be clear about what I mean, imagine how a 1-bit ALU would process 16 bits; it would want them in order like this (the bits are in logical order, not numerical order):

    (8bit) Little Endian:
    byte 0000000011111111
    bit  0123456789abcdef
But not like this:

    (8bit) Big Endian:
    byte 0000000011111111
    bit  76543210fedcba98
I don't know if this is correct, and I'm assuming here that bit endianness must determine byte endianness in the 16-bit 8086. Clarification/correction is most welcome.


Assume you are adding two words one bit at a time with one bit of storage for the carry. You can do that with LE but not BE.


Can you please give a bit more detail? What I don't quite understand is why it would be a problem to do so in reverse bit order.

I'm assuming you are saying it's not about word boundaries? That was just my guess, but is it correct that bit order does matter?


When you add two binary numbers, the simplest method for handling the carry is a ripple carry: the carry bit ripples up from LSB to MSB.

If you're trying to build a simple 1-bit ALU, you only need to save the carry bit as long as you add the bits from least significant to most. If you do it in reverse, you have an intermediate result that needs to be saved.

One bit ALU adds like

   a0 + b0 => r0 + carry0
   a1 + b1 + carry0 => r1 + carry1
   a2 + b2 + carry1 => r2 + carry2
   ...
   an + bn + carry(n-1) => rn + carryn
As you can see you only need to keep the carry bit. The result bit gets written back to memory immediately.

What's going on is a 1 bit processor represents the most extreme trade off between gate count and speed for a given functionality. AKA low gate count but requires many clock cycles per operation.
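The loop above can be sketched in a few lines of Python (not period code, just illustrating why only the single carry bit has to survive between steps when you go LSB-first):

```python
def serial_add(a: int, b: int, width: int = 16) -> int:
    """Add two integers one bit at a time, LSB first, the way a
    1-bit serial ALU does: one carry bit is the only saved state."""
    carry = 0
    result = 0
    for i in range(width):            # bit 0 (LSB) first; carry ripples upward
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        total = bit_a + bit_b + carry
        result |= (total & 1) << i    # result bit written out immediately
        carry = total >> 1            # one bit of state for the next step
    return result & ((1 << width) - 1)

print(serial_add(0x00FF, 0x0001))  # 256
```

Starting from the MSB instead would mean buffering partial results, since a carry generated later would have to patch bits already emitted.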


> as long as you add the bits from least significant to most

Thanks, although I already understood what you explained, this bit made me realize more clearly the detail I was tripping up on. I was never imagining the 1-bit adder to start anywhere _but_ the LSB, what I couldn't figure out was why it couldn't simply iterate the bits in reverse logical order for big endian...

I'm guessing there is a good reason why serial processors can't read a sequence of bits in reverse logical order, and it probably applies to modern CPUs too? Either that, or it's a design choice that cannot be dynamic.

I should probably pick a book up at this point, thanks for answering my questions.


The carry between bytes can be applied as-you-go with LE but has to be done afterwards with BE, which would double the execution time (adding zero or one to each byte of the result).
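A hypothetical byte-serial sketch of the LE case, showing that the carry is consumed in a single pass with no second fix-up loop:

```python
def add_le(a: bytes, b: bytes) -> bytes:
    """Add two equal-length little-endian byte strings, low byte first.
    The carry from each byte feeds directly into the next step."""
    carry = 0
    out = bytearray()
    for x, y in zip(a, b):        # index 0 is the least significant byte
        s = x + y + carry
        out.append(s & 0xFF)
        carry = s >> 8
    return bytes(out)

# 0x01FF + 0x0001 = 0x0200, stored low byte first:
print(add_le(b"\xff\x01", b"\x01\x00").hex())  # '0002'
```

With BE storage you would either walk the bytes backwards or make a second pass to propagate carries forward.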


Pretty much all simple multi-word ALU algorithms need to start with the low word first. That particular terminal wasn't remotely unique (though for all I know, it was indeed the specific product that drove the Intel design decision), nor was x86 the first LE architecture. The PDP-11 made the same decision for the same reason several years earlier (and the VAX followed for pseudocompatibility), etc...

Really, it wasn't until the mid-'80s and the RISC revolution, when all of a sudden people found themselves designing systems that would be 32 bits wide from the very first silicon, that the community "decided" that the only true byte order should be BE.

And of course, the reason they made that decision was simply that the order of bytes in memory happened to match the way readers of the Latin alphabet write Arabic numerals on paper. Had the Arabs invented computing, there would never even have been a debate.


On the other hand, unsigned big-endian numbers sort lexicographically and numerically the same way. So, for instance, if you're using a b+ tree to store time series of key-value pairs, you can use tuples of <key, big_endian_timestamp> as your b+ tree keys and use memcmp() on the raw bytes of your keys.

For variable-length integers, using a UTF-8-like big-endian encoding allows you to keep the relationship between lexicographic and numeric sorting.

This is also a huge advantage of ISO 8601 over other date formats.


The mainframes that were byte-addressable (e.g. System/360) were generally big endian. I think this property led to the Motorola decision to use big endian on the 68000 and, because of its popularity in workstations, the RISC vendors followed suit. MIPS chose to make the endianness selectable by the hardware vendor, and that has become common, with little endian winning these days in ARM implementations.


> Really, it wasn't until the mid-80's and the RISC revolution, where all of a sudden people found themselves designing systems that would be 32 bits wide from the very first silicon that the community "decided" that the only true byte order should be BE.

IBM 360-series mainframes were 32-bit and big-endian in the 1960s. (Their earlier computers may have been big-endian too, I'm not sure.)


> (Their earlier computers may have been big-endian too, I'm not sure.)

Yes, their choice of big endian probably came from their earlier unit record equipment, which used punched cards as input, storage, and output. Since punched cards were (sort of) directly human readable, it was natural to use big endian.


Also, making punch cards big-endian meant that BCD integers sorted the same way lexicographically and numerically, so there wasn't any special sorting mode necessary for numeric fields in punch card sorting machines.


Indeed, and the M68K (released in 1979) is big-endian as well.


I've always had it memorised that LE is the "logical" order and BE the "backwards" order, specifically for this reason: with LE, the bit at offset n has value 2^n, and the byte at offset n has value 256^n. With BE it depends on how wide the total value is, which introduces an awkward length-dependent term.

Practically all multi-precision arithmetic libraries store the segments of a bignum in LE order too, despite the fact that the individual segments may be BE on a BE machine. That is even more confusing.
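To illustrate the offset/place-value point (Python, using `int.to_bytes` just to spell out the arithmetic):

```python
value = 0x0403_0201

le = value.to_bytes(4, "little")   # b'\x01\x02\x03\x04'
# With LE, the byte at offset n contributes byte * 256**n,
# independent of how wide the value is:
assert sum(b * 256**n for n, b in enumerate(le)) == value

be = value.to_bytes(4, "big")      # b'\x04\x03\x02\x01'
# With BE, the place value depends on the total width W (here 4 bytes):
assert sum(b * 256**(len(be) - 1 - n) for n, b in enumerate(be)) == value
```

The LE formula never mentions the width; the BE one carries the awkward `len(be) - 1 - n` term.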


That misunderstands history. The earlier DEC machines (e.g. PDP-1/6/10/20) were all BE (as were Multics and mainframes of the time); the PDP-10 is why network ordering is BE.

And while I consider the PDP-6 to be the first RISC architecture, really that title goes to 1970s machines like the 801.


Little endian is the natural memory ordering and was specified in von Neumann's EDVAC design doc during World War II, so the question isn't so much "why little endian?" but rather "why not?" The only reason computers were ever designed to use "traditional" ordering is that business mainframes loved BCD, and they loved to code GUIs that just blitted raw fixed memory onto the display driver.


As I mentioned elsewhere, the fact that big-endian positive integers (and IEEE-754 normalized floats/doubles) sort the same lexicographically and numerically is useful. The same goes for UTF-8-like big-endian encodings for variable length integers and ISO 8601-formatted dates/times. It's handy to be able to use memcmp() on compound b+ tree keys.

The advantage works both ways: it's also nice if your normal integer vector instructions can be used to accelerate memcmp, without needing an extra opcode or instruction flag to specifically support parallel lexicographic comparison. Note that if the vector units only support signed integers, then you only need an exclusive-or to flip all of the sign bits in order to get lexicographic sorting, and that vector exclusive-or has important real-world use cases for hash functions and so-called add-rotate-xor (ARX) encryption algorithms like ChaCha20.

Alternatively, if you're going to include a dedicated lexicographic comparison instruction for strcmp/memcmp (like x86 REP CMPS), a fast implementation will need less dedicated circuitry if your processor natively supports big-endian operations for other instructions.

Big-endian use in mainframes likely evolved from big-endian BCD numeric fields on punch card sorting/tabulating machines, where big-endianness was an advantage in that the normal lexicographic sorting would put constant-width numeric fields in numerical order.


SWAR techniques are orthogonal to endianness. Read Hacker's Delight; it mentions big endian only once, to explain how to convert away from it. Describing variable-length encoding as big-endian-like is quite a creative stretch. Xor does not flip two's complement sign correctly, except in a few special cases like YCbCr. Please read the EDVAC design doc, since von Neumann invented two's complement too. His ideas are the reason why programmers have never needed to care much about whether a value is signed or unsigned, since the machine arithmetic will behave exactly the same way (with few exceptions).


> Describing variable-length encoding as big-endian-like is quite a creative stretch.

LEB128 is a little-endian variable-length encoding. That's what the LE stands for.

UTF-8 is a big-endian variable-length encoding. The most significant bits are packed in the earlier bytes, so that lexicographic ordering comes out correctly.

I'm really alluding to something like LEB128-encoding uint64_ts, except big-endian and putting all of the continuation flags at the beginning of the encoding, so a single switch on the first byte gives you the encoded length and it also sorts correctly lexicographically.
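A rough Python sketch of one such scheme (I'm using a plain length-prefix byte rather than UTF-8-style unary continuation flags, but it has the same property: the first byte determines the length, and byte-wise order matches numeric order):

```python
import random

def encode_be_varint(n: int) -> bytes:
    """Hypothetical big-endian varint: one length byte followed by the
    minimal big-endian payload. Smaller numbers get shorter encodings,
    which sort first; equal-length payloads compare numerically."""
    payload = n.to_bytes((n.bit_length() + 7) // 8 or 1, "big")
    return bytes([len(payload)]) + payload

nums = [random.randrange(1 << 64) for _ in range(1000)] + [0, 1, 255, 256]
# Byte-wise (memcmp-style) order of the encodings == numeric order:
assert sorted(encode_be_varint(n) for n in nums) == \
       [encode_be_varint(n) for n in sorted(nums)]
```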

> Xor does not flip two's complement sign correctly

Re-read what I wrote. I wasn't talking about flipping the sign bit in order to negate the numeric value. I was talking about flipping the sign bit to get correct lexicographic text ordering on a machine that can only do signed comparisons.

> since the machine arithmetic will behave exactly the same way (with few exceptions).

I was talking about _exactly_ one of these exceptions... comparing two values. For simplicity, let's pretend we have a 16-bit BE CPU that can only perform signed comparisons, and we want to lexicographically compare two 4-byte arrays: [ 41 4F 4F 4F ] vs. [ C1 81 4F 4F ]. We perform the first load to compare 0x414F (16719) vs. 0xC181 (-15999). In order to get the correct lexicographic ordering in all cases, we need to invert the sense of the sign bit. Flipping the sign bit in these two cases gives 0xC14F (-16049) vs. 0x4181 (16769).

In all cases, flipping the sign bit allows one to perform unsigned comparison on a CPU that can only perform signed comparisons. For a W-bit word, you end up subtracting 2^(W-1) from all values with the most significant bit unset and adding 2^(W-1) to all values with the most significant bit set. I understand this doesn't negate the values. That's not the point. The point is to perform unsigned comparison on a CPU that doesn't natively support it.
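A quick sanity check of that sign-bit trick in Python (`as_signed16` is just a helper to reinterpret the same 16 bits as a signed value):

```python
import struct

def as_signed16(x: int) -> int:
    """Reinterpret a 16-bit unsigned pattern as a signed int16."""
    return struct.unpack(">h", struct.pack(">H", x))[0]

# XORing the sign bit maps unsigned order onto signed order:
a, b = 0x414F, 0xC181                    # unsigned: a < b
fa, fb = a ^ 0x8000, b ^ 0x8000          # 0xC14F and 0x4181
assert (a < b) == (as_signed16(fa) < as_signed16(fb))

# It holds for every pair of 16-bit values, so a signed-only comparator
# plus one XOR gives correct unsigned (lexicographic) comparison:
for a, b in [(0, 1), (0x7FFF, 0x8000), (0xFFFE, 0xFFFF), (0x8000, 0x7FFF)]:
    assert (a < b) == (as_signed16(a ^ 0x8000) < as_signed16(b ^ 0x8000))
```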


> Only reason computers were ever designed to use "traditional" ordering is because business mainframes loved BCD and they loved to code GUIs that just blit'd raw fixed memory onto the display driver.

Hah, so it's literally because we write numbers with most significant digit on the left, which is the opposite to logical ordering?

I presume the only advantage to BE today is compatibility then...


“We” being people with left-to-right languages. All rtl languages I know of write numbers the same as ltr languages, i.e. in the rtl case little endian.

Since you have to read an entire positional number to get even its magnitude, the only “advantage” I can see of LE over BE in natural languages is that at least rtl can get odd/even immediately. Then again, in ltr you can get the sign immediately. In real-life human use, neither “advantage” is compelling.


If you want your (vector or scalar) integer instructions to be useful for fast strcmp/memcmp implementations, then you'll want to support unsigned big-endian integers. Also, if you want to mix integers and strings in compound b+ tree keys and use memcmp to compare keys, you'll want to store your integers big-endian.


I don't believe you. That sounds more like bit reversal and Morton interleaving. If you're serious that it's actually big endian and that it has a real use case (outside human social traditions and legacy infrastructure maintenance), then I'd love to see convincing details.

As for strings, GCC/Clang are great at vectorizing normal byte-oriented code. It doesn't matter whether the compiler generates, or you hand-code, VPCMPEQB vs. VPCMPEQQ either, since they appear to have the exact same latency. Modern compilers have also been moving in the direction of punishing folks who do things like `* (uint32_t * )p`, due to aliasing rules rather than endianness. Rob Pike had an amusing blog post on this subject: https://commandcenter.blogspot.com/2012/04/byte-order-fallac...


> I don't believe you. That sounds more like bitreverse and morton interleaving. If you're serious that it's actually big endian and that it has a real use case (outside human social traditions and legacy infrastructure maintainence), then I'd love to see convincing details.

Back when I worked on Google's indexing system (2006-2010), one of the stages keyed documents by URL (with the host name in DNS order: com.google.www) followed by a big-endian timestamp. This put records for the same domain together, and put successive crawls for the same URL in chronological order. This improved compression ratios in our BigTable tablets, and finding the latest crawled version of <URL> was just a search for the largest key <= <URL><MAX_TIMESTAMP>.

> It doesn't matter if it generates or you hand code VPCMPEQB vs. VPCMPEQQ either, since they appear to have the exact same latency.

But with VPCMPEQB, you need to have 8 times as many conditional branches as VPCMPEQQ (in the naive case) in order to turn your vectorized comparison into a memcmp result. Note that VPCMPEQ* don't modify any flags, so you can't just JE/JNE after your VPCMPEQ*. Now, via some vector permutes, shifts, and bitwise-ORs, you can cut down on the number of conditional branches. However, it's still more processing than if you can load/compare your vector registers in lexicographic order.


But Google doesn't use big endian CPUs in prod, so you've proven the point yourself. I agree that lexicographically arranging things can have an awesome impact on performance though. Still think that has zero to do with endianness.

> But with VPCMPEQB, you need to have 8 times as many conditional branches as VPCMPEQQ in order to turn your vectorized comparison into a memcmp result

Not at all. It doesn't require any branches, other than the loop itself. Here's the trick everyone uses, for the simple case of finding one character, e.g. NUL:

    0:  movdqa (mem,i),vector
        pcmpeqb query,vector
        pmovmskb vector,bitmask
        bsf bitmask,offset
        jnz 1f
        add $vectorlen,i
        jmp 0b
    1:  add offset,i
        ret
BSF gives you the byte offset of the match, and bam you're done.


> But Google doesn't use big endian CPUs in prod, so you've proven the point yourself.

The whole point is you just memcmp the whole key byte array instead of having to parse out the fields. If the timestamp is in little-endian order, then memcmp won't give you chronological ordering on the timestamp suffix.

In the case of pcmpeqb / pmovmskb / bsf, I was explicitly talking about memcmp, not finding the first same or differing byte. I don't want to get too bogged down in specific architectures vs. the more general pros/cons of byte orders themselves... but... there's pcmpeq and pcmpgt, but no pcmpne. That's fine; to implement memcmp, you'd just xor with -1 before bsf to find the index of the first differing byte and then explicitly compare the differing byte. So far so good, except that Intel explicitly designed AVX to scale up to 1024-bit vectors, at which point pmovmskb can't pack the 128 mask bits into a 64-bit GP register. Fine, so use pcmpeqw/pmovmskw instead of pcmpeqb/pmovmskb, but then once you have the offset of the first differing uint16_t, you can't just do a native-endian comparison on the two uint16_ts. If the x86 were big-endian, then bsf with 64-bit GP registers would scale all the way up to 4096-bit vector registers without requiring extra bit extractions or bswap instructions. It's fundamentally an advantage that a native-endian 64-bit subtraction performs an 8-byte lexicographic comparison on big-endian machines.

Granted, the advantage is tiny, but this whole rabbit hole discussion was in reply to "I presume the __only__ advantage to BE today is compatibility then..." (emphasis mine) in [https://news.ycombinator.com/item?id=22660000]


Ah, but isn't that all part of the same quirk of human language, i.e. that we compare strings left to right, most significant first?


It's not some quirk. There are two common endiannesses, and the byte sort precedence of one of them matches the byte sort precedence we use for text. Since we represent characters as numbers, and for efficiency we have CPU instructions that treat aggregations of these numbers as larger numbers, we can get double duty out of these instructions/circuits if the order they use is the same as the lexicographic ordering of our text.

I'm not saying big-endian is universally better. I'm just saying there are real-world advantages.

Note there are more than two endiannesses, where the most common middle-endian variant is big-endian order of 16-bit words, but little-endian order within those 16-bit words.

As for linguistic quirks, there's no obvious universal connection between writing order for text, the order digits are written, and the way numbers are pronounced, so I'm not sure what you're getting at. Arabic is written right-to-left, but they still write the most significant digit on the left. However, I've read that Arabic writers line-break numbers (when forced) by placing the most significant digits on the upper line and the least significant digits on the lower line, so it's not a clear-cut case of Arabic writing numbers in little-endian order right-to-left. I'm not sure about Arabic number pronunciation. I do know that German pronunciation is middle-endian: 256 is "two hundred six-and-fifty", despite the digit writing order being big-endian in German. I met a German guy who incorrectly swaps digits if you talk to him in English while he's doing mental math... forcing him to process English somehow also swaps digit orders in his head.


Similarly, x86_64 condition codes have a parity flag because the terminal needed it.



