One thing directly connected to this history is why the x86 is little-endian. As the article explains, the 8008 was designed for the Datapoint 2200 terminal. The 8008 was intended as a compatible replacement for the existing Datapoint processor, which was built from simple TTL chips.
To reduce the chip count, the Datapoint 2200 used a serial processor, which processed one bit at a time, so you had a 1-bit ALU among other things. One consequence is that you have to start with the low-order bit when doing addition, so you can handle carries. And for 16-bit values, you also have to start with the low-order byte. This forces you into a little-endian architecture.
Thus, to be compatible with the Datapoint 2200, the 8008 was also made little-endian. Unfortunately, Intel was very slow creating the 8008, so Datapoint had moved on to a parallel 74181-based architecture and didn't want the 8008. Intel decided to sell the 8008 as a stand-alone product, essentially creating the microprocessor industry. As the article explains, the x86 grew out of the 8008, so x86 also inherited the little-endian architecture.
I don't think that really forced any particular endianness.
Endianness is only really a thing if there's addressable memory, which is not the case for a bit stream. Whether a system is big/little/mixed-endian was only a question of how the address lines are set for multi-byte accesses.
I thought that too at first, but then I found this Stack Overflow answer fairly convincingly confirming the OP's claim with quotes from its designers: https://stackoverflow.com/a/36263273 although it's still not explicitly clear why.
> Shustek: So, for example, storing numbers least significant byte first, came from the fact that this was _serial_ and you needed to process the low bits first.
> Poor: You had to do it that way. You had no choice.
I'm wondering if it was the bit order that mattered for compatibility with the 1-bit serial processor of the Datapoint 2200, and if that determined the byte ordering of its 16-bit successor?... I don't know much about 1-bit computing, but if you think about it - to be useful, it must be capable of doing operations on arbitrary-length words (within some reasonable limitation), i.e. not limited to 8 bits like the 8008. So maybe this put it in a funny position: while byte endianness makes no sense and bit endianness seems to make no difference for the 8008 since it was 8-bit, it did make a difference to the Datapoint, which was serial 1-bit and could operate across 8-bit word boundaries.
To be clear what I mean: imagine how a 1-bit ALU would process 16 bits; it would want them in order like this (the bits are in logical order, not numerical order):
(8bit) Little Endian:
byte 0000000011111111
bit 0123456789abcdef
But not like this:
(8bit) Big Endian:
byte 0000000011111111
bit 76543210fedcba98
I don't know if this is correct, and I'm assuming here that bit endianness must determine byte endianness in the 16-bit 8086. Clarification/correction is most welcome.
When you add two binary numbers the simplest method for handling the carry is a ripple carry. The carry bit ripples up from lsb to msb.
If you're trying to build a simple 1-bit ALU you only need to save the carry bit, as long as you add the bits from least significant to most. If you do it in reverse, you have intermediate results that need to be saved.
As you can see you only need to keep the carry bit. The result bit gets written back to memory immediately.
What's going on is a 1 bit processor represents the most extreme trade off between gate count and speed for a given functionality. AKA low gate count but requires many clock cycles per operation.
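To make that concrete, here's a minimal C sketch of an LSB-first bit-serial add (the 16-bit width and function name are just illustrative, not the Datapoint's actual design): only the single carry bit is kept between steps, and each result bit can be written out as soon as it is computed.

    #include <stdint.h>
    #include <stdio.h>

    /* LSB-first bit-serial addition: only the carry bit survives between steps,
       and each sum bit is emitted immediately, just like a 1-bit ALU would. */
    static uint16_t serial_add(uint16_t a, uint16_t b)
    {
        uint16_t sum = 0;
        unsigned carry = 0;
        for (int i = 0; i < 16; i++) {              /* low-order bit first */
            unsigned abit = (a >> i) & 1;
            unsigned bbit = (b >> i) & 1;
            unsigned s = abit ^ bbit ^ carry;       /* sum bit */
            carry = (abit & bbit) | (carry & (abit ^ bbit));
            sum |= (uint16_t)(s << i);              /* result bit written back immediately */
        }
        return sum;                                 /* final carry-out discarded */
    }

    int main(void)
    {
        printf("%u\n", (unsigned)serial_add(12345, 6789));   /* prints 19134 */
        return 0;
    }

Starting from the MSB instead would force you to buffer partial sums until the carries from the lower bits are known.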
> as long as you add the bits from least significant to most
Thanks. Although I already understood what you explained, this bit made me realize more clearly the detail I was tripping up on. I was never imagining the 1-bit adder to start anywhere _but_ the LSB; what I couldn't figure out was why it couldn't simply iterate the bits in reverse logical order for big endian...
I'm guessing there is a good reason why serial processors can't read a sequence of bits in reverse logical order, and it probably applies to modern CPUs too? Either that, or it's a design choice that cannot be dynamic.
I should probably pick a book up at this point, thanks for answering my questions.
The carry between bytes can be applied as-you-go with LE but has to be done after with BE. Which would double the execution time (adding zero/1 to each byte of the result).
Pretty much all simple multi-word ALU algorithms need to start with the low word first. That particular terminal wasn't remotely unique (though for all I know, it was indeed the specific product that drove the Intel design decision), nor was x86 the first LE architecture. The PDP-11 made the same decision for the same reason several years earlier (and the VAX followed for pseudocompatibility), etc...
Really, it wasn't until the mid-80s and the RISC revolution, when all of a sudden people found themselves designing systems that would be 32 bits wide from the very first silicon, that the community "decided" that the only true byte order should be BE.
And of course, the reason they made that decision was simply that the order of bytes in memory happened to match the way readers of the Latin alphabet write Arabic numerals on paper. Had the Arabs invented computing, there would never even have been a debate.
On the other hand, unsigned big-endian numbers sort lexicographically and numerically the same way. So, for instance, if you're using a b+ tree to store time series of key-value pairs, you can use tuples of <key, big_endian_timestamp> as your b+ tree keys and use memcmp() on the raw bytes of your keys.
For variable-length integers, using a UTF-8-like big-endian encoding allows you to keep the relationship between lexicographic and numeric sorting.
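As a rough illustration of the memcmp trick on b+ tree keys (put_be64 is just a hypothetical helper, not any particular library's API): store the integer part of the key most-significant-byte first, and byte-wise comparison of the raw key bytes gives numeric ordering.

    #include <stdint.h>
    #include <string.h>
    #include <assert.h>

    /* Hypothetical helper: serialize a 64-bit value big-endian so that
       memcmp() on the raw bytes orders keys numerically. */
    static void put_be64(uint8_t out[8], uint64_t v)
    {
        for (int i = 0; i < 8; i++)
            out[i] = (uint8_t)(v >> (56 - 8 * i));   /* most significant byte first */
    }

    int main(void)
    {
        uint8_t a[8], b[8];
        put_be64(a, 1585000000ULL);    /* earlier timestamp */
        put_be64(b, 1585000001ULL);    /* later timestamp   */
        assert(memcmp(a, b, 8) < 0);   /* byte order matches numeric order */
        return 0;
    }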
This is also a huge advantage of ISO 8601 over other date formats.
The mainframes that were byte-addressable (e.g. System/360) were generally big endian. I think this property led to the Motorola decision to use big endian on the 68000 and, because of its popularity in workstations, the RISC vendors followed suit. MIPS chose to make the endianness selectable by the hardware vendor, and that has become common, with little endian winning these days in ARM implementations.
> Really, it wasn't until the mid-80s and the RISC revolution, when all of a sudden people found themselves designing systems that would be 32 bits wide from the very first silicon, that the community "decided" that the only true byte order should be BE.
IBM 360-series mainframes were 32-bit and big-endian in the 1960s. (Their earlier computers may have been big-endian too, I'm not sure.)
> (Their earlier computers may have been big-endian too, I'm not sure.)
Yes, their choice of big endian probably came from their earlier unit record equipment, which used punched cards as input, storage, and output. Since punched cards were (sort of) directly human readable, it was natural to use big endian.
Also, making punch cards big-endian meant that BCD integers sorted the same way lexicographically and numerically, so there wasn't any special sorting mode necessary for numeric fields in punch card sorting machines.
I've always had it memorised that LE is the "logical" order and BE the "backwards" order, specifically for this reason: with LE, the bit at offset n has value 2^n, and the byte at offset n has value 256^n. With BE it depends on how wide the total value is, which introduces an awkward length-dependent term.
Practically all multi-precision arithmetic libraries store the segments of a bignum in LE order too, despite the fact that the individual segments may be BE on a BE machine. That is even more confusing.
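For example, a toy multi-precision add (a sketch with an arbitrary fixed limb count, not any real bignum library) naturally walks a little-endian limb array in increasing index order, carrying as it goes:

    #include <stdint.h>
    #include <stdio.h>

    #define LIMBS 4   /* arbitrary fixed size for the sketch */

    /* Limbs stored little-endian: limb[0] is least significant, so the
       carry simply ripples upward through increasing indices. */
    static void bignum_add(uint32_t r[LIMBS], const uint32_t a[LIMBS],
                           const uint32_t b[LIMBS])
    {
        uint64_t carry = 0;
        for (int i = 0; i < LIMBS; i++) {        /* least significant limb first */
            uint64_t t = (uint64_t)a[i] + b[i] + carry;
            r[i] = (uint32_t)t;
            carry = t >> 32;
        }
    }

    int main(void)
    {
        uint32_t a[LIMBS] = { 0xFFFFFFFF, 0, 0, 0 };   /* 2^32 - 1 */
        uint32_t b[LIMBS] = { 1, 0, 0, 0 };
        uint32_t r[LIMBS];
        bignum_add(r, a, b);
        printf("%08x %08x\n", (unsigned)r[1], (unsigned)r[0]);   /* 00000001 00000000 */
        return 0;
    }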
That misunderstands history. The earlier DEC machines (e.g. PDP-1/6/10/20) were all BE (as were Multics and mainframes of the time); the PDP-10 is why network ordering is BE.
And while I consider the PDP-6 to be the first RISC architecture, really that title goes to 1970s machines like the 801.
Little endian is natural memory ordering and was specified in von Neumann's EDVAC design doc during World War II, so the question isn't so much, "why little endian?" but rather, "why not?" Only reason computers were ever designed to use "traditional" ordering is because business mainframes loved BCD and they loved to code GUIs that just blit'd raw fixed memory onto the display driver.
As I mentioned elsewhere, the fact that big-endian positive integers (and IEEE-754 normalized floats/doubles) sort the same lexicographically and numerically is useful. The same goes for UTF-8-like big-endian encodings for variable-length integers and ISO 8601-formatted dates / times. It's handy to be able to use memcmp() on compound b+ tree keys.
The advantage works both ways: it's also nice if your normal integer vector instructions can be used to accelerate memcmp, without needing an extra opcode or instruction flag to specifically support parallel lexicographical comparison. Note that if the vector units only support signed integers, then you only need an exclusive-or to flip all of the sign bits in order to get lexicographical sorting, and that vector exclusive-or has important real-world use cases for hash functions and so-called add-rotate-xor encryption algorithms like ChaCha20.
Alternatively, if you're going to include a dedicated lexicographical comparison instruction for strcmp/memcmp (like x86 REP CMPS), a fast implementation will need less dedicated circuitry if your processor natively supports big-endian operations for other instructions.
Big-endian use in mainframes likely evolved from big-endian BCD numeric fields on punch card sorting/tabulating machines, where big-endianness was an advantage in that the normal lexicographic sorting would put constant-width numeric fields in numerical order.
SWAR techniques are orthogonal to endianness. Read Hacker's Delight; it mentions big endian only once, to explain how to convert away from it. Describing variable-length encoding as big-endian-like is quite a creative stretch. Xor does not flip two's complement sign correctly, except in a few special cases like YCbCr. Please read the EDVAC design doc, since von Neumann invented two's complement too. His ideas are the reason why programmers have never needed to care much about whether it's signed or unsigned, since the machine arithmetic will behave exactly the same way (with few exceptions).
> Describing variable-length encoding as big-endian-like is quite a creative stretch.
LEB128 is a little-endian variable-length encoding. That's what the LE stands for.
UTF-8 is a big-endian variable-length encoding. The most significant bits are packed in the earlier bytes, so that lexographic ordering comes out correctly.
I'm really alluding to something like LEB128-encoding uint64_ts, except big-endian and putting all of the continuation flags at the beginning of the encoding, so a single switch on the first byte gives you the encoded length and it also sorts correctly lexicographically.
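Roughly the idea, as a sketch rather than any standard format (the 2-byte limit is just to keep the example short): the leading bits of the first byte give the length, and the payload follows most-significant-bits first, so minimal encodings compare correctly with memcmp.

    #include <stdint.h>
    #include <string.h>
    #include <assert.h>

    /* Sketch of a UTF-8-like, big-endian varint (not a standard format):
       a leading 0 bit means one byte, a leading 10 means two bytes, with
       the payload stored most-significant-bits first. Handles values < 2^14. */
    static size_t encode_varint_be(uint8_t *out, uint16_t v)
    {
        if (v < 0x80) {                       /* 0xxxxxxx */
            out[0] = (uint8_t)v;
            return 1;
        }
        assert(v < 0x4000);                   /* 10xxxxxx xxxxxxxx */
        out[0] = (uint8_t)(0x80 | (v >> 8));
        out[1] = (uint8_t)(v & 0xFF);
        return 2;
    }

    int main(void)
    {
        uint8_t a[2], b[2];
        size_t la = encode_varint_be(a, 100);   /* one byte  */
        size_t lb = encode_varint_be(b, 300);   /* two bytes */
        size_t n = la < lb ? la : lb;
        assert(memcmp(a, b, n) < 0);            /* lexicographic order matches numeric order */
        return 0;
    }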
> Xor does not flip two's complement sign correctly
Re-read what I wrote. I wasn't talking about flipping the sign bit in order to negate the numeric value. I was talking about flipping the sign bit to get correct lexicographic text ordering on a machine that can only do signed comparisons.
> since the machine arithmetic will behave exactly the same way (with few exceptions).
I was talking about _exactly_ one of these exceptions... comparing two values. For simplicity, let's pretend we have a 16-bit BE CPU that can only perform signed comparisons, and we want to lexicographically compare two 4-byte arrays:
[ 41 4F 4F 4F ] vs. [ C1 81 4F 4F ]
we perform the first load to compare
0x414F (16719) vs. 0xC181 (-15999)
in order to get the correct lexicographical ordering in all cases, we need to invert the sense of the sign bit. Flipping the sign bit in these two cases gives
0xC14F (-16049) vs 0x4181 (16769).
In all cases, flipping the sign bit allows one to perform unsigned comparison on a CPU that can only perform signed comparisons. For a W-bit word, you end up subtracting 2^(W-1) from all values with the most significant bit unset and adding 2^(W-1) to all values with the most significant bit set. I understand this doesn't negate the values. That's not the point. The point is to perform unsigned comparison on a CPU that doesn't natively support it.
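A tiny C sketch of the same trick (16-bit values; the function name is just illustrative): XOR both operands with 0x8000, then a signed comparison yields the unsigned/lexicographic result.

    #include <stdint.h>
    #include <assert.h>

    /* Emulate an unsigned 16-bit comparison using only a signed compare,
       by flipping the sign bit of both operands. */
    static int unsigned_less(uint16_t a, uint16_t b)
    {
        int16_t sa = (int16_t)(a ^ 0x8000);
        int16_t sb = (int16_t)(b ^ 0x8000);
        return sa < sb;            /* signed compare, unsigned-order result */
    }

    int main(void)
    {
        assert(unsigned_less(0x414F, 0xC181));    /* 0x41... sorts before 0xC1... */
        assert(!unsigned_less(0xC181, 0x414F));
        return 0;
    }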
> Only reason computers were ever designed to use "traditional" ordering is because business mainframes loved BCD and they loved to code GUIs that just blit'd raw fixed memory onto the display driver.
Hah, so it's literally because we write numbers with most significant digit on the left, which is the opposite to logical ordering?
I presume the only advantage to BE today is compatibility then...
“We” being people with left-to-right languages. All rtl languages I know of write numbers the same as ltr languages, i.e. in the rtl case little endian.
As you have to read an entire positional number to be able to get even its magnitude; the only “advantage” I can see of LE over BE in natural languages is that at least rtl can get odd/even immediately. Then again in ltr you can get sign immediately. In real life human use neither “advantage” is compelling.
If you want your (vector or scalar) integer instructions to be useful for fast strcmp/memcmp implementations, then you'll want to support unsigned big-endian integers. Also, if you want to mix integers and strings in compound b+ tree keys and use memcmp to compare keys, you'll want to store your integers big-endian.
I don't believe you. That sounds more like bitreverse and Morton interleaving. If you're serious that it's actually big endian and that it has a real use case (outside human social traditions and legacy infrastructure maintenance), then I'd love to see convincing details.
As for strings, GCC/Clang are great at vectorizing normal byte-oriented code. It doesn't matter if it generates or you hand code VPCMPEQB vs. VPCMPEQQ either, since they appear to have the exact same latency. Modern compilers have also been moving in the direction of punishing folks who do things like `* (uint32_t * )p`, due to aliasing rules rather than endianness. Rob Pike had an amusing blog post on this subject: https://commandcenter.blogspot.com/2012/04/byte-order-fallac...
> I don't believe you. That sounds more like bitreverse and Morton interleaving. If you're serious that it's actually big endian and that it has a real use case (outside human social traditions and legacy infrastructure maintenance), then I'd love to see convincing details.
Back when I worked on Google's indexing system (2006-2010), one of the stages keyed documents by URL (with the host name in DNS order: com.google.www) followed by a big-endian timestamp. This put records for the same domain together, and put successive crawls for the same URL in chronological order. This improved compression ratios in our BigTable tablets, and finding the latest crawled version of <URL> was just a search for the largest key <= <URL><MAX_TIMESTAMP>.
> It doesn't matter if it generates or you hand code VPCMPEQB vs. VPCMPEQQ either, since they appear to have the exact same latency.
But with VPCMPEQB, you need to have 8 times as many conditional branches as VPCMPEQQ (in the naive case) in order to turn your vectorized comparison into a memcmp result. Note that VPCMPEQ* don't modify any flags, so you can't just JE/JNE after your VPCMPEQ*. Now, via some vector permutes, shifts, and bitwise-ors, you can cut down on the number of conditional branches. However, it's still more processing than if you can load / compare your vector registers in lexicographical order.
But Google doesn't use big endian CPUs in prod, so you've proven the point yourself. I agree that lexicographically arranging things can have an awesome impact on performance though. Still think that has zero to do with endianness.
> But with VPCMPEQB, you need to have 8 times as many conditional branches as VPCMPEQQ in order to turn your vectorized comparison into a memcmp result
Not at all. It doesn't require any branches, other than the loop itself. Here's the trick everyone uses, for the simple case of finding one character, e.g. NUL:
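Something like this C sketch with SSE2 intrinsics (a reconstruction of the idea, not the exact original snippet): pcmpeqb against zero, pmovmskb to collapse the result to a bitmask, then a bit scan.

    #include <emmintrin.h>   /* SSE2: pcmpeqb / pmovmskb as intrinsics */

    /* Find the offset of the first NUL byte: compare 16 bytes at a time
       against zero, collapse the comparison to a bitmask, and bit-scan it.
       No per-byte branches; only the per-chunk loop branch remains.
       (Real implementations align the pointer first so the 16-byte loads
       can't fault; that detail is omitted here.) */
    static long find_nul(const char *p)
    {
        for (long i = 0; ; i += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)(p + i));
            __m128i eq    = _mm_cmpeq_epi8(chunk, _mm_setzero_si128());
            int mask      = _mm_movemask_epi8(eq);   /* one bit per byte */
            if (mask)
                return i + __builtin_ctz(mask);      /* index of first NUL */
        }
    }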
> But Google doesn't use big endian CPUs in prod, so you've proven the point yourself.
The whole point is you just memcmp the whole key byte array instead of having to parse out the fields. If the timestamp is in little-endian order, then memcmp won't give you chronological ordering on the timestamp suffix.
In the case of pcmpeqb / pmovmskb / bsf, I was explicitly talking about memcmp, not finding the first same or differing byte. I don't want to get too bogged down in specific architectures vs. the more general pros/cons of byte orders themselves... but... there's pcmpeq and pcmpgt, but no pcmpne. That's fine, so to implement memcmp, you'd just xor with -1 before bsf to find the index of the first different byte and then explicitly compare the differing byte. So far so good, except that Intel explicitly designed AVX to scale up to 1024-bit vectors, at which point pmovmskb can't pack 128 bits into a GP register. Fine, so use pcmpeqw/pmovmskw instead of pcmpeqb/pmovmskb, but then once you have the offset of the first differing uint16_t, you can't just do a native-endian comparison on the two uint16_ts. If the x86 were big-endian, then bsf with 64-bit GP registers would scale all the way up to 4096-bit vector registers without requiring extra bit extractions or bswap instructions. It's fundamentally an advantage that a native-endian 64-bit subtraction performs an 8-byte lexicographical comparison on big-endian machines.
Granted, the advantage is tiny, but this whole rabbit hole discussion was in reply to "I presume the __only__ advantage to BE today is compatibility then..." (emphasis mine) in [https://news.ycombinator.com/item?id=22660000]
It's not some quirk. There are two common endiannesses, and the byte sort precedence of one of them matches the byte sort precedence we use for text. Since we represent characters as numbers, and for efficiency we have CPU instructions that treat aggregations of these numbers as larger numbers, we can get double duty out of these instructions/circuits if the order they use is the same as the lexicographic ordering of our text.
I'm not saying big-endian is universally better. I'm just saying there are real-world advantages.
Note there are more than two endiannesses, where the most common middle-endian variant is big-endian order of 16-bit words, but little-endian order within those 16-bit words.
As for linguistic quirks, there's no obvious universal connection between writing order for text, the order digits are written, and the way numbers are pronounced, so I'm not sure what you're getting at. Arabic is written right-to-left, but they still write the most significant digit on the left. However, I've read that Arabic writers line-break numbers (when forced) by placing the most significant digits on the upper line and the least significant digits on the lower line, so it's not a clear-cut case of Arabic writing numbers in little-endian order right-to-left. I'm not sure about Arabic number pronunciation. I do know that German pronunciation is middle-endian: 256 is "two hundred six-and-fifty", despite the digit writing order being big-endian in German. I met a German guy who incorrectly swaps digits if you talk to him in English while he's doing mental math... forcing him to process English somehow also swaps digit orders in his head.
People's "Downloads" folder being full of New folder", "New folder (1)", all of them being full of similarly named documents.
Also anyone not using proper version control (business folks, designers, scientists, students) usually has a bunch of files all over the place with every kind of inconsistent names for a bunch of versions.
I can use proper version control for code, but when it comes to office products, I use an ISO date to prefix my "versions" of files.
This is really about integrating proper version control within the workflow. I don't find it appropriate to blame users when they are left so poorly guided.
Don't get me wrong, an ISO date prefix is nothing close to perfect for version control, but it:
- works anywhere you are asked to name a sequence of things
- tacitly communicates the naming scheme well, so your colleagues can get it and possibly even start to imitate it
- gives a clear time reference
- isn't sensitive to the system timestamp, so it keeps the chronology even when files are touched or copied elsewhere with a different timestamp
- spontaneously generates a chronologically ordered list of versions when they are displayed
Once again, this is not my favorite way to organize things. But from a practical point of view, it often comes out as the least bad solution you can run seamlessly.
That analogy only works if “report”, “report_final”, and “report_final_FINAL” were all released, were insanely successful and made you a billionaire and then you didn’t want to lose the customers that liked those versions when you made some small edits.
Excellent explanation in the article. Also note that in x86-64 mode, the low 8 bits of all registers can be accessed, namely: AL, BL, CL, DL, SIL, DIL, SPL, BPL, R8B, R9B, R10B, R11B, R12B, R13B, R14B, R15B. Previously, there was no equivalent of SIL, DIL, SPL, and BPL. https://docs.microsoft.com/en-us/windows-hardware/drivers/de...
For some of the L ones it's possible to access the H part. So it's not just a byte (B); it's the low (L) byte, and there is a high (H, just the second byte of the word) you can also get to.
For the high 8 registers it isn’t like that. There is no way to get to the “high” byte.
I believe D for doubleword is mostly a Microsoft C/C++ thing, though I'm sure it appears in other places too.
When it comes to assembler syntax for 32-bit words, my guess would be that they are mostly indicated as l (for long), e.g. in AT&T x86 syntax and m68k assemblers.
movl (at&t's x86) or mov.l (m68k).
Edit: Ah.. yeah, Intel x86 asm syntax also uses it, so I guess that's where the idea of using D for 32-bit values originates.
Most people list the registers in alphabetic order, but numerically they are encoded with the B registers in the 4th place: EAX=0, ECX=1, EDX=2, EBX=3, ...
If you ever find yourself writing an AMD-64 assembler, it really feels like you're digging through archaeology with all of the weird quirks you need to implement. The SSE, AVX, and AVX-512 encodings add even more levels of, "why did they do that?!?" which don't make much sense except in the context of history.
but numerically they are encoded with the B registers in the 4th place
I've never seen an authoritative answer for that, but I believe it comes from the fact that in the 16-bit addressing modes, BX is the only one of the 4 ABCD registers that can be used to address memory, and the decoding circuitry is very slightly simpler when it has to detect two 1 bits rather than one 1 and one 0 bit. The 4 other registers are, in order, SP BP SI DI.
which don't make much sense except in the context of history
> B, C, D, E were completely generic and interchangeable.
I was under the impression that it went
Accumulator
Base
Counter
Data
I'm not sure about the D or E registers, but I am sure I remember using B as the base address register for arrays, and using C as the counter register for loops and such because the others couldn't be used that way.
Yes, in the Z80 BC was a sort of counter register (e.g. for LDIR or DJNZ instructions), which is perhaps why BC became CX. In the 8080 there wasn't much difference between BC and DE.
The interesting part is that BX maps to HL, which explains the weird order AX/CX/DX/BX in the encoding of 8086 instructions.
Alternatively the X is a very old assembly notation for "pair". 8-bit registers on 8080 processors could be paired together to work as a single register. Operations performed on these register pairs used the letter X.
You can look here at the original 8080 reference manual:
For example page 4 lists the name of instructions, INR is "increment register" while INX is "increment register pair", similar notation is used for several other instructions where the X is the register paired version of an instruction.
In the case of the AX register, it likely just refers to the pair of AH and AL registers.
At any rate, it's really interesting glancing over the various 8086 reference manuals. Gives me a deep sense of appreciation for how far things have come and how things have managed to build upon what are otherwise some very simple and fundamental building blocks.
I was wondering if it was going to cover RAX/amd64, and it does. Nothing terribly new here but it’s a nice dive into an interesting microcosm of intel architecture.
I do somewhat wish AMD managed to get R0-R7 as the standard, though :p oh well.
I love the x86 registers and their names and special roles.
On the one hand, it’s gross that x86 still has this legacy.
On the other hand, it’s a good thing that it’s possible to maintain compatibility so far back while still having such good perf. I find that aspect of modern x86 to be super impressive.
IMHO the story of the iAPX432 really is the whole industry in microcosm.
Intel hires the best of the best, the true cream of the crop to design them an ISA that is meant to crystallize everything that is known about ISA design into a single completely new design that discards all the broken crap of yesterday and will be futureproof for decades if not forever, and a chip to implement it.
Everything is going to be great forever, but it turns out it's a bit hard to get done, so they just have the B-team whip up something quick to sell in the meantime. This something was the 8086, which gets adopted into increasingly successful products, but no matter, the new shiny thing is going to displace all of that when it finally ships.
Then it does, and it actually has a hard time competing in performance with the much older stopgap product. It turns out that the team of superstars they hired was very theoretical, and built an ISA that was a dream to program against, but did not really understand what it took to build something that was going to be fast when implemented in hardware. (Also, the system was meant to be used mainly with high-level languages, and the compilers really were not there yet.) Being more expensive, much slower than the competition and with 0 market penetration, the iAPX 432 was dead on arrival in the market.
Luckily, the B-team had been busy working on another extension of the stop-gap product, the 80286, which was again a runaway success, and only partially because of backwards compatibility with the existing x86 ecosystem. It was also quite fast for its time.
I recall the iAPX432 came out years later than the 8086, was a 'capabilities machine', and took 500 memory cycles to execute "JMP .", i.e. jump to self, potentially the simplest possible operation.
So cool. Kinda disappointed at my hacker side of things because I never questioned why EAX register is actually named "EAX". Wondering what else I take for granted now :p
They wanted some sort of backwards compatibility at the assembly language level, the idea being that 8008 assembly code could be fed to an 8080 assembler. Not sure how well that worked out in practice, but that would be a motivation for A remaining a language-level feature that refers to the 8-bit part.
Aren't those AL and AH? When we rename A to AL, why do we need to permanently retire the term "A"? The L in AL stands for "low"; what does the A stand for?
The A stands for AL, in a snippet of 8008 assembly code that you're supposed to be able to use in the middle of 8080 assembly code, and which is written in a language that knows nothing about AL or AX. Or that was the idea.
Thanks. So it's not so much that "you still have the opcodes that operate on just 64/32/16/8 bits" as "ASCII assembly code for any CPU is expected to be source-compatible with ASCII assembly code for any later CPU"?
Is there any indication in the source demarcating the 8008 assembly from the 8080 assembly?
But was that ever supported in practice? I don't remember that being supported, and 8008 assembly code needed to be translated anyways by a tool, so that could have taken care of A -> AL.
You're missing my point. AL is the "L"ow bits of the "A" register. AH is the "H"igh bits of the "A" register. The whole thing, low plus high, is the "A" register. We can tell that by the names AL and AH.
If you’re visiting the Bay Area, be sure to visit the Computer History Museum in Mountain View. It’s the Mecca of computing history. Also, for early internet history, I’d recommend Katie Hafner’s book: Where Wizards Stay Up Late.
I did and I loved every moment of it :) Also in meatspace, enjoyed Bletchley Park in the UK.
Re: books, I liked Robert X. Cringely's Accidental Empires, but it only covers the PC industry up to '91 (or '93 IIRC for the 2nd edition) and isn't really technical.
"You might think—gee, seven is a very odd number of registers—and would be right! The registers were encoded as three bits of the instruction, so it allowed for eight combinations. The last one was for a pseudo-register called M. It stood for memory. M referred to the memory location pointed by the combination of registers H and L. H stood for high byte, while L stood for low byte of the memory address. That was the only available way to reference memory in [an] 8008."
Defining an architecture is easy! All you have to do is write up a document with instructions and their encodings, and boom! You’ve created a new architecture. Here’s one I made earlier this year for a class I was teaching, for example: https://github.com/regular-vm/specification
Assembly mnemonics and their cousin pin names are so excessively terse. I wish it was mandatory to expand acronyms in a dedicated field in datasheets. So often you'll find pin names like "NCE" where you're just expected to know a priori that it means "active-low Chip Enable" and it's so counterproductive.
Assembly mnemonics, I think, are terse to make it dirt simple for assemblers to read them; the assemblers themselves had to be hand-entered via hex until they became "self-hosting." This was definitely very needed when these CPUs were new in the 70's and such.
I've always liked how 6502 neatly has everything in 3 letter opcodes. But that's not scalable given modern CPU capabilities.
CS, DS and ES also have letters in alphabetical order, yet they are treated as acronyms. Wondering why is there no meaning (even if fabricated) attached to FS and GS?
Not really sure what your question means, but originally CS for "code segment", DS for "data segment" and SS for "stack segment" had distinct use cases. They weren't general purpose segment registers at all. ES was an "extra" data segment register. When FS and GS were added later the alphabetical ordering of CS, DS, ES, FS, GS was natural (with SS still being the odd one out).
Even in 32-bit mode, x86 uses segmented memory access and CS, DS and SS retain their roles as default segments for code, data and stack memory accesses. IIRC, all the segment registers are special because the CPU caches the corresponding segment table entry transparently in the background. The table entries are fairly complex.
That's how we got virtualization on x86. When AMD killed segmentation in AMD64, VMware et al cried out loud. AMD reintroduced segmentation after that as a stop-gap solution, but then we got VTx and SVM which are a much better solution anyways.
I wish they'd done a better job of re-implementing segmentation. I can't tell you how many times my programs have stopped thanks to a segmentation fault. /s
Precisely what you wrote: CS is "Code Segment", DS is "Data Segment", ES is "Extra Segment" (even here it feels a bit manufactured), but FS and GS lack any semi-reasonable expansion.
And the 80286 predated the 80386 as a 32-bit processor, so the 80286 was Intel's first 32-bit offering. It was still segmented, missing the paging hardware. I helped write an operating system for it. Short-lived.
The iAPX 432 project started in 1975, but took many years to deliver anything. The 8086 was indeed intended as a stop gap while the ‘432 was under development.
The 80286 by most definitions was a 16-bit CPU, having a 16-bit data bus, 16-bit general purpose registers and 16-bit segments. It did have a 24-bit address space though, up from the 20 bits of the 8086.
Oh of course! Thanks. The 24-bit address space changed operating system designs to use a LONG to store physical memory addresses, which I guess confused my memory of the whole thing.
It had 'real mode' which was the ordinary 8086 addressing mode (capable of 20-bit addressing to 1MB) and 'protected mode' which we called 'imaginary mode' since nobody used it.