
So every function call will need to spill even more call-clobbered registers to the stack!

Like, I get that leaf functions with truly huge computational cores are a thing that would benefit from more ISA-visible registers, but... don't we have GPUs for that now? And TPUs? NPUs? Whatever those things are called?



With an increase in available registers, every value that a compiler newly chooses to keep in a register is a value that would previously have lived in the local stack frame anyway, so the extra saves and restores replace stack traffic rather than add to it.

It's up to the compiler to decide how many registers it needs to preserve at a call. It's also up to the compiler to decide which registers shall be the call-clobbered ones. "None" is a valid choice here, if you wish.


> It's up to the compiler to decide how many registers it needs to preserve at a call.

But the compiler is bound by the ABI, isn't it? (at least for externally visible entrance points / calls to sub routines external to the current compilation unit)


Someone decides how to define the ABI too, and that is a free choice. CPU register count doesn't constrain this particular system design question.


Most function calls are aggressively inlined by the compiler such that they are no longer "function calls". More registers will make that even more effective.


That depends on whether something like LTO is possible and whether the function is declared with one of the plethora of calling conventions. What it means is that new calling conventions will be needed, and that this new platform will be able to pass arguments in registers for higher-arity functions.


LTO is orthogonal to inlining functions.


ARM Aarch64, IBM POWER and most RISC CPUs have had 32 general-purpose registers for many decades, and this has always been an advantage for them, not a disadvantage.

So there already is plenty of experience with ABIs using 32 GPRs.

Even so, unnecessary spilling could be avoided, but that would require some changes in programmer habits. For instance, one could require an ordering of the compilation of source files, e.g. always compiling the dependencies of a program module before the module itself. Then the compiler could annotate the module interfaces with register usage and use these annotations when compiling a program that imports the modules. This would avoid the need to define a fixed ABI, which is suboptimal in most cases, saving either too many or too few registers.
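A toy sketch of that scheme (all module names and register sets here are hypothetical; a real compiler would emit such annotations into the module interface files):

```python
# Toy model of "compile dependencies first, annotate their register usage".
# Everything here is hypothetical illustration, not any real compiler's format.

# Compiling a dependency records which registers its functions clobber.
module_clobbers = {
    "vec_math.sum": {"s0", "s1"},                              # light clobber set
    "json_parse.parse": {"s0", "s1", "s2", "s3", "s4", "s5"},  # heavy one
}

def regs_to_save(live_in_regs, callee):
    """Registers the caller must actually spill around a call: only the
    intersection of what it keeps live and what the callee clobbers."""
    return live_in_regs & module_clobbers[callee]

live = {"s0", "s1", "s2", "s3", "s4", "s5"}
print(sorted(regs_to_save(live, "vec_math.sum")))      # ['s0', 's1']
print(sorted(regs_to_save(live, "json_parse.parse")))  # all six
```

With a rigid ABI the caller has to assume the worst case for every external call; the annotation lets it save two registers instead of six when calling the cheap function.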


Yeah, this approach has been proposed, in an even more aggressive form; see David W. Wall, "Global Register Allocation at Link Time", 1986 [0], specifically for machines with a large number of registers: the paper straight up says that it's much less of a problem if you a) have fewer registers, b) don't compile things separately. And this problem with large register files had been predicted long before; see the wonderfully named "How to Use 1000 Registers", 1979, by Richard L. Sites [1] (he would later go on to work on the VAX and design the Alpha ISA at DEC).

[0] https://dl.acm.org/doi/10.1145/12276.13338

[1] https://files01.core.ac.uk/download/pdf/9412584.pdf


Thanks for the links.

Register allocation at link time would work only for ISAs where all registers are equivalent or they are partitioned only in a small number of equivalence classes.

It cannot work for an ISA like x86-64, where register allocation and instruction selection are extremely interdependent (e.g. the 16 Intel-AMD general-purpose registers are partitioned into 11 equivalence classes, instead of at most 2 to 4 equivalence classes as in the majority of other ISAs, where only a stack pointer and perhaps a return link pointer and/or a null register have special behavior). With such an ISA, the compiler must know the exact register allocation before choosing instructions, otherwise the generated program can be much worse than an optimal program. Moreover, an ideal optimizing compiler for x86-64 might need to generate instructions for several alternative register allocations, then select the best allocation and generate the definitive instruction sequence.

With such a non-orthogonal ISA, like x86-64, for the best results you need to allocate registers based on the register usage of the invoked procedures, as assembly programmers typically do, instead of using a uniform sub-optimal ABI, like most compilers.


Why does having more registers lead to spilling? I would assume, probably incorrectly, that more registers means less spill. Are you talking about calls inside other calls, which cause the outer scope's arguments to be preemptively spilled so the inner scope's data can be placed in registers?


More registers leads to less spilling not more, unless the compiler is making some really bad choices.

Any easy way to see that is that the system with more registers can always use the same register allocation as the one with fewer, ignoring the extra registers, if that's profitable (i.e. it's not forced into using extra caller-saved registers if it doesn't want to).


Yes, the argument for increasing the number of GPRs is precisely to eliminate the register spilling that is necessary in x86-64 programs whenever a program has already used all available architectural registers, spilling that does not happen when the same program is compiled for Aarch64 or IBM POWER.


So, let's take a function with 40 alive temporaries at a point where it needs to call a helper function of, say, two arguments.

On a 16 register machine with 9 call-clobbered registers and 7 call-invariant ones (one of which is the stack pointer), we put 6 temporaries into call-invariant registers (so there are 6 spills in the prologue of this big function) and another 9 into the call-clobbered registers; 2 of those 9 are the helper function's arguments, but the 7 other temporaries have to be spilled to survive the call. And the remaining 25 temporaries live on the stack in the first place.

If we instead take a machine with 31 registers, 19 being call-clobbered and 12 call-invariant (one of which is the stack pointer), we can put 11 temporaries into call-invariant registers (so there are 11 spills in the prologue of this big function) and another 19 into the call-clobbered registers; 2 of those 19 are the helper function's arguments, so 17 other temporaries have to be spilled to survive the call. And the remaining 10 temporaries live on the stack in the first place.

So, there seems to be more spilling/reloading whether you count pre-emptive spills or the on-demand-at-the-call-site spills, at least to me.
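The arithmetic above is easy to check with a tiny model (a sketch of this hypothetical comparison; the caller-saved/callee-saved splits are the ones assumed in the comment, not any real ABI's):

```python
def spills(live, clobbered, saved_usable, nargs=2):
    """Distribute `live` temporaries over a register file: callee-saved
    registers first (each costs a save in the prologue), then caller-saved
    ones (those not holding the helper's arguments must be spilled around
    the call), and whatever is left lives on the stack from the start."""
    in_saved = min(saved_usable, live)
    in_clobbered = min(clobbered, live - in_saved)
    prologue_spills = in_saved
    call_site_spills = max(in_clobbered - nargs, 0)
    on_stack = live - in_saved - in_clobbered
    return prologue_spills, call_site_spills, on_stack

print(spills(40, 9, 6))    # 16-reg machine: (6, 7, 25)
print(spills(40, 19, 11))  # 31-reg machine: (11, 17, 10)
# Note: both triples sum to 38, i.e. 38 of the 40 values touch memory either way.
```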


You’re missing the fact that the compiler isn’t forced to fill every register in the first place. If it was less efficient to use more registers, the compiler simply wouldn’t use more registers.

The actual counter proof here would be that in either case, the temporaries have to end up on the stack at some point anyways, so you’d need to look at the total number of loads/stores in the proximity of the call site in general.


> You’re missing the fact that the compiler isn’t forced to fill every register in the first place.

Temporaries start their lives in registers (on RISCs, at least). So if you have 40 alive values, you can use the same one register to calculate them all and immediately save all 40 of them on the stack, or e.g. keep 15 of them in 15 registers, and use the 16th register to compute 25 other values and save those on the stack. But if you keep them in the call-invariant registers, those registers need to be saved at the function's prologue, and the call-clobbered registers need to be saved and restored around inner call sites. That's why academia has been playing with register windows, to get around this manual shuffling.

> The actual counter proof here would be that in either case, the temporaries have to end up on the stack at some point anyways, so you’d need to look at the total number of loads/stores in the proximity of the call site in general.

Would you be willing to work through that proof? There may very well be less total memory traffic for a machine with 31 registers than with 16; but it seems to me that there should be some local optimum for the number of registers (and their clobbered/invariant assignment) that minimizes stack traffic: four registers is way too few, but 192 (there have been CPUs like that!) is way too many.


32 registers have been used in many CPUs since the mid seventies until today, i.e. for half a century.

During all this time there has been a consensus that 32 registers are better than less registers.

A few CPUs, e.g. SPARC and Itanium, have tried to use more registers than that, and they have been considered unsuccessful.

There have been some inconclusive debates about whether 64 architectural registers might be better than 32, but the fact that 32 are better than 16 has been pretty much undisputed. So chances are that 32 is close to the optimal number of GPRs.

Using 32 registers is something new only for Intel-AMD, due to their legacy, but it is something that all their competition has used successfully for many decades.

I have written many assembly language programs for ARM, POWER or x86. Whenever I had 32 registers, it was much easier for me to avoid most memory accesses, whether for spilling or for other purposes. It is true that a compiler is dumber than a human programmer and frequently has to adhere to a rigid ABI, but even so, I expect that on average even a compiler will succeed in reducing register spilling when using 32 registers.


The game is deeper than that. Your model is probably about right for the compiler you're using. It shouldn't be - compilers can do better - but it's all a work in progress.

Small scale stuff is you don't usually spill around every call site. One of the calls is the special "return" branch, the other N can probably share some of the register shuffling overhead if you're careful with allocation.

Bigger is that the calling convention is not a constant. Leaf functions can get special-cased, but so can non-leaf ones. Change the mapping of arguments to fixed registers / stack slots; change which registers are callee/caller saved. The entry points for calls from outside the current module need to match the platform ABI you claimed to follow, but nothing else does.

The inlining theme hints at this. Basic blocks _are_ functions that are likely to have a short list of known call sites, each of which can have the calling convention chosen by the backend, which is what the live in/out of blocks is about. It's not inlining that makes any difference to regalloc, it's being more willing to change the calling convention on each function once you've named it "basic block".


Why is almost no one in this comment thread willing to face the scenario where the function call has to actually happen, and be an actual function call? The reactions are either "no-no-no-no, the call will be inlined, don't you worry your pretty head" or "well, then the compiler will just use fewer registers to make fewer spills" — which precisely agrees with my point that having more registers ain't necessarily all that useful.

> Small scale stuff is you don't usually spill around every call site.

Well duh: it's small, so even just 8 registers is likely enough for it. So again, why bother with cumbersome schemes to extend to 32 registers?

And this problem actually exists; that's why SPARC tried register windows, and even crazier schemes on the software side have been proposed, e.g. [0] — seriously, read it. And it's 30 years old, and IIUC nothing much came out of it, so excuse me if I'm somewhat skeptical about "compilers can do better - but it's all a work in progress" claims. Perhaps they already do as well as they can for general-purpose CPUs. Good thing we have other kinds of processing units readily available nowadays.

[0] David W. Wall, "Global Register Allocation at Link Time", 1986, https://dl.acm.org/doi/10.1145/12276.13338


This argument doesn't make sense to me. Generally speaking, having more registers does not result in more spilling, it results in less spilling. Obviously, if you have 100 registers here, there's no spilling at all. And think through what happens in your example on a 4 register machine or a 1 register machine: all values must spill. You can demonstrate the general principle yourself by limiting the number of registers and then increasing it using the -ffixed-reg flags. In CUDA you can set the register count and basically watch the number of spills go up by one every time you take away a register and go down by one every time you add one.
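The monotonicity is visible even in a toy linear-scan allocator (a sketch, far simpler than any real allocator; it just counts live ranges that can't get a register):

```python
def linear_scan_spills(intervals, num_regs):
    """Toy linear scan: intervals are (start, end) live ranges, processed
    in start order; any range that finds no free register is spilled."""
    active, spilled = [], 0
    for start, end in sorted(intervals):
        active = [e for e in active if e > start]  # expire dead ranges
        if len(active) < num_regs:
            active.append(end)                     # got a register
        else:
            spilled += 1                           # no register: spill
    return spilled

# 20 staggered, overlapping live ranges, each 10 units long.
ranges = [(i, i + 10) for i in range(20)]
counts = [linear_scan_spills(ranges, r) for r in range(1, 16)]
print(counts)  # non-increasing: spills never go up as registers are added
```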


> Obviously, if you have 100 registers here, there’s no spilling at all.

No, you still need to save/spill all the registers that you use: the call-invariant ones need to be saved at the beginning of the function, the call-clobbered ones at an inner call site. Only if your function is a leaf function can you get away with using only call-clobbered registers and not preserving them.


Okay, I see what you’re saying. I was assuming the compiler or programmer knows the call graph, and you’re assuming it’s a function call in the middle of a potentially large call stack with no knowledge of its surroundings. Your assumption is for sure safer and more common for a compiler compiling a function that’s not a leaf and not inlined.

So I can see why it might seem at first glance like having more registers would mean more spilling for a single function. But if your requirement is that you must save/spill all registers used, then isn’t the amount of spilling purely dependent on the function’s number of simultaneous live variables, and not on the number of hardware registers at all? If your machine has fewer general purpose registers than live state footprint in your function, then the amount of function-internal spill and/or remat must go up. You have to spill your own live state in order to compute other necessary live state during the course of the function. More hardware registers means less function-internal spill, but I think under your function call assumptions, the amount of spill has to be constant.

For sure this topic makes it clear why inlining is so important and heavily used. And once you start talking about inlining, having more registers available definitely reduces spill, and this happens often in practice, right? Leaf calls, inlined call stacks, and specialization are all things that more regs help with, so I would expect perf to get better with more registers.
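To make the constancy point concrete, here's a toy count (a sketch under the stated assumption that every register used must be saved anyway; it ignores rematerialization and repeated reloads):

```python
def memory_touches(live, regs):
    """Each value kept in a register costs one save (prologue or call
    site, per the assumption above); each value that doesn't fit lives
    in a stack slot. Either way, every live value hits memory once."""
    in_regs = min(live, regs)
    overflow = max(live - regs, 0)  # function-internal spill
    return in_regs + overflow

# More registers only shift traffic from function-internal spill to
# save/restore; the total stays pinned at the number of live values.
for regs in (4, 16, 31, 100):
    print(regs, memory_touches(40, regs))  # always 40
```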


Thanks for actually engaging with my argument.

> assuming it’s a function call in the middle of a potentially large call stack with no knowledge of its surroundings.

Most of the decision logic/business logic lives exactly in functions like this, so while I wouldn't claim that 90% of all of the code is like that... it's probably at least 50% or so.

> then isn’t the amount of spilling purely dependent on the function’s number of simultaneous live variables

Yes, and this ties precisely back to my argument: whether or not a larger number of GPRs "helps" depends on what kind of code is usually being executed. And most code, empirically, doesn't have all that many scalar variables alive simultaneously. And the code that does benefit from more registers (huge unrolled/interleaved computational loops with no function calls, or with calls only to intrinsics or inlinable thin wrappers of intrinsics) would benefit even more from using SIMD or, even better, being off-loaded to a GPU or the like.

I actually once designed a 256-register fantasy CPU but after playing with it for a while I realised that about 200 of its registers go completely unused, and that's with globals liberally pinned to registers. Which, I guess, explains why Knuth used some idiosyncratic windowing system for his MMIX.


It took me a minute, but yes I completely agree that whether more GPRs helps depends on the code & compiler, and that there’s plenty of code you can’t inline. Re: MMIX Yes! Theoretically it would help if the hardware could dynamically alias registers, and automatically handle spilling when the RF is full. I have heard such a thing physically exists and has been tried, but I don’t know which chip/arch it is (maybe AMD?) nor how well it works. I would bet that it can’t be super efficient with registers, and maybe the complexity doesn’t pay off in practice because it thwarts and undermines inlining.


I recalled there were some new instructions added that greatly help with this. Unfortunately I'm not finding any good _webpages_ that describe their operation generally enough to give me a good overview / refresher. Everything seems to either directly quote the published PDF documents or otherwise not present the information in a form that's effective for end use, e.g. https://www.felixcloutier.com/x86/ — However, availability is problematic on even slightly older silicon: https://en.wikipedia.org/wiki/X86-64

- XSAVE / XRSTOR

- XSAVEOPT / XRSTOR

- XSAVEC / XRSTOR

- XSAVES / XRSTORS


Eh, pretty much nobody uses them (outside of OS kernels?); and mind you, RISC-V with its 32 registers has nothing similar to those, which is why 14-instruction-long prologues (adjust sp, save ra and s0 through s11) and epilogues are not that uncommon there.
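For concreteness, the 14 instructions being counted are one stack adjustment plus stores of ra and the twelve callee-saved s-registers; a sketch (frame offsets here are illustrative, not mandated by the psABI):

```python
# Build the instruction strings of the full-width RV64 prologue described
# above: 13 slots of 8 bytes each, frame rounded up to 112 bytes for
# 16-byte stack alignment. Offsets are for illustration only.
prologue = ["addi sp, sp, -112", "sd ra, 104(sp)"] + [
    f"sd s{i}, {96 - 8 * i}(sp)" for i in range(12)  # s0 through s11
]
print(len(prologue))  # 14
```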


A good compiler will only do that if the register spilling is more efficient than using more stack variables, so I don't really see the problem.


You can't operate on stack variables without loading them into the registers first, not on RISCs anyway. My main point is that this memory-shuffling traffic is unavoidable in non-leaf functions, so an extremely large amount of available registers doesn't really help them.


op is probably referring to the push all/pop all approach.


No, I'm not. I use a common approach: spill definitely-reused call-invariant registers at the prologue, and spill call-clobbered registers that need to survive a call at precisely the call site. See the sibling comment for the arithmetic.


That’s the push all/pop all approach.


Big non-trivial functions generally profit from having more registers, and registers don't have to be saved and restored by functions that don't use them. Not every computationally intense application has just a few numerical kernels that can be pushed onto GPUs and fit the GPU uarch well. Just look at the difference between optimized compiler-generated ia32 and amd64 code to see how much difference more registers can make.

You could argue that 16 (integer) registers is a sweet spot, but you haven't stated a proper argument for it.


Most modern compilers for modern languages do an insane amount of inlining, so the problem you're mentioning isn't a big issue. And, basically, GPUs and TPUs handle branchy code poorly. CPUs handle it well.



