This is how many registers the ISA exposes, but not the number of registers actually in the CPU. Typical CPUs have hundreds of registers. For example, Zen 4 's integer register file has 224 registers, and the FP/vector register file has 192 registers (per Wikipedia). This is useful to know because it can effect behavior. E.g. I've seen results where doing a register allocation pass with a large number of registers, followed by a pass with the number of registers exposed in the ISA, leads to better performance.
What you describe sounds counter-intuitive. And the paper you cite seems to suggest an ISA extension to increase the number architected (!) registers. That is something very different. It makes most sense in VLIW architectures, like the ones described in the paper. Architectures like x86 do hardware register renaming (or similar techniques, there are several) to be able to exploit as much instruction level parallelism as possible. That is why I find you claim hard to believe. VLIW architectures traditionally provide huge register sets and make less use of transparent register renaming etc, that part is either explicit in the ISA or completely left to the compiler. These are very different animals than our good old x86...
> Compiler controlled memory: There is a mechanism in the processor where frequently accessed memory locations can be as fast as registers. In Figure 2, if the address of u is the same as x, then the last load μ-op is a nop. The internal value in register r25 is forwarded to register r28, by a process called write-buffer feedforwarding. That is to say, provided the store is pending or the value to be stored is in the write-buffer, then loading form a memory location is as fast as accessing external registers.
I think it over-sells the benefit. Store forwarding is a thing, but it does not erase the cost of the load or store, at least certainly on the last ~20 years of chips and I don't think on the PII (the target of the paper) either.
The load and store still effectively occur in terms of port usage, so the usual throughput, etc, limits apply. There is a benefit in latency of a few cycles. Perhaps also the L1 cache access itself is omitted, which could help for bank conflicts, though on later uarches there were few to none of these so you're left with perhaps a small power benefit.