Intel compares the performance to a Nios II/e (it's 4x faster!), but at 6 cycles/instruction, no multiplier, one bit per clock shifter (IOW: no barrel shifter), the II/e is about the worst little CPU you can imagine.
It will be interesting to see how it compares against the VexRiscv, an open source RISC-V CPU that's quite popular in the FPGA world, with excellent performance (at ~1 Dhrystone MIPS/MHz it's roughly double that of the Nios V) for very few resources.
An incomprehensible decision is to make the Nios V only support uncompressed 32-bit instructions. This is a CPU that will live inside an FPGA and that will often be paired with a single FPGA block RAM to store instructions and data. By not including 16-bit compressed instructions, they'll add some 30% bloat to the code.
A major negative of Intel's offering is that the Nios V is only available for the Pro version of Intel's Quartus toolchain.
The biggest benefit of the Nios V is that its debugger logic will seamlessly integrate with Intel FPGA's virtual JTAG infrastructure. This means that, as long as you stay within Intel's development ecosystem, embedded SW debugging should be a breeze. With other CPUs, there are additional hoops to make that work. (I wrote about that here: https://tomverbeure.github.io/2021/07/18/VexRiscv-OpenOCD-an...)
The Nios II is quite old. When the II/e was designed you basically had the PicoBlaze as competition. In comparison, Nios II/e SW can be compiled using gcc and has fewer constraints on code size.
The point of these very small CPU cores is basically to be a flexible state machine that can be updated (including fixing bugs) in the control path. In a big FPGA design you can have tens or hundreds of them inside larger cores. So it's the worst in terms of performance, but far from the worst in terms of functionality, power consumption, and flexibility in comparison with complex FSMs.
Today you could use a number of RISC-V implementations (the SERV for example is really tiny: https://github.com/olofk/serv) or other soft CPU cores.
But the Nios was really great when it came out, especially the Nios II, which removed the hassle of having to use different tool chains for the compact and large versions. You could easily extend the ISA and connect co-processors, and it had a clean interface to the rest of the design. And as long as you used Altera FPGAs, it was free as in beer.
In large systems such as 3G and LTE base stations there can be thousands of Nios II cores. They have been (and are) used in automotive and provide hard real-time control.
Yeah, this was weird: A major negative of Intel's offering is that the Nios V is only available for the Pro version of Intel's Quartus toolchain.
I mean, people can import the VexRiscv Verilog and get a faster and better core for "free", so where does Intel think the value add is here? It is quite curious.
Now it would be interesting if they made hard-logic versions to compete with say the Cortex-M series from ARM or something but this seems like it isn't in the cards.
If the middling performance and lack of compressed instructions aren't an issue, then the automatic BSP generation, the Platform Designer integration, and the debugger ease of use can be pretty strong arguments. The Nios II/e was good enough for many slow control tasks.
I'm not quite certain what's the use case for most soft CPUs on FPGAs. My experience is from the other side (Xilinx) and it's more than half a decade old; however, looking around, it doesn't seem to me that things have changed that much.
The two useful CPU designs I used were the PicoBlaze, a bespoke microcontroller with a weird architecture designed to take advantage of FPGA implementation details, which was ridiculously tiny -- like a RAM block plus a few slices' worth.
The other one was the physical ARM Cortex-A9 core on the Zynq FPGA.
MicroBlaze, Xilinx's own big soft-CPU core was a combination of slow and large, with a buggy compiler to boot.
> I'm not quite certain what's the use case for most soft CPUs on FPGAs.
They are used for state machines and control algorithms that are complex enough that you don't want to implement them in HDL. High-speed interface (eg, DDR4, high-speed ADC) initialization and training is a common example.
> Intel compares the performance to a Nios II/e (it's 4x faster!), but at 6 cycles/instruction, no multiplier, one bit per clock shifter (IOW: no barrel shifter), the II/e is about the worst little CPU you can imagine.
I think the memory system in that example has a 3 cycle memory access latency, and the CPU waits for one instruction to arrive before requesting the next one. I'm almost surprised that they manage 0.464!
The Coremark score doesn't make any sense indeed. Where did you find these numbers?
> I think the memory system in that example has a 3 cycle memory access latency, and the CPU waits for one instruction to arrive before requesting the next one. I'm almost surprised that they manage 0.464!
That seems unbelievably large for the V/m. A VexRiscv using 352 Cyclone V ALMs gives 10% better Dhrystone than the V/m, and a configuration using 1764 ALMs gives you 1.2 DMIPS/MHz, I and D caches, branch prediction, supervisor mode, and an MMU -- full Linux kernel capability.
C extension saves code size, but the C decoder uses LUTs so on a small program you can lose overall by implementing C.
If you assume that you're trading LUTs for LUTRAM at 4 bits per LUT then each instruction that uses C saves 16 bits or 4 LUTs. If half the instructions in the program can use C then you can save about 2 LUTs per instruction by implementing C.
If it takes 200 LUTs to implement C (that might be a little low?) then your program code needs to have 100 instructions before you even start to win.
I'd imagine a lot of places where you're replacing something that could almost be a state machine would use smaller programs than that.
Some popular small RISC-V cores don't implement C, and I think all the others give you the choice.
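The break-even argument above can be sketched numerically. All the inputs here (4 bits per LUT, 200 LUTs for the C decoder, 50% of instructions compressible) are the assumed figures from the comment, not measurements:

```python
# Rough break-even estimate for implementing the RVC (compressed) extension
# on a tiny FPGA soft core. Every constant below is an assumption taken
# from the back-of-the-envelope argument above, not a measured value.

BITS_PER_LUT = 4            # trading LUTs for LUTRAM at 4 bits per LUT
BITS_SAVED_PER_C_INSN = 16  # a compressed instruction is 16 bits shorter
C_FRACTION = 0.5            # assume half the instructions can be compressed
DECODER_COST_LUTS = 200     # assumed LUT cost of the C decoder logic

# Average LUTs of program storage saved per instruction in the program
luts_saved_per_insn = C_FRACTION * BITS_SAVED_PER_C_INSN / BITS_PER_LUT

# Program size (in instructions) at which the decoder pays for itself
break_even = DECODER_COST_LUTS / luts_saved_per_insn

print(luts_saved_per_insn)  # 2.0 LUTs saved per instruction
print(break_even)           # 100.0 instructions before C starts to win
```

Below roughly 100 instructions (under these assumptions), leaving C out is the smaller design.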
I'm not familiar with Intel's FPGAs but Google suggests "M9K" is the unit of block RAM across at least some of their range. That's 8192 bits or 1 KB. The RV32I [1] register set is 128 bytes. A lot of things wouldn't need any additional state, leaving 896 bytes for code -- 224 instructions without C, maybe 300 with C. If you're under 224 instructions then there's no point implementing C.
[1] you can save 64 bytes of registers by using RV32E. Tests using Embench show, IIRC, up to a 30% code expansion by using RV32E due to register spills and reloads. So even with 1 KB total space for registers+program+RAM that might be a net loss. Or your code might not expand at all. RV32E does save interrupt latency / thread switch time. You're probably not doing either of those.
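The M9K capacity estimate above works out as follows (a quick sketch; the 3-bytes-per-instruction average with C assumes roughly half the instructions compress):

```python
# Back-of-the-envelope capacity of one M9K block RAM (8192 bits = 1 KB)
# shared between the register file and program storage, per the estimate
# above. The "with C" figure assumes an average of 3 bytes/instruction.

M9K_BYTES = 8192 // 8    # 1024 bytes in one M9K block
RV32I_REGS = 32 * 4      # 32 x 32-bit registers = 128 bytes
RV32E_REGS = 16 * 4      # RV32E halves that: 64 bytes

code_bytes = M9K_BYTES - RV32I_REGS        # 896 bytes left for code

print(code_bytes // 4)                     # 224 uncompressed instructions
print(code_bytes // 3)                     # ~298, i.e. "maybe 300 with C"
print((M9K_BYTES - RV32E_REGS) // 4)       # 240 with RV32E, no C
```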
In semi-related news: Initial RISC-V KVM (think fast virtual machines) support for Linux has been pulled by the KVM Maintainer and queued for inclusion in the 5.16 release.
Does anybody have informed opinions about the future of RISC-V? The pace of development in the space seems frantic (though early) and the IP situation seems like it's ripe to eventually become a major force in processor ISAs.
What I’ve seen is a product being designed with both ARM and RISC-V cores. The RISC-V cores outperform the equivalent ARM cores and have no licensing costs. They’re used for management cores not usually programmed by the customer, or for specialized low-power and/or low-latency processing. The ARM cores are used for TrustZone, FPU and DSP functionality, and for the extensive IDE, debugging and trace support.
I can see RISC-V gradually taking over, since it now has a foot in the door. But it’ll take a long time and depends on third party development.
I'm optimistic, but not particularly bothered as long as x86 goes the way of the dodo; people have argued that ARM could be the better ISA for the processor in your laptop, for example.
One thing that we will have to see about is whether the extension model they have now actually holds in the future, e.g. will we see a high-performance fork of RISC-V, or simply (say) a Chinese company that doesn't particularly care, using parts of the instruction encoding space for their own purposes.
Yes. By design (1), RISC-V removes much of what makes OoO harder. Most instructions have very regular and clean semantics, communicating only through registers, with a straightforward [up to] two register operands and at most one destination. The memory model by default is the same as Arm's, a fairly weak model. Basically, code that works on Arm will port over trivially (vector and special functions excluded).
(1) Unfortunately compressed instructions were included in the Unix profile which causes a lot of pain and makes it mostly impossible to do any partial predecode at I$ fill time. Ironically Arm64 wised up and removed the Thumb variants. (Preempting the code density crowd: there are other ways to deal with this).
EDIT: Pain includes dealing with instructions that cross cacheline or even page boundaries (hello, double fault), but worse, not being able to tell what might be an instruction in the I$. It was a sad day when RISC-V, with essentially no input from higher-end concerns, decided to force it on us.
That only applies if you want to run binary distros and applications such as Fedora or Ubuntu. If you're running a supercomputer and don't want to implement the C extension then you can build your own kernel and programs. If you do that then anything from RV32IA and up can be used.
You can also use a QEMU or Valgrind-like thing to JIT the C away.
Or you can do what modern x86 does and annotate your cache lines the first time they are actually decoded instead of at cache fill time. That's just a little slower for cold code but the same for hot code. Having the C extension keeps 40% more of your code hot for any given cache size.
There might be some implementation point at which Aarch64 wins by having bigger but fixed length code, but it's not at the low end and I don't think it's at the very high end either.
Two instruction lengths makes it slightly harder than one length, but we're not talking x86 15 lengths here.
More compact code takes pressure off icache size and refill rates, and transfer rates from icache to fetch buffer. And ROM size in embedded, such as on FPGAs.
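The "two lengths instead of one" point is concrete: per the RISC-V spec, the low two bits of the first 16-bit parcel tell the fetch unit the instruction length. A minimal sketch (the example encodings are the standard ones for `c.addi sp, -16` and `addi sp, sp, -16`):

```python
# Sketch of how a RISC-V fetch unit distinguishes the two standard
# instruction lengths: if the low two bits of the first 16-bit parcel
# are 0b11 the instruction is 32 bits, otherwise it's a 16-bit
# compressed (C extension) instruction.

def insn_length(first_parcel: int) -> int:
    """Return instruction length in bytes from the first 16-bit parcel."""
    return 4 if (first_parcel & 0b11) == 0b11 else 2

# c.addi sp, -16 encodes as 0x1141 -> low bits 0b01 -> 16-bit
print(insn_length(0x1141))              # 2
# addi sp, sp, -16 encodes as 0xFF010113 -> low bits 0b11 -> 32-bit
print(insn_length(0xFF010113 & 0xFFFF)) # 4
```

Compare that single 2-bit test with x86, where length is only known after decoding prefixes, opcode, ModRM, SIB, displacement and immediate fields.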
In general it has a bright future. It is not nearly as revolutionary as, for example, the Mill CPU by Out Of the Box Computing, but it does have some refinements from all the lessons learned over the decades of ISA design.
As more designs get produced, there will be more and more pressure on the chip designers to stick to the published specifications, instead of going their own way. Why? Because everyone will want to just use the existing toolchains (GCC, LLVM, etc.) and not have to support proprietary extensions. It helps that the specs themselves are easily available.
The implementations that exist now are not at the leading edge of performance, but they are more than good enough for many applications. This will improve over time.
One of the best features of the ISA is the design for extensions... meaning that extensions were planned from the very beginning. Unlike other architectures, where often the ISA is designed to be "complete", and adding extensions inevitably becomes difficult, expensive (cost or run-time performance, or both), or complex (which slows adoption).
> eventually become a major force in processor ISAs.
In what market segment?
Will RISC-V win substantial market share in embedded / microcontroller usage? Very likely. It has sort of started happening already.
Will RISC-V win a substantial share of the current x86-64 and ARMv8 segments? Highly unlikely. I have seen zero arguments or propositions that even make sense for this to happen, other than people wanting it, or certain countries wanting it for whatever ideological reasons.
I suspect that building a CPU from scratch, without a ton of legacy, may allow for greater efficiency, both per watt and in absolute speed. This is how RISC-V designs can percolate from embedded world to laptop and even datacenter world, much like ARM did before it.
Except it's not really from scratch. While we are starting to see A64-only cores, A64 was obviously designed to run on the same cores that also run A32/T32 code.
There are A32 features that A64 carries over which no one else in the 35 years of RISC ISA design since 1985 has seen fit to copy -- the optional shift/rotate of one ALU operand being the obvious one. Would they really have done that if it was a clean sheet design? If it's so great, why hasn't anyone else done it?
Also no other clean sheet ISA designed for high performance since 1990 (it's a short list: DEC Alpha, Intel Itanium, RISC-V) has included condition codes. Even POWER acknowledges that a single condition code register is a bad idea and has eight of them instead (the others just use integer registers to hold long-lived conditions).
I think people are too optimistic; even if it ever manages ARM adoption scale, nothing prevents OEMs from shipping their RISC-V implementations with proprietary extensions.
But that's not a problem at all.
It still guarantees that a program compiled for RISC-V will run on this CPU.
Moreover, they have no motivation to introduce too many non-standard/proprietary extensions, as that would require them to add support for these new instructions in mainstream compilers, while with standard extensions they can take advantage of the work already done.
Today, a major difficulty a newcomer to the industry faces is that they have 2 options:
1 - Taking an existing ISA and paying a license fee, but benefiting from the work already done on the compiler side.
or
2 - Building their own new ISA for free, but having to do all the work on the compiler side by themselves from scratch.
With RISC-V this problem does not exist anymore which opens the door to newcomers.
Hardware and Software have very different dynamics, therefore the comparison with POSIX is very weak.
There is no way today to build competitive open-source hardware, and RISC-V is not about providing open-source hardware. With RISC-V (or any other ISA) we remain mainly dependent on the manufacturers, who implement whatever they want within your processor. I think you are expecting too much from RISC-V; it's not about winning a war against manufacturers, it's about providing a royalty-free, high-performance standard ISA which makes it easier for newcomers to enter the market.
RISC-V is built around the idea that manufacturers can make their own non-standard extensions and benefit from existing work.
My opinion is not informed, but from my point of view the opposite is true. It's been 10 years since the first specs and silicon, according to https://riscv.org/about/history/. And it seems that one has to go way out of their way to find RISC-V systems in the wild.
It's ten years since the idea occurred to start a new design.
The first silicon with the final ISA design (or at least RV32IMAC) was the FE310 which shipped on the HiFive1 in December 2016, less than five years ago.
It takes up to two years to go from RTL working in verilator or on an FPGA to manufactured chips on boards in shops.
RISC-V is all over the place in deep embedded. Samsung announced that their 2020 high end phones have RISC-V cores controlling the camera and also the 5G radio. Qualcomm is using RISC-V in their 5G. The very popular ESP32 series of WIFI/BT chips is switching to RISC-V, with the last three (?) models being either partially or fully RISC-V instead of Xtensa.
The last several months have seen announcements of startups who have chip designers from Intel, AMD, and Apple's M1 team and who are now working on *lake/Zen/M1-class RISC-V designs. Those will of course take maybe three or four years to appear on the market, but there is zero reason to think they won't be technically successful.
Just like Nios II, it's a processor. It doesn't have to be "hard" silicon. Calling it a "little FPGA program" seems very dismissive. I don't know what you mean by "1990 style".
They said "5 stage", which sounds to me like no out-of-order fancy stuff of the sort we've been used to.
> Calling it a "little FPGA program" seems very dismissive.
Well, I wrote a little 5 stage CPU FPGA program once (In Haskell compiled to Verilog :), but that's another story). It wasn't very hard.
I haven't made a production IC, but I'm told that's much harder. It would be awesome if Intel made a RISC-V chip, even a slow one just good for Arduino-type toys, but that's not what happened here.
> They said "5 stage", which sounds to me like no out-of-order fancy stuff of the sort we've been used to.
That's inefficient for an FPGA softcore; wires are too expensive, CAMs are straight-up awful, and memory latencies relative to the core clock frequency aren't large enough to justify OoO stuff in the normal case.
> Well, I wrote a little 5 stage CPU FPGA program once (In Haskell compiled to Verilog :), but that's another story). It wasn't very hard.
Writing a 5 stage is easy.
Writing a 5 stage with no bugs is much harder.
Writing a 5 stage that talks industry standard busses and provides Debug/JTAG Support at high frequency and small gate counts is, well, an actual job.
Not saying you can't do better, but it's not a trivial effort.
> They said "5 stage", which sounds to me like no out-of-order fancy stuff of the sort we've been used to.
There's a lot of processors out there. The vast, vast majority being shipped are in order and a handful of stages.
> Well, I wrote a little 5 stage CPU FPGA program once (In Haskell compiled to Verilog :), but that's another story). It wasn't very hard.
> I haven't made a production IC, but I'm told that's much harder. It would be awesome if Intel made a RISC-V chip, even a slow one just good for Arduino-type toys, but that's not what happened here.
The FPGA vendor provided soft cores have about the same amount of engineering rigor as a hard core. They have enough customers, and the designs enough reach, that bugs carry the same financial implications.
> The FPGA vendor provided soft cores have about the same amount of engineering rigor as a hard core.
I'm not sure what you mean by engineering rigor, but there is indeed plenty of engineering to get from a soft core to a hard core, let alone a fabricated, working chip: synthesis, layout, place and route, timing closure, pads, PLL/DLL, clock tree insertion, BIST, thermal, packaging, etc...
High-end soft cores do most of that, particularly when you consider that most of what isn't automated these days applies equally to soft cores that aren't just Verilog or VHDL. Floor planning FPGA cores these days can be nutty compared to the recent past (the far past, interestingly enough, was similar, before you had higher-level tools). And stuff like packaging and DLLs/PLLs is equally applicable to hard cores. That's outside of what they care about on nearly every core design I've seen (albeit I'm only in my mid 30s).
Intel has a terrible track record with maker-ish stuff. They routinely launch products, sell them for a year, then discontinue them. (Intel Euclid vision SBC, Intel Edison x86 microcontroller, the RealSense stuff, etc.)
You can't build an ecosystem around a product if you kill it after a few months. Arduino is only Arduino because they actually stuck around.
Well, lately anyway. The 386EX [1] was supported from 1994 to 2006 with many interesting and "cheap" third-party SBCs, replaced in 2007 by the Tolapai SoC line, which was killed the following year. It's been all churn since then, it seems.
You can still buy tons of 386EX chips and boards on eBay and even a couple first-hand (JK Flashlight) and there's a few modern homebrew designs out there.
I think that level of sophistication is typical of soft cores. If you need a high performance CPU you don’t use a soft core. You use a soft core when part of your FPGA project can be more easily expressed as a CPU, but still needs tight coupling with the rest of the FPGA.
This is meant to compete with Cortex-M1 and MicroBlaze, not with Cortex-A78 or Core i9 or something.
You don't have that fancy stuff in really low cost, embedded systems, or as part of the control path in a large, complex FPGA or ASIC design. Look at ARM Cortex M0+, M3, M4 or numerous RISC-V cores for microcontroller applications.
You get the core you need that meets the processing budget with as little cost, area/resources needed and with as little power consumption. That was true then and it is just as true today.
I mean, it's a soft core like Nios I through IV. I know Altera would sell you a license if you wanted to dump it into a hard flow for some reason. I assume Intel will too? The Altera division seems fairly independent post acquisition.
Also Intel's not intimidated by the gate count niche of classic five stage RISCs; they gave that market up decades ago.
Oh, you're totally right. It was almost ten years ago that I used a Nios II, so I read the state space wrong without bothering to look it up. Lol, classic mistake on the internet.
A processor design in an HDL like Verilog or VHDL is not really a "program". It is a textual representation of a physical hardware configuration. Seeing it as a program is usually the root of problems faced by amateur HDL designers.
So true. The code describes real hardware: wires, gates, registers, memory macros. And those are finite resources. The FPGA tools allocate resources and map them to specific instances on the FPGA chip. In an ASIC there is more freedom, but chip area, the number of metal layers available to route signals, delay, and power set constraints. Understanding that writing something in SystemVerilog or VHDL actually allocates physical, electrical objects is very important.
It is not bit patterns stored in a memory and interpreted, executed by something as a sequence of instructions. Routing congestion for example doesn't exist as a concept in SW.
It is not a program. It is a hardware design tuned to be efficiently mapped onto the primitives/resources in the specific FPGA fabric of the target devices. I'm fairly sure the actual source code is far from little. But the number of resources needed is quite low.
Given a clock and access to memory it will execute programs.
Could you elaborate on the "1990 style"? What do you mean?
Is it that it is a moderately pipelined, single-issue, RISC-like (explicit load/store, fixed-width instructions, fairly large register file) CPU design? By that measure, all RISC-V, ARM, etc. cores targeted at low-cost embedded and IoT systems are 1990 style.
This is out of my realm, but isn't it at least a little relevant to vet a (logical) design with an FPGA before doing the full process? Or are those things worlds apart?
They contain hundreds to many thousands of FPGAs. Using supporting dev tools, you can take a large ASIC design and partition it over these FPGAs.
We once used four interconnected machines like these to emulate a large superscalar, OOO CPU design. It took a few days, but could go from release of reset to loaded OS. Something impossible with a SW simulation. Fun times.
Little soft core CPUs like this are often incorporated into larger FPGA designs to handle sub-tasks that would otherwise be annoying to implement with pure logic. Think something like a simple CLI console to configure low level board functionality, etc.
In general yes. ASIC designs are prototyped using FPGAs. You often can't run at full clock speed. But can do orders more test cycles than in a simulator.
But FPGAs are also used as final target technology. When volume is too low, time to market too tight, where flexibility is important, FPGAs are used. Building an ASIC takes 12-15-24 months (depending on design complexity, target process), and a respin (due to an error) of parts of the design adds months before the product can be released.
Base stations for 3G/LTE/5G are a good example. They used to be built with a combination of ASICs and DSPs. But this took too long to develop, and was too rigid to handle upgrades (soft upgrade from 3G to LTE was/is a great selling point). Using FPGAs reduced development time and handled generational upgrades in a much better way. Don't replace boards, just SW and FPGA bitstreams.
You could make a physical chip out of this, but there are lot of O(1) costs that mean I am not just going to download Quartus, call up my local fab, and get me a toy CPU from this for fun.
It's a pretty smart decision for them. They save who knows how many millions/billions in supporting a very proprietary ISA and the resulting proprietary software stack.
I have no idea how you arrived at a millions/billions number.
The Nios II family has existed for years, with very little ongoing development. FPGA soft cores don't go above the ~1.5 Dhrystone MIPS per MHz because the increased complexity to go above the 5-stage pipeline doesn't play well with FPGA-style logic. The Nios II/f already occupied that space.
Nios II is more than hardware. How do you run code on it?
They have to maintain all the compiler stuff, all the libraries, all the related OS kernel and library changes.
The core also needs updates to keep it current with newer hardware interfaces.
With just a team of 50 people to do all these things, that's a minimum of 10-20 million per year. That's not including the initial development team being much larger. Over a decade or two, hundreds of millions or more is not unlikely.
Like I wrote, the Nios II is old. There’s barely any development going on. The compiler has a few bug fixes here and there.
The libraries are all in C, CPU agnostic, and primarily there to support various Intel Platform Designer modules (you can easily verify that by looking at the BSP). None of that goes away when switching to RISC-V.
I have no idea what you mean with “newer hardware interfaces”, especially in the context of reducing cost: when using RISC-V, you’ll have these newer hardware interfaces (whatever you mean by it) just the same.
It’s not about how much money it costs to maintain the Nios infrastructure, its about how much they’ll save by adding(!) an extra CPU to support.
Somewhat aside: most FPGA IP development at Intel happens in Malaysia, so don’t assume US salaries… But if you estimate it at $10M-$50M, shall we agree that writing “billions” was more than a bit over the top?
Intel will save a tiny bit on not having to maintain a custom GNU toolchain. That’s about it.