I would like to suggest that the classical RISC/CISC dichotomy is basically nonexistent nowadays, mainly because both sides have influenced each other. It is well known that CISC has taken a lot of inspiration from RISC designs (such as having many more registers in x64), and RISC designs have also taken some inspiration from CISC (such as having SIMD/vectorization units). In other words, the line between RISC and CISC has become very fine lately.
Also, at the end of the day, they all turn into μops. If I remember Jim Keller correctly, ARM and x86 are basically the same in the back end nowadays; it's just the front end and the decoding units that differ. That's why he strongly suggested that AMD also adapt the Zen design to the ARM ISA during his tenure there; I think it was called K12.
I think there is another blurry line between superscalar and VLIW architecture, too.
Yep. RISC was interesting when gate budgets for CPU pipelines were seriously limited. It was interesting because before RISC the industry had been merrily spending the gate budget increase on adding lots of use-specific instructions. The RISC people pointed out that if you removed support for all the fancy instructions you had enough gate budget for the ALU to be nicely pipelined, and then you could wind up the clock rate greatly and this was worth much more than the fancy instructions.
For decades now we've had enough gate budget to have nicely pipelined designs with complex instruction sets, so that's what everyone does. RISC solves a problem that no longer exists.
"The Intel 386, originally released as 80386 and later renamed i386, is a 32-bit microprocessor introduced in 1985..."
"Max. CPU clock rate: 12.5 MHz to 40 MHz"
As you can see, the 80386 was released a year earlier than the R2000 and from the start ran at about 1.5 times the clock rate of the MIPS implementation.
The critical path is usually in addition/subtraction, which should complete in one cycle in both the 80386 and the R2000. To pipeline addition you need a superpipelined CPU, one that has several stages for computation. Even the seemingly simple computation of condition codes can make the clock cycle 10% longer (SPARC vs MIPS) if your CPU is just simply pipelined.
BTW, some Pentiums computed 32-bit addition in two cycles, all in the name of higher clock frequencies.
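To make the two-cycle addition idea concrete, here is a minimal sketch in Python. It assumes the add is split into two 16-bit halves with the carry handed from the first step to the second; that split and the name `staggered_add32` are my own illustration, not a description of any particular Pentium's circuit.

```python
MASK16 = 0xFFFF


def staggered_add32(a: int, b: int) -> int:
    """Add two 32-bit values in two 16-bit steps, wrapping modulo 2**32."""
    # Step 1 (the "first cycle"): add the low halves and keep the carry out.
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16
    # Step 2 (the "second cycle"): add the high halves plus that carry.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    # Reassemble; any carry out of the high half is dropped (32-bit wraparound).
    return ((high & MASK16) << 16) | (low & MASK16)
```

Each step only has to propagate a carry across 16 bits, which is the kind of shortening of the critical path that would allow a shorter clock period, at the price of the full result arriving a step later.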
Interesting. An R2000 did run programs faster than an 80386, right? This was a few years before my time.
From a quick google now, it looks like the R2000 was about 3x better than the 80386 at Dhrystone MIPS/MHz. I guess an accurate comparison of how the R2000 and 80386 spent their gate budgets and what they got in return would involve a lot of detail.
I remember my compsci professor giving us the computer architecture course in about 1997, and he despaired at how all the clever RISC stuff in Patterson and Hennessy seemed irrelevant when Intel could just throw money at the implementation (and fab, I guess) and produce competitive chips despite their (allegedly) inferior architecture.
My point is that you cannot make a design much faster in terms of clock frequency just by pipelining. A pipeline unrolls the state machine and overlaps different executions of it. But the bottleneck, which is addition, is there in all designs, and you need additional effort to break it.
(Also, MIPS does have [i]nterlocked [p]ipeline [s]tages, the "IPS" in MIPS; I implemented it, so I know: an exception in execution has to inform the other stages about the failure.)
By 1997 Intel had already bought the Elbrus 2 design team, led by Pentkovski [1]. That Pentkovski guy made Elbrus 2 a superscalar CPU with a stack machine front end. That is, Elbrus 2 executed stack operations in a superscalar fashion. You can entertain yourself by figuring out how complex or simple that can be.
So at the time your professor complained about Intel's inferior architecture being faster, the implementation of that inferior architecture had a translation unit inside it to translate x86 opcodes into superscalar-ready uops.
I think the Wikipedia page [1] agrees with your main point.
I said pipelining allowed you to increase the clock rate, which isn't the best thing to say.
The wiki page says, "instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps (the eponymous "pipeline") performed by different processor units with different parts of instructions processed in parallel."
And, "This arrangement lets the CPU complete an instruction on each clock cycle. It is common for even-numbered stages to operate on one edge of the square-wave clock, while odd-numbered stages operate on the other edge. This allows more CPU throughput than a multicycle computer at a given clock rate, but may increase latency due to the added overhead of the pipelining process itself."
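A rough way to see the throughput-versus-latency point in that quote is to count cycles. This is only a sketch: it assumes an ideal pipeline with no stalls or hazards and one cycle per stage, and the function names are mine.

```python
def multicycle_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles when each instruction walks through all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once, then finish one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1
```

For 100 instructions and 5 stages that is 500 cycles versus 104. Each individual instruction still takes 5 cycles from start to finish, so the latency does not improve; only the completion rate does, and nothing here changes the length of a clock cycle.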
It's not apples-to-apples to compare raw clock rates between semiconductor processes; Intel's 386 was initially fabbed on a 1.5μ process and then shrunk to 1.0μ (Intel CHMOS III and IV), whereas the MIPS R2000 was 2.0μ, fabless, and relied on Sierra, Toshiba, then in 1987 LSI, IDT and other licensees [0][1].
Back in the 1980s/90s/2000s, Intel was consistently a process generation or two ahead of competitors. That was one of their main sources of advantage.
Just imagine if MIPS had been able to fab on Intel process.
And you also confirm that to get a higher clock frequency you need more than just pipelining.
Thank you.
I also think that the 1.5x difference in clock speeds cannot be attributed directly to the difference in node size (lambda): the ratio of lambdas at the introduction of the 80386 and the R2000, 2.0/1.5 ≈ 1.33, is noticeably less than the ratio of clocks, 12.5/8.5 ≈ 1.47.
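Spelling out that arithmetic, using only the figures quoted in this comment (2.0 and 1.5 for the feature sizes, 12.5 and 8.5 MHz for the clocks); the variable names are mine:

```python
# Feature-size ratio at introduction: R2000 (2.0) over 80386 (1.5).
node_ratio = 2.0 / 1.5

# Clock ratio at introduction: 80386 (12.5 MHz) over R2000 (8.5 MHz).
clock_ratio = 12.5 / 8.5

# How much of the clock gap is left if clock scaled in direct proportion to feature size.
unexplained = clock_ratio / node_ratio

print(f"node ratio  = {node_ratio:.2f}")
print(f"clock ratio = {clock_ratio:.2f}")
print(f"left over   = {unexplained:.2f}")
```

Under the (crude) assumption that clock rate scales linearly with feature size, roughly a 10% gap remains that the process difference alone would not account for.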
Is there something about RISC that still makes it better than CISC when it comes to per-watt performance? Seems like nobody has had any success making an x86 processor that's as power efficient as ARM or RISC.
> Is there something about RISC that still makes it better than CISC when it comes to per-watt performance?
Classic CISC was microcoded (for example, the IBM S/360 had a feature that let you load your own custom microcode for compatibility with inherited equipment, such as IBM 1401 machines or the IBM 7xxx series, or for other purposes), whereas RISC was pipelined from birth.
Second, as I understand it, many CISC machines existed as multi-chip boards or even as multiple boards, and so suffered large losses in the wiring, whereas RISC appeared in the 1990s as a single die from the start (with only external cache added as an additional IC), but I could be mistaken on this.
> nobody has had any success making an x86 processor that's as power efficient as ARM or RISC
Rumor has it that the Intel Atom (essentially a CMOS version of the first-generation Pentium) was very good in mobile devices, but ARM far exceeded it in software support for a huge number of power-saving features (a modern ARM SoC allows nearly any part of the chip to be turned off at any time, and the OS supports this), and because of that lack of software support, smartphones with Intel chips had poor battery life.
More or less official information said that Intel made a bad power conversion circuit, so the Atom consumed too much in the modes between deep sleep and full speed, but I don't believe it, as this is too obvious a mistake for a hardware developer.
Sometimes. Far from always. Some would have a complicated hardwired state machine. Some would have a complicated hardwired state machine and be pipelined. Some would have microcode and be pipelined (by flowing the microcode bits through the pipeline and of course dropping those that have already been used, so fewer and fewer microcode bits survive at each stage).
Well, I see you aren't brave enough to answer a simple question about the purpose of minicomputers, so I will.
When computers first appeared, they were big simply because technological limitations made small machines very expensive to use, so scale was used to make computation cheaper.
In the early 1970s, technology advanced to the stage where it became possible to make simplified versions of big computers for some limited tasks, though they were still too expensive for wide use.
A simple illustration: an IBM 3033 mainframe with 16 MB of RAM could serve 17,500 3270 terminals, while a PDP of the same era could serve a few tens (maybe 50, I don't know exactly), so mainframes, even though very expensive, gave a good cost per workplace.
A well-known example: a PDP was used to control a scientific nuclear reactor. The PDP was chosen not because it had the best MIPS/price ratio, but because it was the cheapest adequate machine for the task, and so affordable on a limited budget.
For a very long time, minis stayed in the niche of limited machines used to avoid much more expensive full-scale mainframes. They were used to control industrial automation (CNC), chemical plants, and other small things.
Once microcomputers (a CPU on one chip) appeared, first widely marketed in 1977, they began eating the minis' space from the bottom, while mainframes continued to become more cost-effective (more terminals with the appearance of cheap modems, etc.) and ate the minis' space from the top.
And in the 1990s, when 32-bit microprocessors and megabytes of RAM became affordable, minis disappeared, because their place was taken by micros.
To be honest, I don't know of anything we could not call a microcomputer now, as even IBM Z mainframes now have single-chip processors, and the largest supercomputers are practically clouds of SoCs (NUMA architecture).
And I must admit, I still see PDPs (or VAXes) at enterprises, where they still control old machines from the 1990s (they are very reliable, even if limited by modern standards, and they still work).
As I remember, the last symmetric multiprocessor supercomputer was the Cray Y-MP; later machines became ccNUMA, or just NUMA, or even clouds.
RISC/CISC is near the top of the list of "things emphasized in education that bear little relevance in practice". RISC isn't so much a single coherent design idea as a collection of ideas, some of which have won out (more register files), and some of which haven't (avoid instructions that take multiple clock cycles). The architectures from the days the "debate" was more relevant that have had the most success are the ones which most thoroughly blurred the lines between classical RISC and CISC--namely, Arm and x86.
> I think there is another blurry line between superscalar and VLIW architecture, too.
No, the line is pretty damn sharp. The core idea behind VLIW is that having hardware doing dynamic scheduling (as superscalar does) is silly and the compiler should be responsible for statically scheduling all instructions. The only blur here is that both VLIW and superscalar envision having multiple execution units that can be simultaneously scheduled with work, but who is responsible for doing that scheduling is pretty distinct.
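To illustrate what "the compiler statically schedules everything" means, here is a toy scheduler. It assumes every op has unit latency, every register is written only once (so only read-after-write dependencies matter), and any op can go in any slot; `Op` and `schedule_static` are made-up names, not any real compiler's API.

```python
from typing import NamedTuple


class Op(NamedTuple):
    dest: str               # register written (assumed to be written only once)
    srcs: tuple[str, ...]   # registers read


def schedule_static(ops: list[Op], width: int) -> list[list[Op]]:
    """Pack ops, in program order, into bundles of at most `width` ops each."""
    bundles: list[list[Op]] = []
    ready_at: dict[str, int] = {}  # register -> first bundle that may read it
    for op in ops:
        # Earliest bundle in which all sources are available (inputs are ready at 0).
        slot = max((ready_at.get(s, 0) for s in op.srcs), default=0)
        # Skip bundles that are already full.
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(op)
        ready_at[op.dest] = slot + 1  # unit latency: usable from the next bundle
    return bundles
```

With `ops = [Op("a", ("x", "y")), Op("b", ("x", "z")), Op("c", ("a", "b")), Op("d", ("c",))]` and `width=2`, this produces three bundles: `a` and `b` together, then `c`, then `d`. In a VLIW machine that grouping is fixed in the binary; a superscalar core would discover the same parallelism in hardware at run time from a plain sequential instruction stream.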
> classical taxonomy of RISC/CISC dichotomy is basically non-existent nowadays
After digesting the information about the IBM 360, I concluded that we have lost CISCs. One of the most important features of the 360 was customizable microcode, which you could load at system boot to get effectively different hardware (as with FPGA emulators of the Amiga). It was widely used to emulate old hardware, such as the IBM 1401 or the IBM 7xxx series. But I have not seen this feature in the 390 documentation, so it looks like their 360 emulation became just software (and given the semiconductor advances of the 1990s, switching to software emulation looks adequate).
I must admit that ARM marketed a customized-microcode feature for adding new instructions (there is a standardized place for them in the instruction set, named "custom coprocessor instructions", so if you have enough money, you could make a special ARM with your own additional instructions), but it is nothing compared to the 360.
Do you know what purposes (targets) minicomputers were made for, and why they were limited?
I don't know if a reduced instruction set can "take inspiration" from a big one, it just becomes a non reduced one.
Also these examples don't feel right to me: x64 has the same number of registers as CISCs traditionally did (e.g. m68k, z/Architecture, VAX). 32-bit x86 was just an exceptionally register-starved CISC. And SIMD postdates the RISC vs CISC divide by a long time; both schools of architecture got SIMD around the same time.
But the divide has become less relevant because originally instruction set affected chip area a lot, and there were big gains to be had by the quantitative approach of benchmarking compiled apps with different proposed instruction sets and seeing what runs fastest when transistors are spent on hot instructions vs execution engine resources. Nowadays we have more transistors than we know what to do with, and just put in lots of cores that end up sitting idle because of diminishing returns trying to speed up cores with more transistors.
I agree that the line is pretty thin, but would draw it as: fixed-width versus variable-width. I think the M? line of Apple CPUs, with extremely-wide parallel decode, has been a game changer. The performance per watt is really off the charts. That's partly due to integrated RAM and all, but mostly due to microarchitectural changes, which I believe to be a massive step function in superscalar bandwidth (wider decode, huge ROB, huge numbers of ports). It seems like the power-hungry decode stage has been tamed, and I think this is because of fixed-width instructions in arm.
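A small sketch of why fixed-width decode is easier to widen. The "variable-width ISA" here is invented for illustration: I assume the low two bits of an instruction's first byte encode its length (1 to 4 bytes). Real x86 length decoding is far messier, and real decoders use tricks (predecode bits, speculative decoding at every byte offset) that this ignores.

```python
from typing import Callable


def fixed_width_starts(code: bytes, width: int = 4) -> list[int]:
    """Every instruction start is known up front, independent of the bytes themselves."""
    return list(range(0, len(code), width))


def toy_length(code: bytes, pc: int) -> int:
    """Hypothetical encoding: low two bits of the first byte give length - 1."""
    return (code[pc] & 0x3) + 1


def variable_width_starts(code: bytes,
                          length_of: Callable[[bytes, int], int] = toy_length) -> list[int]:
    """Each start depends on the previous instruction's length, so the walk is serial."""
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        pc += length_of(code, pc)
    return starts
```

In the fixed-width case, each offset could be handed to its own decoder at once. In the variable-width case, the position of instruction N is unknown until instruction N-1 has been at least partially decoded, and breaking that chain is what costs extra hardware and power.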
> RISC designs have also taken some inspiration from CISC (such as having SIMD/vectorization units)
I think that one went the other way: for example, the PlayStation 2 used a MIPS chip with 128-bit SIMD instructions (the TMPR 5900, the so-called “Emotion Engine”) alongside a dedicated GPU.
With all of the sanctions China is really starting to push RISC forward, challenge ARM, and they are starting to find success. That "No" has absolutely changed to a "Maybe".
A 20,000 gate minimal RISC-V RV32E controller CPU isn't going to use μops. In 2024, RISC-V has turned that 2015 No into an unqualified Yes, even if the microarchitecture of more complex OOO RISC-V systems resembles the microarchitectures of similarly complex x86 and ARM CPUs.
To the extent that RISC-V is successful it's due to openness/freeness; RISC has nothing to do with it. An open "CISC-V" community would have been just as successful (and people wouldn't gripe about instruction fusion).
I have made a few of those RISC-V CPUs. The minute the "M" instruction set shows up (with division), a macro/micro-op split becomes worth it if you want to minimize gate count or maximize speed.
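For anyone wondering what that split looks like for division, here is a software sketch: one "macro" divide expressed as a fixed sequence of identical shift-and-subtract steps, each of which a small core could issue as a micro-op through the ordinary ALU. It uses the textbook restoring-division algorithm and is not taken from any particular core; `divu_steps` is a made-up name.

```python
def divu_steps(dividend: int, divisor: int, bits: int = 32) -> tuple[int, int]:
    """Unsigned divide as `bits` shift-subtract steps; returns (quotient, remainder)."""
    quotient = 0
    remainder = 0
    for i in range(bits - 1, -1, -1):
        # One micro-step: shift in the next dividend bit, most significant first...
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        # ...then conditionally subtract and record a quotient bit.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

`divu_steps(100, 7)` gives `(14, 2)`. With a zero divisor the loop yields an all-ones quotient and the dividend as remainder, without any special case. The point is that the datapath only needs a shifter, a subtractor and a comparator that already exist; the sequencing is what the micro-op layer adds.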